PROCEEDING
IDEAS2025
SARAH BENZIANE
KARIMA BELMABROUK
DEKHICI LATIFA
M’HAMED BILAL ABIDINE
Oran Algeria
National Conference on
Innovation of Data Engineering
and Artificial Intelligence
11-12 June 2025
University of Science and
Technology of Oran-
Mohamed Boudiaf
IDEAS 2025 Conference Committees
CHAIRS AND EXECUTIVE COMMITTEES
HONORARY CHAIRS
Prof. Ahmed Hamou, Rector of the University
Prof. Bachir Djebbar, Dean of the Faculty
CONFERENCE CHAIR
Sarah Benziane, USTO MB University
CO-CHAIR
Karima Belmabrouk, USTO MB University
STEERING COMMITTEE
Asmaa Boughrara, USTO MB
Asmaa Ourdighi, USTO MB
Karima Belmabrouk, USTO MB
Latifa Dekhici, USTO MB
Redouane Tlemsani, USTO MB
Sarah Benziane, USTO MB
Souad Ougouti, USTO MB
PUBLICATION COMMITTEE
Redouane Tlemsani, USTO MB
Sarah Benziane, USTO MB
POSTER COMMITTEE
Asmaa Boughrara
Asmaa Ourdighi
Souad Ougouti
ORGANIZATION COMMITTEE
Amina Medjahed, USTO MB
Asmaa Boughrara, USTO MB
Asmaa Ourdighi, USTO MB
Bouchiba Guellta, USTO MB
Chemseddine Choucha, USTO MB
Imad Eddine Khiloun, USTO MB
Latifa Dekhici, USTO MB
Mohamed Khaldi, USTO MB
Redouane Tlemsani, USTO MB
Salim Alachaher, USTO MB
Sarah Benziane, USTO MB
Souad Ougouti, USTO MB
PROGRAM COMMITTEE
Abdelatif Hassini, University of Oran 2
Abdelatif Rahmoun, ESI SBA
Abdelfettah ZEGHOUDI, University of Laghouat
Abdelghani Djebbari, University of Tlemcen
Abdelkader Benyettou, University of Relizane
Abdelkrim Souahlia, University of Djelfa
Abdelmadjid Allali, University of Chlef
Abderrahmane Bendahmane, USTO MB
Ahmed Roumanie, ENSTTIC Oran
Amel Djebbar, ENSE Oran
Amine Dahane, University of Oran 1
Asmaa Boughrara, USTO MB
Asmaa Ourdighi, USTO MB
Badra Khellat Kihel, University of Oran 2
Bouabdellah Kechar, University of Oran 1
Boudjelal Meftah, University of Mascara
Dounia Yedjour, USTO MB
Fatiha Guerroudji, USTO MB
Hachem Slimani, University of Bejaia
Hadria Fizazi, USTO MB
Hafid Haffaf, University of Oran 1
Hafidha Bouziane, USTO MB
Hamza Bousbaa, ENPO
Hayat Bendoukha, USTO MB
Hayat Yedjour, USTO MB
Hicham Reguieg, USTO MB
Karima Belmabrouk, USTO MB
Karima Kies, USTO MB
Khadidja Belbachir, USTO MB
Khaled Belkadi, USTO MB
Latifa Dekhici, USTO MB
Lila Medebber, USTO MB
Mahmoud Zennaki, USTO MB
Malika ALLALI, URERMS ADRAR
M’hamed Abidine, USTHB
Mohamed Dahmani, USTO MB
Mohammed Amine Chikh, University of Tlemcen
Mostefa Benhaliliba, USTO MB
Mourtada Benazouz, University of Tlemcen
Naciima Mellal, University of Oum Bouaghi
Rachida Ghoul Hadiby, USTO MB
Reda Adjoudj, University of Sidi Bel Abbes
Redouane Tlemsani, USTO MB
Saber Harzallah, University of Batna
Samir Benbekriti, ENSTTIC Oran
Sarah Benziane, USTO MB
Sarah Maroc, USTO MB
Sarah Nait Bahloul, USTO MB
Selma Khouri, ESI Alger
Souad Ougouti, USTO MB
Souad khellat, USTO MB
Soufiane Boukli Hacene, University of Sidi Bel Abbes
Yasmina Hernane, USTO MB
FOREWORD
We are thrilled and proud to present the proceedings of the First National Conference on Innovation
in Data Engineering and AI Science (IDEAS 2025), held in the vibrant city of Oran, Algeria, and
hosted by the Université des Sciences et de la Technologie d’Oran Mohamed Boudiaf (USTO-
MB).
IDEAS 2025 was designed as a premier forum to bring together researchers, academics, students,
and industry professionals from across the country to explore innovations and address the latest
challenges in data engineering and AI science. The fast pace of technological advancement in these
domains calls for collaboration, critical discussion, and robust idea exchange; we believe this
conference achieved all these aims.
We were impressed by both the number and the quality of submissions we received. After a
thorough review by the Program Committee, the papers accepted for inclusion in these proceedings
represent the very best original research in these innovative fields. The accepted papers span
contemporary themes such as Big Data Analytics, Machine Learning and Deep Learning
Applications, Data Quality Management, Intelligent Systems Architectures, Ethical AI, and more.
Together, they highlight ongoing research, future directions, and the significant impact these fields
are poised to make.
The successful organization of IDEAS 2025 would not have been possible without the support and
engagement of many individuals and institutions. We extend our deepest gratitude to our
Honorary Chairs, Prof. Ahmed Hamou (Rector) and Prof. Bachir Djebbar (Dean), for their
continuous support and institution-wide sponsorship. A special thank-you goes to the dedicated
members of the Program Committee and the outstanding Organizing Committee, whose
commitment ensured a smooth and engaging event. I would also like personally to thank Co-Chair
Karima Belmabrouk and all members of the Steering and Publication Committees for their
collective effort and dedication in upholding a high standard of academic rigor for these
proceedings.
We are confident that these proceedings will serve as a valuable resource for researchers and
practitioners alike, spur new research questions, and foster collaborative partnerships. We look
forward to seeing the impact of this work and are hopeful that IDEAS 2025 marks the beginning
of a continuing series of enriching scientific gatherings.
We thank all the authors for their excellent contributions and all participants for their enthusiastic
attendance.
Sarah Benziane,
Karima Belmabrouk
Conference Chair, IDEAS 2025
USTO MB University, Oran, Algeria
IDEAS: National Conference of Innovation on Data Engineering and AI Science.
June 11-12, 2025, Oran, Algeria.
An MFU-based Approach for Data Quality
Management in NoSQL Document-oriented
Databases
Aicha Aggoune#*1
1aggoune.aicha@univ-guelma.dz
#Computer science Department, University of 8th May 1945
Guelma, Algeria
*LabSTIC Laboratory, University of 8th May 1945
Guelma, Algeria
Abstract: NoSQL databases offer a flexible schema for
representing data, unlike the rigid schema of relational
databases. However, this flexibility can introduce challenges in
data quality. In this paper, we examine NoSQL document-
oriented databases, especially MongoDB, and propose an
approach based on the Most Frequently Used (MFU) method to
detect and repair data quality issues by leveraging the most
frequently used elements. The proposed approach deals with
three data quality issues: schema overlap, missing data, and data
redundancy. Experimental evaluations on real-world MongoDB
datasets demonstrate that our MFU-based method effectively
enhances data consistency and completeness while reducing
redundancy. This work provides a practical framework for
improving data quality in NoSQL environments, ensuring more
reliable data for analytical and operational tasks.
Keywords: MongoDB, NoSQL document-oriented database,
Data quality, MFU.
I. INTRODUCTION
Data quality is a major challenge for organizations,
impacting both data analysis performance and financial
planning. High-quality data enables companies to enhance
operational efficiency, improve customer satisfaction, and
maintain a competitive edge by swiftly adapting their business
strategies. Joseph M. Juran [1] defines data quality as follows:
“data to be of high quality if they are fit for their intended uses
in operations, decision-making and planning”. Larry P.
English [2], an early leader in data quality management,
defined data quality as follows:
- The ability to consistently fulfill the expectations of
knowledge workers and end customers across all quality
aspects of information products and services necessary
for achieving organizational goals.
- The extent to which data reliably meets the needs and
expectations of knowledge workers who depend on it
for their tasks.
In addition, Wang and Strong [16] define a "data quality
dimension" as a set of data quality attributes that represent a
single aspect or construct of data quality. They present four
categories of dimensions:
- Intrinsic DQ denotes that data have quality in their
own right. It includes accuracy, believability,
objectivity, and reputation,
- Contextual DQ highlights the requirement that data
quality must be considered within the context of the
task at hand; that is, data must be relevant, timely,
complete, and appropriate in terms of amount so as to
add value,
- Representational DQ includes aspects related to the
format of the data (concise and consistent
representation) and meaning of data (interpretability
and ease of understanding), and
- Accessibility consists of accessibility and access
security.
In the context of NoSQL databases, the flexibility of
schemas means that data can be stored without a predefined
structure, allowing dynamic and varying formats within the
same database [10]. This enables developers to modify
document structures on the fly and handle diverse data types
without strict constraints. However, it also introduces
challenges in data consistency, quality management, and
query optimization. Our primary focus is on data quality
management in MongoDB, one of the most widely used
NoSQL databases, where data is stored in collections of JSON
documents. We examine three key data quality issues in
MongoDB: schema overlap, which impacts the concise
representation and representational consistency of document-
oriented NoSQL data; completeness, which relates to the
challenge of missing values; and conciseness, which addresses
data redundancy.
To tackle these challenges, we propose an approach based
on the Most Frequently Used (MFU) method. This approach
identifies the most commonly used schema elements and data
values to detect inconsistencies and enhance data quality. By
leveraging MFU analysis, our proposal improves schema
standardization, reduces redundancy, and ensures data
completeness.
Furthermore, we implement and evaluate our approach
using real-world datasets to demonstrate its effectiveness in
improving data quality in NoSQL document-oriented
databases.
The remainder of this paper is structured as follows:
Section 2 reviews related work on data quality management in
NoSQL databases. Section 3 presents our proposed approach,
detailing the MFU-based approach for detecting and resolving
data quality issues. Section 4 discusses the experimental
results, including the implementation of our data quality
management tool for MongoDB and key findings from our
experiments. Finally, Section 5 concludes the paper by
summarizing key insights and outlining potential future
research directions.
II. RELATED WORK
Many studies and frameworks have been developed for
managing data quality in relational databases and linked open
data [3], [4]. These areas have been extensively researched
due to the structured nature of relational data and the growing
need for semantic web technologies. However, while research
on NoSQL data quality is less extensive compared to
relational databases, it has been gaining attention in recent
years.
In the context of NoSQL columnar databases, several
approaches have been proposed. For example, Frozza et al.
[13] suggested replicating the hierarchical structure into a
new namespace and analyzing each column to infer data
types. In contrast, Bouhamoum et al. [14] explored schema
discovery and entity linkage in RDF data sources through
clustering techniques, operating without predefined schema
information.
In our work, we shift the focus to document-oriented
databases such as MongoDB, which are widely used but
present challenges related to maintaining data consistency and
integrity [15], [17].
Störl et al. [5] present a middleware called Darwin, which
enables the extraction of a NoSQL schema description, the
discovery of schema version history, and the proposal of
mappings between these versions. The databases supported by
Darwin include MongoDB and CouchDB. Darwin consists of
four main components: Schema Extraction Manager, Schema
Evolution Manager, Data Migration Manager, and Query
Rewriting Manager. Schema quality control is based on
detecting various operations applied to the schema, such as
adding a new attribute, renaming an attribute, and other
modifications.
An extension of Darwin [6] focuses on schema evolution,
which relies on five key operations: add, delete, rename, copy,
and move. These operations serve as the foundation for
maintaining data quality control.
Cristalli et al. [7] proposed a data quality control
framework for MongoDB databases, based on data quality
dimensions such as consistency, completeness, and
conciseness. Their framework relies on predefined quality
rules for each dimension and includes a process for data
quality verification. This approach can help organizations
enhance data quality and improve the reliability of analyses
based on MongoDB data.
Conrad et al. [8] propose a framework for evaluating
NoSQL systems with evolving schemas. It consists of two
main components: json-data-generator, which generates a
JSON file for evaluation, and EvoBench runner, which
assesses the performance of schema evolution in document-
oriented NoSQL databases like MongoDB and column-
oriented databases like Cassandra.
Möller et al. [11] addressed the challenge of data quality
management in NoSQL databases during their evolution. As
NoSQL collections may contain datasets in multiple versions,
they often become heterogeneous over time. To handle this,
the authors proposed four heterogeneity classes based on
schema evolution operations, ranging from highly structured
datasets with 1:1 cardinalities to unstructured datasets with
arbitrary cardinalities. Data quality is assessed across three
key dimensions: completeness, actuality, and consistency.
Asaad et al. [12] applied ER-defined quality frameworks to
NoSQL on a sample of diverse NoSQL schemas, with both
industrial and academic participants. A decision tree is
utilized to describe the heuristics of data model assessment,
and an analysis is performed to identify inter-annotator
disagreement, quality criterion importance, and quality trade-
offs.
Despite these advancements, research on NoSQL data
quality remains relatively underdeveloped compared to that of
relational databases. Existing studies primarily focus on
managing data quality during schema evolution, often without
providing detailed methodologies or extensive experimental
evaluations. Moreover, there is still a need for more
comprehensive solutions capable of addressing dynamic
schema changes, large-scale data validation, and real-time
quality monitoring.
III. PROPOSED MFU-BASED APPROACH FOR DATA QUALITY
MANAGEMENT
We present a document-oriented data quality management
approach based on the Most Frequently Used (MFU)
elements.
We define document-oriented data quality based on three
key dimensions: representational consistency, concise
representation, and completeness. These dimensions influence
other quality aspects, such as accuracy, conciseness,
coherence, and ease of understanding.
Our approach addresses three important data quality
dimensions:
- Representational DQ: The correct schema is identified by
analysing the most frequently used schema across
documents within a collection. Documents with
missing attributes required by the selected schema
are updated by adding these attributes with null.
- Contextual DQ, especially completeness: After
resolving schema overlap and selecting the correct
schema, the second issue is data incompleteness.
Our approach provides multiple strategies
for missing values:
1. Mean imputation: replacing missing values with
the average.
2. Mode imputation: replacing missing values with
the most frequently used value.
3. Removal of documents containing missing
data.
- Concise representation: Redundancy occurs when the frequency
of a data entry exceeds one. The correction involves
removing duplicate documents from the collection.
A. Data quality detection
Data quality detection identifies data issues. We
distinguish three types of data issues:
- Schema overlap: In NoSQL databases, the order and
number of fields (or attributes) are not enforced, which
can lead to poor data quality and create challenges in
data manipulation and query interpretation. The MFU
method examines documents within a collection to
identify schema inconsistencies. If different schemas
are detected, it displays each schema's frequency.
- Data incompleteness: The MFU method analyzes the
list of documents in the collection, identifies those with
missing attributes based on the most frequently used
schema, and proceeds with the correction process.
- Data duplication: The MFU method analyzes the
documents in the collection and detects redundancy
based on the frequency of document occurrences
(excluding the unique document ID). It then displays
the duplicate documents. If duplicates are found, MFU
proceeds with the correction process.
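The first two detection steps can be illustrated with a minimal sketch. It assumes that documents are available as plain Python dictionaries (for example, after being read from a collection), that a schema is the set of top-level field names, and that a value of None counts as missing. The function names and sample documents are hypothetical and are not taken from the tool described in this paper.

```python
from collections import Counter

def schema_of(doc, id_field="_id"):
    """Return the schema as an order-independent set of top-level field names."""
    return frozenset(k for k in doc if k != id_field)

def schema_frequencies(docs):
    """Count how many documents share each schema."""
    return Counter(schema_of(d) for d in docs)

def most_frequent_schema(docs):
    """Select the most frequently used schema of the collection."""
    return schema_frequencies(docs).most_common(1)[0][0]

def missing_attributes(docs, schema):
    """Map a document position to the schema attributes it lacks (absent or None)."""
    report = {}
    for i, d in enumerate(docs):
        missing = sorted(a for a in schema if d.get(a) is None)
        if missing:
            report[i] = missing
    return report

# Hypothetical sample documents; field order differs but the schema is the same.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10, "deaths": 1},
    {"_id": 2, "deaths": 0, "country": "France", "confirmed": 20},
    {"_id": 3, "country": "Italy", "confirmed": 30},
    {"_id": 4, "country": "Spain", "confirmed": None, "deaths": 2},
]
pivot = most_frequent_schema(docs)
report = missing_attributes(docs, pivot)
```

Because schemas are compared as sets, two documents that list the same fields in a different order count as the same schema. Nested documents are not inspected in this sketch.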
B. Data repairing
Data repair involves applying corrective actions to data
after identifying quality issues.
Schema Overlap: Repairing schema overlap uses the results of
the overlap detection process, which
generates a set of (schema, frequency) pairs. The schema with
the highest frequency is selected as the correct schema. We then
apply this pivot schema to the database as follows:
- Analyze the collection and apply the selected schema to
each document.
- If an attribute exists in a document, retrieve its value
and insert it into the new collection.
- If an attribute is missing, replace it with 'Null'.
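A minimal sketch of this repair step is given below. It assumes in-memory dictionaries, represents 'Null' as Python's None, and assumes that attributes outside the pivot schema are not copied into the new collection, which the description above implies but does not state. The function name and sample data are hypothetical.

```python
def apply_pivot_schema(docs, pivot, id_field="_id"):
    """Build a new collection whose documents carry exactly the pivot attributes."""
    repaired = []
    for d in docs:
        # Keep the identifier when the document has one.
        new_doc = {id_field: d[id_field]} if id_field in d else {}
        for attr in sorted(pivot):
            # Existing value is preserved; a missing attribute becomes None (Null).
            new_doc[attr] = d.get(attr)
        repaired.append(new_doc)
    return repaired

# Hypothetical sample documents.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10},
    {"_id": 2, "country": "France"},
    {"_id": 3, "country": "Italy", "confirmed": 30, "note": "extra"},
]
pivot = {"country", "confirmed"}
repaired = apply_pivot_schema(docs, pivot)
```

The original documents are left untouched, so the repaired collection can be reviewed before it replaces the source collection.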
Data incompleteness: Several approaches can
address data incompleteness:
- Removing documents with missing values: This
approach is suitable when the percentage of missing
attributes in a document exceeds a predefined threshold,
ensuring minimal information loss while maintaining
data quality.
- Imputation of missing values: Missing values can be
replaced with a statistical measure such as the mean or
mode. To minimize bias, imputed values should closely
reflect the actual data distribution.
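The two repair options can be sketched as follows, under the assumptions that documents are plain dictionaries, that None marks a missing value, and that the removal threshold is a fraction of the schema attributes. The default threshold of 0.5 and all names are illustrative choices, not values reported in this paper.

```python
from statistics import mean, mode

def drop_incomplete(docs, schema, threshold=0.5):
    """Keep documents whose share of missing schema attributes is at most threshold."""
    kept = []
    for d in docs:
        missing = sum(1 for a in schema if d.get(a) is None)
        if missing / len(schema) <= threshold:
            kept.append(d)
    return kept

def impute(docs, attr, strategy="mean"):
    """Return copies of docs with missing values of attr replaced by the mean or mode."""
    observed = [d[attr] for d in docs if d.get(attr) is not None]
    if not observed:
        return [dict(d) for d in docs]  # nothing to learn a fill value from
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [{**d, attr: fill} if d.get(attr) is None else dict(d) for d in docs]

# Hypothetical sample documents.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10},
    {"_id": 2, "country": "Algeria", "confirmed": None},
    {"_id": 3, "country": None, "confirmed": 30},
    {"_id": 4, "country": None, "confirmed": None},
]
schema = ["country", "confirmed"]
kept = drop_incomplete(docs, schema, threshold=0.5)
by_mean = impute(kept, "confirmed", strategy="mean")
by_mode = impute(kept, "country", strategy="mode")
```

Mean imputation only makes sense for numeric attributes, whereas mode imputation also applies to categorical ones such as the country name in this example.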
Data Duplication: One of the major challenges in
MongoDB document-oriented databases is data duplication.
To address this issue, we use the following process:
1. Select a collection from the database.
2. Retrieve a document (excluding its unique identifier).
3. Iterate through all documents in the collection and
check for duplicates by comparing them to the
previously retrieved document.
4. Repeat steps 2 and 3 for each document in the
collection.
5. Display the list of duplicate documents found.
6. The user decides whether to delete the duplicate
documents based on the results.
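The process above compares each document with every other document. As an illustration only, the sketch below replaces the pairwise comparison with a grouping on a canonical JSON serialization of each document (identifier excluded), which yields the same duplicate groups for JSON-serializable content in a single pass. The names are hypothetical, and this is not presented as the implementation used in our tool.

```python
import json
from collections import defaultdict

def content_key(doc, id_field="_id"):
    """Serialize a document without its identifier, independent of field order."""
    body = {k: v for k, v in doc.items() if k != id_field}
    return json.dumps(body, sort_keys=True, default=str)

def find_duplicates(docs, id_field="_id"):
    """Return lists of identifiers whose documents have identical content."""
    groups = defaultdict(list)
    for d in docs:
        groups[content_key(d, id_field)].append(d[id_field])
    return [ids for ids in groups.values() if len(ids) > 1]

def remove_duplicates(docs, id_field="_id"):
    """Keep the first document of every duplicate group."""
    seen, kept = set(), []
    for d in docs:
        key = content_key(d, id_field)
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept

# Hypothetical sample documents.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10},
    {"_id": 2, "confirmed": 10, "country": "Algeria"},
    {"_id": 3, "country": "France", "confirmed": 5},
    {"_id": 4, "country": "Algeria", "confirmed": 10},
]
duplicates = find_duplicates(docs)
deduplicated = remove_duplicates(docs)
```

Listing the groups first and deleting only afterwards mirrors steps 5 and 6, where the user reviews the duplicates before any removal.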
The MFU method enhances data quality in MongoDB through
a two-step process: detection and repair.
Detection: MFU identifies schema overlaps by analyzing
document structures and reporting schema frequencies. It
detects data incompleteness by comparing documents to the
most frequent schema and flags missing attributes. For
duplication, it compares document contents (excluding IDs) to
find redundant entries.
Repair: The most frequent schema is applied to standardize
documents: existing values are preserved, and missing ones are
filled with Null. Incomplete records are either removed or
repaired using imputation (e.g., mean or mode). Duplicate
documents are listed for user review and optional deletion.
This approach ensures consistent, complete, and duplicate-free
data suitable for reliable analysis.
IV. EXPERIMENTAL EVALUATION AND RESULTS
Ensuring data quality after detection and correction is
crucial for evaluating the effectiveness of the proposed
method. Firstly, we developed a data quality management tool
for MongoDB. We ran our experiments on a standard Linux
machine equipped with a 2.4 GHz dual-core CPU, 8 GB of
RAM, and 350 GB of standard storage. The approach is
implemented in Python 3.10.4, and the data is managed in
MongoDB. We use a real-world COVID-19 database,
published on February 14, 2022, and provided by Johns
Hopkins University (JHU) [9]. According to MongoDB
developer Maxime Beugnet, the COVID-19 database serves as
an excellent dataset for educational purposes and personal
projects. Table 1 provides a description of the database used.
TABLE I
DESCRIPTION OF DATASET
Name of database: Covid19
Number of collections: 3
Collection        Number of documents
Beford            1
Benford_view      193
Dataset           11000
Total             11194
The following figure displays the main interface of our
tool.
Fig. 1 Data quality management tool
Our approach begins with the detection and repairing of
schema overlap. The result of data detection is illustrated in
Fig. 2.
Fig. 2 Schema overlap repair interface
The proposed approach is used to apply the most frequently
occurring schema to the collection.
Afterward, the approach scans the entire collection to
ensure that no schema inconsistencies remain (see Fig.3).
Fig. 3 Example of validation of data quality management
We follow the same steps for the other two issues, applying
the defined corrections accordingly.
We conducted both qualitative and quantitative evaluations.
The quality verification process assesses the reliability of
the repaired data by reintroducing the output of the MFU
method as input. If MFU effectively enhances data quality, the
final output should align with the expected clean data, thereby
validating the success of the repair process.
In the quantitative evaluation, we focus on the following
metrics:
1. Measure the number of documents that follow the
correct or consistent schema within the same
collection. This metric reflects the representational
dimension of data quality.
2. Measure the amount of missing data to evaluate the
completeness dimension of data quality.
3. Measure the number of duplicate records in the dataset
to assess the concise representation dimension of data
quality.
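One possible way to compute the three measures is sketched below. The exact formulas are assumptions made for illustration: conformity is the share of documents whose field set equals the most frequent schema, missing data is the share of empty (absent or None) cells with respect to that schema, and duplication is the share of documents that repeat an earlier one. All names and sample values are hypothetical.

```python
import json
from collections import Counter

def _fields(doc, id_field="_id"):
    """Set of top-level field names, identifier excluded."""
    return frozenset(k for k in doc if k != id_field)

def _body(doc, id_field="_id"):
    """Canonical serialization of a document without its identifier."""
    return json.dumps({k: v for k, v in doc.items() if k != id_field},
                      sort_keys=True, default=str)

def quality_metrics(docs, id_field="_id"):
    """Return the three measures as percentages for a non-empty collection."""
    if not docs:
        raise ValueError("the collection is empty")
    n = len(docs)
    pivot, pivot_count = Counter(_fields(d, id_field) for d in docs).most_common(1)[0]
    cells = n * len(pivot)
    missing = sum(1 for d in docs for a in pivot if d.get(a) is None)
    duplicates = sum(c - 1 for c in Counter(_body(d, id_field) for d in docs).values())
    return {
        "schema_conformity_pct": 100.0 * pivot_count / n,
        "missing_pct": 100.0 * missing / cells if cells else 0.0,
        "duplicate_pct": 100.0 * duplicates / n,
    }

# Hypothetical sample documents.
docs = [
    {"_id": 1, "country": "Algeria", "confirmed": 10},
    {"_id": 2, "country": "France", "confirmed": None},
    {"_id": 3, "country": "Italy"},
    {"_id": 4, "country": "Algeria", "confirmed": 10},
]
metrics = quality_metrics(docs)
```

Running the same function on a collection before and after repair gives a simple before/after comparison for each dimension.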
Table 2 provides the results for two collections: Benford_view
(193 documents) and Dataset (11000 documents).
TABLE II
EXPERIMENTAL RESULT
Collection        Measure for Dimension 1    Measure for Dimension 2    Measure for Dimension 3
Benford_view      98.5%                      0.1%                       0%
Dataset           98.1%                      0.5%                       0%
The experimental results demonstrate the effectiveness of
the MFU-based approach in detecting and repairing data
quality issues in MongoDB. By applying the most frequently
used schema, handling schema overlap, missing values, and
eliminating duplicate documents, our method improves data
consistency, completeness, and conciseness. The verification
process confirms that the repaired data aligns with expected
quality standards, validating the reliability of the proposed
approach.
V. CONCLUSION
We proposed a data quality management approach for
document-oriented NoSQL databases, focusing on schema
overlap, data incompleteness, and duplication issues. Using
the Most Frequently Used (MFU) method, our approach
detects and corrects these issues by identifying the most
common schema, imputing missing values, and removing
redundant records.
Through experimental evaluation on a real-world dataset,
we demonstrated the effectiveness of our method in improving
data consistency, completeness, and conciseness. The
verification process confirmed the reliability of the repaired
data, ensuring that the proposed approach enhances data
quality in MongoDB databases.
Future research should investigate machine learning
techniques and automated schema evolution tracking to
improve data quality in NoSQL systems, particularly for
large-scale datasets.
REFERENCES
[1] J.M. Juran, F.M. Gryna, and R.S. Bingham, Quality control handbook,
New York: McGraw-hill, 1979, vol. 3.
[2] L.P. English, Improving data warehouse and business information
quality: methods for reducing costs and increasing profits. John Wiley
& Sons, Inc, 1999.
[3] I. F. Ilyas and X. Chu, “Trends in cleaning relational data: Consistency
and deduplication”, Found. Trends Databases, vol. 5, no. 4, pp. 281-
393, 2015.
[4] A. Hadhiatma, Improving data quality in the linked open data: a
survey, Journal of Physics: Conference Series, vol. 978, no. 1,
pp.012026, 2018.
[5] U. Störl, D. Müller, A. Tekleab, S. Tolale, J. Stenzel, M. Klettke, and
S. Scherzinger, Curating variational data in application development,
IEEE 34th International Conference on Data Engineering (ICDE), pp.
1605-1608, 2018.
[6] M. L. Möller, M. Klettke, and U. Störl, Keeping NoSQL databases up to
date: semantics of evolution operations and their impact on data
quality, 2019.
[7] E. Cristalli, F. Serra, and A. Marotta, Data quality evaluation in
document oriented data stores, In Advances in Conceptual Modeling:
ER 2018 Workshops Emp-ER, MoBiD, MREBA, QMMQ, SCME, pp.
309-318, Springer International Publishing, 2018.
[8] A. Conrad, M.L. Möller, T. Kreiter, J.C. Mair, M. Klettke, and U.
Störl, EvoBench: Benchmarking schema evolution in NoSQL.
In Performance Evaluation and Benchmarking: 13th TPC Technology
Conference, TPCTC 2021, Copenhagen, Denmark, August 20, 2021.
[9] (2022) The MongoDB website. [Online]. Available:
https://www.mongodb.com/developer/article/johns-hopkins-university-
covid-19-graphql-api/
[10] A. Aggoune, “An Overview on the Mapping Techniques in NoSQL
Databases”, Int. J. Inf. Appl. Math., vol. 3, no. 2, pp. 53-65, 2020.
[11] M.L. Möller, D. Hausler, S. Strasser, T. Auge, and M. Klettke et al.,
Heterogeneity in NoSQL Databases-Challenges of Handling schema-
less Data. LWDA. pp. 134-145, 2023.
[12] C. Asaad, K. Baïna, M. Ghogho, Investigating the Perceived Usability
of Entity-Relationship Quality Frameworks for NoSQL Databases. In:
Mosbah, M., Kechadi, T., Bellatreche, L., Gargouri, F. (eds) Model and
Data Engineering. MEDI 2023. Lecture Notes in Computer Science,
vol 14396. Springer, Cham.
[13] A.A. Frozza, E.D. Defreyn, and R. Mello, A process for inference of
columnar NoSQL database schemas, in Anais do Simpósio Brasileiro
de Banco de Dados (SBBD). Anais do XXXV Simpósio Brasileiro de
Bancos de Dados, SBC, pp. 175-180, 2020.
[14] R. Bouhamoum et al., “Scaling up schema discovery for RDF datasets”,
in 2018 IEEE 34th International Conference on Data Engineering
Workshops (ICDEW), pp. 84-89, 2018.
[15] A. Aggoune, and M.S. Namoune, Metadata-driven data migration from
object-relational database to nosql document-oriented database,
Comput. Sci, vol.23, 2022.
[16] R. Y. Wang and D. M. Strong, “Beyond Accuracy: What Data Quality
Means to Data Consumers”, J. Manag. Inf. Syst, vol.12, no.4, pp. 5-33,
1996.
[17] A. Aggoune and M.S. Namoune, “P3 Process for Object-Relational
Data Migration to NoSQL Document-Oriented Datastore”, Int. J.
Softw. Sci. Comput. Intell, vol.14, no.1, pp. 1-20, 2022.
IDEAS: National Conference of Innovation on Data Engineering and AI Science.
June 17-18, 2025, Oran, Algeria.
Demand-aware drug assignment in manipulator
arm automated dispensing systems via graph
convolutional network ranking
Yassine Bouhelassa #1, Khalid Hachemi 2
1,2LGPMI, Institute of Maintenance and Industrial Safety, University of Oran 2
Mohamed Ben Ahmed, B.P1015 El M'naouer 31000 Oran, Algeria
1bouhelassayassine777@gmail.com
2hachemi.khalid@univ-oran2.dz
Abstract: This paper introduces a GCN-based ranking model to
optimize drug placements within Automated Drug Dispensing
Systems (ADDS). Our method defines drugs as nodes that
incorporate features extracted from their monthly consumption
frequency, linking drugs in an interdependence graph through a
co-use matrix. The graph convolutional network uses message-
passing procedures to derive abstract representations that reveal
drug demand behaviour and co-prescription relationships before
generating drug scoring values. An optimized placement matrix is
then constructed by applying these ranking scores to position
high-value drugs in central compartment locations of a
manipulator arm-based ADDS. The optimized placement
configuration resulted in a 27% reduction in average retrieval
times compared to random drug positioning approaches.
Additionally, t-SNE analysis of the drug embeddings produced
meaningful clusters corresponding to drug relevancy. This
approach is fully adaptive to different ADDS systems,
demonstrating its potential for operational improvements in
healthcare facilities.
Keywords: Artificial intelligence, Graph Neural Networks,
Automated Drug Dispensing Systems, Graph Convolutional
Networks, Healthcare logistics, Drug assignment Optimization.
I. INTRODUCTION
Modern healthcare achieves better drug retrieval efficiency
through Automated Drug Dispensing Systems (ADDS) which
minimize human error in pharmaceutical management.
Automated drug distribution systems deliver drugs up to 94%
faster than traditional manual methods [1]. However, one of
the primary challenges in ADDS is the optimal arrangement of
medications, as utilization rates vary among products and the
combinations of drugs in prescriptions differ.
In this work, we present a manipulator-arm-based automated
drug dispensing system that offers improved accuracy in drug
retrieval operations (Fig 1). Medications are stored in
standardized bins within a rack system, and a robotic arm
follows optimized routes to retrieve drugs efficiently. We
employ a Graph Convolutional Network (GCN) to implement
an effective placement strategy by learning latent
representations of drugs based on their monthly consumption
frequencies and co-usage relationships. The GCN calculates
drug retrieval priority scores, which are then used to position
high-priority drugs in the most accessible compartments,
specifically the leftmost columns and bottom rows, thereby
reducing overall retrieval times.
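The placement idea can be illustrated with a small sketch. It assumes that the priority scores are already available (here they are hypothetical numbers standing in for the GCN output), that row 0 is the bottom row and column 0 the leftmost column, and that a compartment's accessibility is approximated by the sum of its row and column indices. None of these choices is taken from the retrieval time model described later in the paper.

```python
def build_placement(scores, n_rows, n_cols):
    """Assign the highest-scoring drugs to the cells nearest the bottom-left corner."""
    if len(scores) > n_rows * n_cols:
        raise ValueError("more drugs than compartments")
    # Order cells by assumed access cost (row + column); ties favour the leftmost column.
    cells = sorted(
        ((r, c) for r in range(n_rows) for c in range(n_cols)),
        key=lambda rc: (rc[0] + rc[1], rc[1], rc[0]),
    )
    # Rank drugs from highest to lowest priority score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return dict(zip(ranked, cells))

# Hypothetical priority scores for four drugs in a 2 x 2 rack.
scores = {"paracetamol": 0.9, "ibuprofen": 0.7, "amoxicillin": 0.4, "insulin": 0.8}
placement = build_placement(scores, n_rows=2, n_cols=2)
```

In this toy example the drug with the highest score receives the bottom-left compartment, and the lowest-scoring drug receives the compartment farthest from it.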
Our proposed system builds upon recent advances in graph-
based ranking [2], [3], [4], [5]. By adapting these techniques to
the ADDS location assignment task, our approach achieves
significantly faster mean retrieval times and demonstrates
enhanced performance and consistency in drug dispensing
operations.
The remainder of this paper is organized as follows. Section
2 reviews related literature on storage optimization and graph-
based ranking methods. Section 3 formalizes the notation and
introduces Graph Neural Networks as the foundation of our
approach. Section 4 describes our research methodology,
including data preparation, graph construction, the GCN-based
ranking model, the design of the optimized placement matrix,
and the ADDS retrieval time model. Section 5 presents our
experimental results, which include comparisons with random
placement strategies and visualizations of drug embeddings.
Finally, Section 6 concludes the paper and outlines directions
for future work.
II. LITERATURE REVIEW
Healthcare facilities must arrange their medications properly
in automated drug dispensing systems, a task known as the
Storage Location Assignment Problem (SLAP). Optimizing
storage positions supports safe medication delivery and
improves workplace efficiency [6]. Reyes et al. show that the main
challenge is reconciling different medication distribution methods
with efficient retrieval and space usage [7].
Hausman et al. [8] and Roodbergen and Vis [9] laid the
groundwork by identifying the main storage methods.
Random storage places products anywhere in the ADDS, which saves
space but lengthens the travel path to items. Fixed storage
locations make products easy to identify but leave space
unused when certain items are understocked. By contrast,
the class-based storage method groups products according to defined
criteria to create clear zones that help staff retrieve medicines
faster and with fewer errors [10]. The cube-per-order index (COI)
method was developed to balance storage space use against
picking speed.
Fig 1 Internal structure of a manipulator-arm Automated Drug Dispensing
System [11]
A considerable body of research has applied mathematical
models to storage optimization problems. Atmaca and Ozturk
[12] established models to manage inventory costs, but these
methods performed well only in controlled scenarios and were
limited in scale. Chaker and Hachemi [13] created distribution
patterns that placed different types of medicines far apart to
help hospital workers avoid errors when picking orders.
Esmaili et al. [14] created two mixed integer programming
models, the first optimizing warehouse performance and the second
also considering product placement requirements; in their 2018
study, these models proved superior to prior methods.
More formal mathematical tools have also been applied to this problem.
Hachemi and Alla [15] used Petri nets to
enforce safety protocols and optimize storage space without
exceeding drawer capacity. Hachemi and Amari [16] proposed a
Min-Plus control method, along with a mathematical proof, to
solve the SLAP while ensuring drug distribution effectiveness.
Recent research by Bouhelassa et al. examined classical and
fuzzy analytic hierarchy methods to improve how well
medicines are stocked each month [16].
III. PRELIMINARIES
We begin by formalizing our notation for Information
Retrieval (IR) and Graph Neural Networks topics. In our
notation, bold symbols denote both matrices and vectors, with
uppercase letters representing matrices and lowercase letters
representing vectors (e.g., M, v). Scalars are denoted by simple
italic letters.
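Let $G = (V, E)$ be an undirected graph, where $V$ is the
set of nodes and $E \subseteq V \times V$ is the set of edges. The number of
nodes in $G$ is denoted by $n = |V|$, and the number of edges by $m = |E|$.
The neighbourhood of a node $v$, $\mathcal{N}(v)$, has cardinality
$|\mathcal{N}(v)|$. The diagonal matrix $\mathbf{D}$ is defined such that its
$i$-th diagonal element is $|\mathcal{N}(v_i)|$.
The feature vector of node $v_i$ is represented by $\mathbf{x}_i \in \mathbb{R}^{d}$.
These node representations are arranged into an instance matrix
$\mathbf{X} \in \mathbb{R}^{n \times d}$, defined as:
$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^\top$
The set of edges can also be expressed as the adjacency
matrix $\mathbf{A} = (a_{ij})$, where $a_{ij} = 1$ if nodes $(v_i, v_j)$ are
connected, and $a_{ij} = 0$ otherwise.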
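As a concrete illustration of this notation, the following minimal Python sketch builds the adjacency matrix and the diagonal (degree) matrix of a small undirected graph. The four-node edge list and the function names are hypothetical and serve only as an example.

```python
def adjacency_matrix(n, edges):
    """Return the n x n 0/1 adjacency matrix of an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: store both directions
    return A


def degree_matrix(A):
    """Return the diagonal matrix whose i-th entry is the neighbourhood size of node i."""
    n = len(A)
    return [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]


# Hypothetical toy graph with 4 nodes and 3 edges.
edges = [(0, 1), (0, 2), (2, 3)]
A = adjacency_matrix(4, edges)
D = degree_matrix(A)
```

Here node 0 has two neighbours, so the first diagonal entry of `D` is 2.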
A. Graph Neural Networks
Graph Neural Networks (GNNs) are designed to learn and
extract features from graph-structured data. Given a collection
of graphs $\{G_k\}_{k=1}^{K}$, where each graph $G_k = (V_k, E_k)$
has an instance matrix $\mathbf{X}_k$ and an adjacency matrix
$\mathbf{A}_k$, GNNs utilize a message passing formalism to
extract features from these datasets [17].
In this formalism, the instance matrix $\mathbf{X}$ for a graph $G$ is
iteratively updated through a series of layers during the forward
pass. The intermediate representation at layer $l$ is denoted by
$\mathbf{H}^{(l)}$ for $l = 1, \ldots, L$, with the initial representation given by
$\mathbf{H}^{(0)} = \mathbf{X}$. After $L$ layers, we obtain a latent feature matrix
$\mathbf{H}^{(L)} \in \mathbb{R}^{n \times d'}$ for graph $G$. Given a set of weights and
parameters $\theta^{(l)}$ for each layer $l$, the message passing update
rule is expressed as:
$\mathbf{h}_v^{(l+1)} = \mathrm{UPDATE}\big(\mathbf{h}_v^{(l)},\ \mathrm{AGGREGATE}(\{\mathbf{h}_u^{(l)} : u \in \mathcal{N}(v)\});\ \theta^{(l)}\big)$
where $\mathrm{UPDATE}$ and $\mathrm{AGGREGATE}$ represent the update and
aggregation functions, respectively, and $\mathbf{h}_v^{(l)}$ is the row of
$\mathbf{H}^{(l)}$ for node $v$. Several message passing
schemes exist that differ in the choice of these functions [18],
[19], [20], [21]. The new node representations, denoted as $\mathbf{H}^{(L)}$,
obtained from this process are then used for downstream tasks
such as node classification, graph classification, or link
prediction.
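To make the message passing idea concrete, the sketch below implements one layer with a mean aggregation and a simple ReLU update. Both choices, the scalar `weight` parameter, and the three-node toy graph are assumptions made for illustration; they are not the specific functions used in this paper.

```python
def message_passing_layer(H, neighbours, weight):
    """One layer: mean-aggregate neighbour features, then update.

    H          -- list of per-node feature vectors (lists of floats)
    neighbours -- neighbours[v] is the list of nodes adjacent to v
    weight     -- hypothetical scalar parameter of this layer
    """
    new_H = []
    for v, h_v in enumerate(H):
        dim = len(h_v)
        # AGGREGATE: mean of the neighbours' current representations.
        if neighbours[v]:
            agg = [sum(H[u][k] for u in neighbours[v]) / len(neighbours[v])
                   for k in range(dim)]
        else:
            agg = [0.0] * dim
        # UPDATE: combine own state with the aggregate, then apply ReLU.
        new_H.append([max(0.0, weight * (h_v[k] + agg[k])) for k in range(dim)])
    return new_H


# Hypothetical star graph: node 0 is connected to nodes 1 and 2.
neighbours = {0: [1, 2], 1: [0], 2: [0]}
H0 = [[1.0], [2.0], [4.0]]
H1 = message_passing_layer(H0, neighbours, weight=0.5)
```

Stacking several such layers lets information travel along longer paths in the graph, one hop per layer.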
B. Drug Ranking and Re-ranking
In our study, rather than a traditional information retrieval
scenario involving queries and documents, our objective is to
rank drugs based on their monthly consumption frequencies
and co-usage relationships. Instead of retrieving documents, we
assess drugs to determine which should be prioritized for
placement in the ADDS.
We adopt a two-stage ranking approach. First, an initial
ranking is generated using intrinsic drug features. Let
$f_1, \ldots, f_n$ denote the monthly consumption frequencies for
$n$ drugs, forming the feature matrix $\mathbf{X} \in \mathbb{R}^{n \times 1}$, where $X_{i1} = f_i$.
Additionally, the co-usage matrix $\mathbf{C} \in \mathbb{R}^{n \times n}$ captures pairwise
co-usage relationships; an edge exists between drugs $i$ and $j$ if
$C_{ij} > 0$, with the edge weight set to $C_{ij}$.
In the second stage, this initial ranking is refined by
employing a Graph Convolutional Network (GCN) that
aggregates local neighbourhood information via message
passing. Specifically, the first GCN layer computes:
$\mathbf{H}^{(1)} = \mathrm{ReLU}\big(\mathrm{GCN}_1(\mathbf{X}, \mathbf{A})\big)$
(1)
and the second layer refines these representations:
$\mathbf{H}^{(2)} = \mathrm{GCN}_2\big(\mathbf{H}^{(1)}, \mathbf{A}\big)$
(2)
where $\mathbf{A}$ is the adjacency matrix weighted by the co-usage values.
Finally, a fully connected layer projects these refined
representations to a scalar ranking score for each drug:
$s_i = \mathbf{w}^\top \mathbf{h}_i^{(2)} + b$
(3)
for $i = 1, \ldots, n$, where $\mathbf{h}_i^{(2)}$ is the row of $\mathbf{H}^{(2)}$ for drug $i$.
These scores capture both the intrinsic
importance of a drug and the influence of its co-usage
relationships.
The optimized placement matrix is then constructed by sorting
drugs in descending order of their scores $s_i$ and assigning them to the most
accessible compartments in the ADDS. Specifically, the
leftmost columns minimize horizontal travel distance to the
retrieval station, while the bottom rows leverage gravitational
acceleration for faster vertical retrieval in the Free-Fall Flow
Rack design. This two-stage approach effectively refines the
initial ranking and results in a placement strategy that
minimizes retrieval times.
IV. METHODOLOGY
A. Problem Definition
The Automated Drug Dispensing System (ADDS) is a
cutting-edge solution designed to enhance the efficiency of
pharmaceutical storage and retrieval operations. In our
manipulator-arm-based ADDS, medications are stored in a rack
divided into discrete storage bins. Each bin has uniform
dimensions with a fixed length and height [11]. Drugs are
placed in these bins, and a robotic arm, operating from
designated input/output (I/O) points, retrieves medications by
following an optimal route determined by an intelligent control
algorithm.
The retrieval process begins at an I/O point, from which the
robotic arm travels to the bin containing the required drug and
then returns to the I/O point. For single-command operations,
the travel time is calculated as the maximum of the
horizontal and vertical travel times between the I/O point and
the target bin. For dual-command operations, where the arm
retrieves two bins sequentially, the travel time is computed as
the sum of the times for each leg of the journey. These
calculations assume a constant speed and ignore acceleration,
deceleration, and loading/unloading durations.
Our goal is to optimize the assignment of drugs to storage
bins such that overall retrieval times are minimized. This
optimization considers two key data sources: monthly
consumption and the co-use matrix.
Prescription data, generated from these two data sources,
is used to evaluate the performance of the placement strategy.
The objective is to assign drugs to the most accessible
compartments, specifically the bins in the leftmost columns and
the bottom rows, so that frequently used and commonly co-
prescribed medications can be retrieved quickly. The problem
is constrained by the uniformity of bin dimensions, the rule that
each bin holds only one drug, and the predefined movement
model of the robotic arm.
B. Data Preparation
Our dataset comprises three key components:
Drug Frequencies: Monthly consumption data
indicating how often each drug is used.
Co-Use Matrix: A matrix with values between 0 and
1 that quantifies the co-usage relationships between
drugs.
Prescription Data: Generated based on the drug
frequencies and co-use matrix, each prescription lists
the drugs that are to be retrieved.
To evaluate the performance of the optimized drug
placement, simulated prescription data is generated to mimic
realistic usage scenarios. A total of 200 prescriptions are
synthesized, with each prescription containing 3 distinct
medications. The generation algorithm uses the normalized
consumption frequencies of the drugs. If $f_i$ represents the
consumption frequency of drug $i$, then the normalized
frequency is given by:
$\tilde{f}_i = \frac{f_i}{\sum_{j} f_j}$
(4)
This determines the probability of selecting drug $i$ as the first
medication:
$P(i) = \tilde{f}_i$
(5)
For subsequent drugs, the selection probability is adjusted
based on the co-use matrix $\mathbf{C}$:
󰇛
󰇜

(6)
with $S$ being the set of drugs already selected. This method
ensures that drugs with strong co-usage relationships are more
likely to appear together in the generated prescriptions.
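The sampling procedure can be sketched in Python as follows. The first drug is drawn from the normalized frequencies, as described above. For later drugs, the exact adjustment formula is not reproduced here; the sketch assumes, purely for illustration, that each candidate's normalized frequency is multiplied by one plus its summed co-use with the drugs already selected. The toy frequencies, the co-use values, and the function name are likewise hypothetical.

```python
import random


def generate_prescription(freqs, co_use, size=3, rng=None):
    """Sample `size` distinct drug indices for one prescription."""
    if rng is None:
        rng = random.Random()
    total = sum(freqs)
    norm = [f / total for f in freqs]  # normalized consumption frequencies
    selected = []
    while len(selected) < size:
        candidates = [j for j in range(len(freqs)) if j not in selected]
        # ASSUMPTION: boost each candidate by its co-use with drugs already chosen.
        # With nothing selected yet, the weights reduce to the normalized frequencies.
        weights = [norm[j] * (1.0 + sum(co_use[i][j] for i in selected))
                   for j in candidates]
        selected.append(rng.choices(candidates, weights=weights, k=1)[0])
    return selected


# Hypothetical toy data: 5 drugs, symmetric co-use values in [0, 1].
freqs = [50, 30, 10, 5, 5]
co_use = [[0.0, 0.8, 0.0, 0.0, 0.1],
          [0.8, 0.0, 0.2, 0.0, 0.0],
          [0.0, 0.2, 0.0, 0.5, 0.0],
          [0.0, 0.0, 0.5, 0.0, 0.0],
          [0.1, 0.0, 0.0, 0.0, 0.0]]
rng = random.Random(0)  # fixed seed for reproducibility
prescriptions = [generate_prescription(freqs, co_use, rng=rng) for _ in range(200)]
```

Each generated prescription contains three distinct drug indices, matching the setup of 200 prescriptions with 3 medications each.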
Performance is evaluated based on key metrics derived from
the optimized placement, which will be detailed in the Results
section. These data sources are preprocessed and normalized to
ensure consistency before further analysis.
C. Graph Construction
In our study, the graph is constructed to represent the
relationships among 150 drugs. Each drug is modeled as a node,
and its feature vector is derived from its monthly consumption
frequency. Specifically, let $f_i$ denote the frequency of drug $i$;
then the feature vector for drug $i$ is simply $\mathbf{x}_i = [f_i]$.
The co-use matrix $\mathbf{C}$, a $150 \times 150$ matrix, quantifies the co-
usage relationship between drugs. An edge is established
between drug $i$ and drug $j$ if the corresponding entry in the co-
use matrix, $C_{ij}$, is greater than zero. The edge weight is set to $C_{ij}$,
which reflects the strength of the co-usage relationship. Since
the matrix is sparse, only significant co-usage relationships are
represented as edges.
The resulting graph $G = (V, E)$ is undirected and does
not include self-loops. Here,
$V$ is the set of nodes (drugs), with $|V| = 150$.
$\mathbf{X} \in \mathbb{R}^{150 \times 1}$ is the instance matrix, where each row
corresponds to the consumption frequency of a drug.
$E$ is the set of edges defined by the nonzero entries in
the co-use matrix.
$\mathbf{A} \in \mathbb{R}^{150 \times 150}$ is the weighted adjacency matrix, where
$A_{ij} = C_{ij}$ if $(i, j) \in E$ and $A_{ij} = 0$ otherwise. This
preserves the co-usage strength between drugs.
This graph structure captures the underlying relationships
among drugs, which is crucial for the subsequent learning of
effective representations using Graph Neural Networks [17],
[18], [19], [20], [21]. The graph depends only on the drug data, so
the same construction can be applied regardless of the
ADDS dimensions.
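The construction described in this subsection can be sketched as follows. The function name and the three-drug toy data are hypothetical, and the sketch assumes a symmetric co-use matrix so that only the upper triangle needs to be scanned.

```python
def build_co_use_graph(freqs, co_use):
    """Return (X, edges): node features and weighted undirected edges."""
    n = len(freqs)
    X = [[float(f)] for f in freqs]  # one feature per drug: its frequency
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):  # j > i skips self-loops and visits each pair once
            if co_use[i][j] > 0:
                edges[(i, j)] = co_use[i][j]  # edge weight = co-use value
    return X, edges


# Hypothetical toy data for 3 drugs.
freqs = [50, 30, 10]
co_use = [[0.0, 0.8, 0.0],
          [0.8, 0.0, 0.2],
          [0.0, 0.2, 0.0]]
X, edges = build_co_use_graph(freqs, co_use)
```

Storing only the nonzero pairs keeps the representation compact when the co-use matrix is sparse.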
D. GCN-Based Ranking Model
We employ a Graph Convolutional Network (GCN) [19] to
learn latent representations of drugs and compute their ranking
scores. In our model, the input feature matrix $\mathbf{X}$ (derived from
the monthly consumption frequencies) is combined with the
weighted edge structure from the co-use matrix. The GCN
aggregates local neighborhood information through a series of
message passing layers, enabling each drug's representation to
be informed by its own frequency as well as by the co-usage
relationships with its neighbors.
Specifically, the first graph convolutional layer applies a linear
transformation followed by a nonlinear activation (ReLU) to
produce an intermediate representation $\mathbf{H}^{(1)}$. This layer
aggregates features from neighboring nodes, weighted by the
co-use values, as in equation (1). A second graph convolutional
layer further refines these representations (equation (2)). Finally,
a fully connected layer projects the refined representations to a
scalar ranking score for each drug (equation (3)). These ranking
scores, which capture both the intrinsic drug frequency and the
influence of co-usage relationships, are then used to prioritize
drugs for placement within the ADDS.
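A forward pass of such a model can be sketched in plain Python. The sketch assumes the symmetric normalization with self-loops from the GCN formulation of [19], applies no activation after the second layer, and takes the weight matrices as given inputs; in practice they would be learned, and the paper's own implementation (in PyTorch) may differ in all of these details. The function names are hypothetical.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def relu(M):
    """Apply ReLU element-wise."""
    return [[max(0.0, x) for x in row] for row in M]


def gcn_scores(X, A, W1, W2, w_out, b_out):
    """Two graph-convolution layers followed by a linear scoring layer."""
    A_norm = normalize_adjacency(A)
    H1 = relu(matmul(matmul(A_norm, X), W1))  # first layer, cf. equation (1)
    H2 = matmul(matmul(A_norm, H1), W2)       # second layer, cf. equation (2)
    # Fully connected projection to one score per drug, cf. equation (3).
    return [sum(h * w for h, w in zip(row, w_out)) + b_out for row in H2]


# Hypothetical toy input: 3 drugs, weighted co-use adjacency, untrained weights.
X = [[0.5], [0.3], [0.2]]
A = [[0.0, 0.8, 0.0],
     [0.8, 0.0, 0.2],
     [0.0, 0.2, 0.0]]
scores = gcn_scores(X, A, W1=[[1.0, 0.5]], W2=[[1.0], [1.0]], w_out=[1.0], b_out=0.0)
```

Because each layer mixes a drug's features with those of its co-used neighbours, the final score reflects both its own frequency and its neighbourhood.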
E. Optimized Placement Matrix
Based on the computed ranking scores, drugs are sorted in
descending order. The optimized placement matrix is
constructed by mapping this sorted order into the physical
layout of the storage rack. In our system, the most accessible
compartments are located in the leftmost columns and bottom
rows. Thus, the highest ranked drugs are assigned to these
positions, thereby minimizing retrieval times.
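The mapping from ranking scores to rack positions can be sketched as below. The fill order is an assumption made for illustration: bins are filled column by column from the left, and within each column from the bottom row upward, with row index 0 denoting the bottom row. The function name and the toy scores are hypothetical.

```python
def build_placement(scores, n_rows, n_cols):
    """Assign drugs to bins; returns a matrix with row 0 as the bottom row.

    ASSUMPTION: bins are filled column by column from the left, and within
    each column from the bottom row upward.
    """
    if len(scores) > n_rows * n_cols:
        raise ValueError("more drugs than bins")
    # Drug indices sorted by descending ranking score.
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    placement = [[None] * n_cols for _ in range(n_rows)]
    for position, drug in enumerate(ranked):
        col, row = divmod(position, n_rows)
        placement[row][col] = drug
    return placement


# Hypothetical scores for 4 drugs in a 2 x 2 rack.
placement = build_placement([0.1, 0.9, 0.5, 0.7], n_rows=2, n_cols=2)
```

In this toy case the highest-scoring drug (index 1) lands in the bottom-left bin, and unused bins, if any, stay `None`.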
F. ADDS and Retrieval Time Model
The Automated Drug Dispensing System (ADDS) is a cutting-
edge solution designed to enhance the efficiency of
pharmaceutical storage and retrieval processes. Existing
research on order picking assumes that orders have already
been assigned to the machines. The system comprises a storage
rack divided into discrete bins, each uniquely identified by
coordinates $(i, j)$, where $i$ represents the row and $j$ the
column. Each bin has fixed dimensions $d_L \times d_H$ (length by height, see
Table 1), ensuring a standardized storage environment. Drugs are
stored within these bins, and a robotic arm is employed to
execute pick-and-retrieve operations guided by an intelligent
control algorithm that computes the optimal route for retrieval
[11].
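The picking process initiates at an input/output (I/O) point,
which in our study is fixed at $(i_0, j_0)$. In our setup, we work
with a single rack where drugs are stored in discrete bins. From
this I/O point, the robotic arm travels to the bin containing the
required drug and then returns to the I/O point. The single-
command retrieval time for the robotic arm traveling from the
I/O point to a bin $(i, j)$ is expressed as:
$T_{SC}(i, j) = \max\left( \frac{|j - j_0| \, d_L}{v_x},\ \frac{|i - i_0| \, d_H}{v_y} \right)$
(7)
where $d_L$ and $d_H$ are the bin length and height, and $v_x$ and $v_y$ are the
horizontal and vertical speeds of the arm (Table 1).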
Fig 2 Dual command cycle operation of the robotic arm.
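For dual-command operations as in Fig 2, where the arm
retrieves two bins sequentially before returning to the I/O point,
the travel time is calculated as:
$T_{DC}(b_1, b_2) = \max\left( \frac{|j_1 - j_0| \, d_L}{v_x},\ \frac{|i_1 - i_0| \, d_H}{v_y} \right) + \max\left( \frac{|j_2 - j_1| \, d_L}{v_x},\ \frac{|i_2 - i_1| \, d_H}{v_y} \right) + \max\left( \frac{|j_0 - j_2| \, d_L}{v_x},\ \frac{|i_0 - i_2| \, d_H}{v_y} \right)$
(8)
where $b_1 = (i_1, j_1)$ and $b_2 = (i_2, j_2)$ are the two bins visited, $(i_0, j_0)$ is
the I/O point, $d_L$ and $d_H$ are the bin length and height, and $v_x$ and $v_y$ are
the horizontal and vertical speeds.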
Table 1 System parameters.
Parameter                        Value/Description
$d_L$ (Bin Length)               0.168 m
$d_H$ (Bin Height)               0.275 m
$v_{arm}$ (Robotic Arm Speed)    0.1486 m/s
$v_x$ (Horizontal Speed)         Typically equal to $v_{arm}$
$v_y$ (Vertical Speed)           Typically equal to $v_{arm}$
I/O Point Location               (x, y): starting/ending point for retrieval
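The travel-time model can be sketched in Python using the values of Table 1. Taking the maximum of the horizontal and vertical times corresponds to the arm moving along both axes simultaneously. The I/O location `(0, 0)`, the `(row, col)` tuple convention, and the function names are assumptions for illustration.

```python
BIN_LENGTH = 0.168  # m, Table 1
BIN_HEIGHT = 0.275  # m, Table 1
SPEED = 0.1486      # m/s, used for both axes as in Table 1


def leg_time(a, b):
    """Travel time between bins a=(row, col) and b=(row, col).

    The slower of the two axes determines the time of the leg.
    """
    horizontal = abs(a[1] - b[1]) * BIN_LENGTH / SPEED
    vertical = abs(a[0] - b[0]) * BIN_HEIGHT / SPEED
    return max(horizontal, vertical)


def single_command_time(bin_pos, io=(0, 0)):
    """One-way travel time from the I/O point to a bin, cf. equation (7)."""
    return leg_time(io, bin_pos)


def dual_command_time(bin1, bin2, io=(0, 0)):
    """I/O -> bin1 -> bin2 -> I/O, summing the three legs, cf. equation (8)."""
    return leg_time(io, bin1) + leg_time(bin1, bin2) + leg_time(bin2, io)


# Hypothetical bins: row 2, column 3 and row 0, column 5.
t_single = single_command_time((2, 3))
t_dual = dual_command_time((2, 3), (0, 5))
```

For the bin at row 2, column 3, the vertical distance (0.55 m) exceeds the horizontal one (0.504 m), so the vertical axis determines the single-command time.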
V. RESULTS AND DISCUSSION
In this section, we present and analyze the outcomes of our
GCN-based ranking approach for optimizing drug placement in
the ADDS (Fig 3). Our evaluation compares the optimized
placement against 30 random placements using several visual
and statistical measures.
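The comparison against random placements can be sketched as follows. As a simplification, the sketch scores a prescription by summing a fixed retrieval time per bin, which is an assumption and not the dual-command model of the paper. The toy drugs, bin times, prescriptions, and function names are all hypothetical.

```python
import random
from statistics import mean


def mean_retrieval_time(placement, prescriptions, bin_time):
    """Average, over prescriptions, of the summed per-drug retrieval times.

    placement -- dict mapping drug -> bin
    bin_time  -- dict mapping bin -> retrieval time (made-up values here)
    """
    return mean(sum(bin_time[placement[d]] for d in p) for p in prescriptions)


def random_placement(drugs, bins, rng):
    """Assign drugs to bins uniformly at random."""
    shuffled = list(bins)
    rng.shuffle(shuffled)
    return dict(zip(drugs, shuffled))


# Hypothetical toy instance: 4 drugs, 4 bins with made-up retrieval times (s).
drugs = [0, 1, 2, 3]
bin_time = {"b0": 1.0, "b1": 2.0, "b2": 3.0, "b3": 4.0}
prescriptions = [[0, 1], [0, 2], [0, 1]]
optimized = {0: "b0", 1: "b1", 2: "b2", 3: "b3"}  # most-used drug in fastest bin

rng = random.Random(42)
baseline_times = [
    mean_retrieval_time(random_placement(drugs, list(bin_time), rng), prescriptions, bin_time)
    for _ in range(30)
]
optimized_time = mean_retrieval_time(optimized, prescriptions, bin_time)
# Relative improvement over the average random placement.
improvement = (mean(baseline_times) - optimized_time) / mean(baseline_times)
```

Comparing against the mean of the 30 baselines, and not only the best one, gives a picture of typical random performance.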
It is important to note that the proposed approach is fully
adaptable to any ADDS, regardless of its specific architecture.
In systems employing dual-command operations, such as the
manipulator-arm-based design discussed here, our method
effectively leverages both consumption and co-usage data to
optimize drug placement. Similarly, in Free-Fall-Flow-Rack
AS/RS automated drug dispensing systems, the prescription
retrieval speed is determined by the maximum travel time
among the drugs in a prescription [1]. In such cases, our dual-
data approach remains equally applicable, as it naturally
identifies the critical drug whose retrieval time governs the
overall performance. This inherent flexibility underscores the
broad applicability and robustness of our method across various
ADDS designs.
Fig 4: Mean retrieval times for the optimized placement versus 30 random
placements.
Fig 4 shows a bar chart of the mean retrieval times. The
optimized placement achieves a lower average
retrieval time than any of the random configurations. This
reduction in mean retrieval time indicates that incorporating
drug consumption frequency and co-usage data into the GCN
framework leads to more efficient retrieval operations.
Fig 3: Optimized ADDS Placement Matrix
Fig 5: Drug embeddings visualized via t-SNE, colored by GCN-assigned
ranking scores.
The GCN's ability to discriminate between drugs becomes more
apparent in the t-SNE embeddings shown in Fig 5. Each point in
the scatter plot is a drug, and its ranking score is
represented by its color. The upper-left portion
contains drugs with higher ranking scores, displayed
in warmer colors, whereas lower-scoring drugs lie
in the bottom-right section, shown in cooler colors.
The even distribution of drug points across the plot,
together with this color gradient, indicates that the GCN
captures drug consumption relationships and co-
usage patterns.
Fig 6: Retrieval time distribution for the optimized placement versus the top
five random placements.
Further, a box-and-whisker plot (Fig 6) compares the
distribution of retrieval times for the optimized placement and
the top five best-performing random placements. The
optimized method not only has a lower median retrieval time
but also exhibits reduced variability, with fewer extreme
outliers. This robustness in performance suggests that our
approach provides both consistent and reliable improvements
in retrieval efficiency.
Overall, these results confirm that our GCN-based ranking
model effectively leverages drug consumption frequency and
co-usage data to generate an optimized placement strategy for
ADDS. The reduction in average retrieval time, an improvement of
approximately 27% over the average random placement,
demonstrates the practical benefits of our
method. The clear separation in the embedding space, as shown
by the t-SNE visualization, further validates that the GCN
captures meaningful relationships among drugs. By aligning
drug placement with actual usage patterns, our approach holds
the potential to reduce wait times, minimize errors, and enhance
the overall efficiency of pharmaceutical distribution in
healthcare settings.
VI. CONCLUSION
In this study we presented a GCN-based ranking approach
for optimizing drug placement within an Automated Drug
Dispensing System (ADDS). By integrating monthly
consumption frequencies and co-use relationships, our model
learns a meaningful embedding space and assigns higher scores
to drugs that require faster access. This ranking translates into
an optimized placement matrix that ensures that frequently used
or strongly co-used drugs are located in compartments that
minimize retrieval times.
Comparisons with 30 randomly generated placements
confirm the effectiveness of our strategy. In particular, our
method outperforms the average random placement by
approximately 27%. We choose to highlight this comparison
with the average random approach because it offers a robust
benchmark that reflects typical random performance, rather
than a single best-case scenario. This improvement underscores
the GCN's ability to capture and leverage important drug
interactions for practical gains in retrieval speed and
operational consistency.
Overall, the proposed solution demonstrates the potential of
graph-based models to enhance storage and retrieval efficiency
in healthcare environments. Future work could explore more
advanced architectures or incorporate additional drug attributes
to further refine the ranking process and adapt to evolving
clinical demands.
Funding Declaration: This research received no specific grant
from any funding agency in the public, commercial, or not-for-
profit sectors.
Ethics and Consent to Participate: Not applicable.
Competing Interests: The authors declare that they have no
competing interests.
Software and Tools: The machine learning models were
developed using PyTorch (version 2.6.0), an open-source deep
learning framework. The authors also used Python and various
other libraries to process and analyze the data.
REFERENCES
[1] D. Metahri and K. Hachemi, “A Performance Comparison of Manual
Dispensing and Automated Drug Delivery,” IJARPHM, vol. 5, no. 1,
pp. 1-13, Jan. 2020, doi: 10.4018/IJARPHM.2020010101.
[2] S. MacAvaney, N. Tonellotto, and C. Macdonald, “Adaptive Re-
Ranking with a Corpus Graph,” in Proceedings of the 31st ACM
International Conference on Information and Knowledge Management
(CIKM ’22), ACM, 2022. doi: 10.1145/3511808.3557231.
[3] P. Veličković, “Everything is Connected: Graph Neural Networks,”
Current Opinion in Structural Biology, vol. 79, p. 102538, 2023, doi:
10.1016/j.sbi.2023.102538.
[4] L. Pang, J. Xu, Q. Ai, Y. Lan, X. Cheng, and J. Wen, “Setrank:
Learning a permutation-invariant ranking model for information
retrieval,” in Proceedings of the 43rd International ACM SIGIR
Conference on Research and Development in Information Retrieval,
2020, pp. 499-508.
[5] Q. Wu et al., “Dual Graph Attention Networks for Deep Latent
Representation of Multifaceted Social Effects in Recommender
Systems,” in The World Wide Web Conference, ACM, 2019. doi:
10.1145/3308558.3313442.
[6] J. C.-H. Pan, P.-H. Shih, M.-H. Wu, and J.-H. Lin, “A storage
assignment heuristic method based on genetic algorithm for a pick-and-
pass warehousing system,” Computers & Industrial Engineering, vol.
81, pp. 1-13, Mar. 2015, doi: 10.1016/j.cie.2014.12.010.
[7] J. Reyes, E. Solano-Charris, and J. Montoya-Torres, “The storage
location assignment problem: A literature review,” International
Journal of Industrial Engineering Computations, vol. 10, no. 2, pp.
199-224, 2019, Accessed: Sep. 30, 2023. [Online]. Available:
http://growingscience.com/beta/ijiec/2956-the-storage-location-
assignment-problem-a-literature-review.html
[8] W. H. Hausman, L. B. Schwarz, and S. C. Graves, “Optimal Storage
Assignment in Automatic Warehousing Systems,” Management
Science, vol. 22, no. 6, pp. 629-638, 1976, doi: 10.1287/mnsc.22.6.629.
[9] K. J. Roodbergen and I. F. A. Vis, “A survey of literature on automated
storage and retrieval systems,” European Journal of Operational
Research, vol. 194, no. 2, pp. 343-362, Apr. 2009, doi:
10.1016/j.ejor.2008.01.038.
[10] V. R. Muppani and G. K. Adil, “A branch and bound algorithm for
class based storage location assignment,” European Journal of
Operational Research, vol. 189, no. 2, pp. 492-507, 2008, Accessed:
Jan. 09, 2025. [Online]. Available:
https://econpapers.repec.org/article/eeeejores/v_3a189_3ay_3a2008_3ai
_3a2_3ap_3a492-507.htm
[11] M. Yuan, N. Zhao, K. Wu, and Z. Chen, “The storage location
assignment problem of automated drug dispensing machines,”
Computers & Industrial Engineering, vol. 184, p. 109578, Oct. 2023,
doi: 10.1016/j.cie.2023.109578.
[12] E. Atmaca and A. Ozturk, “Defining order picking policy: A storage
assignment model and a simulated annealing solution in AS/RS
systems,” Applied Mathematical Modelling, vol. 37, no. 7, pp. 5069-
5079, Apr. 2013, doi: 10.1016/j.apm.2012.09.057.
[13] A. Chaker and K. Hachemi, “Evaluation des performances et pilotage
d’une armoire automatisée de dispensation de médicaments Mots clés,”
presented at the Congrès Lambda Mu 21 « Maîtrise des risques et
transformation numérique : opportunités et menaces », Oct. 2018.
Accessed: Sep. 17, 2023. [Online]. Available: https://hal.science/hal-
02074918
[14] N. Esmaili, B. A. Norman, and J. Rajgopal, “Shelf-space optimization
models in decentralized automated dispensing cabinets,” Operations
Research for Health Care, vol. 19, pp. 92-106, Dec. 2018, doi:
10.1016/j.orhc.2018.03.005.
[15] K. Hachemi and H. Alla, “Affectation de médicaments dans un système
automatisé de dispensation de médicaments : approche basée sur la
synthèse de contrôleur par réseau de Petri,” Oct. 2013.
[16] K. Hachemi and S. Amari, “Analytical solving of the storage location
assignment problem in drug dispensing systems based on a Min-Plus
control approach,” Journal of Control Engineering and Applied
Informatics, vol. 26, no. 3, Art. no. 3, Sep. 2024, doi:
10.61416/ceai.v26i3.9015.
[17] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl,
“Neural Message Passing for Quantum Chemistry,” arXiv preprint
arXiv:1704.01212, 2017.
[18] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive Representation
Learning on Large Graphs,” in Proceedings of the 31st International
Conference on Neural Information Processing Systems (NIPS’17),
Curran Associates Inc., 2017, pp. 1025-1035.
[19] T. N. Kipf and M. Welling, “Semi-Supervised Classification with
Graph Convolutional Networks,” arXiv preprint arXiv:1609.02907,
2017.
[20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y.
Bengio, “Graph Attention Networks,” arXiv preprint
arXiv:1710.10903, 2018.
[21] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How Powerful are Graph
Neural Networks?,” arXiv preprint arXiv:1810.00826, 2019.
IDEAS: National Conference of Innovation on Data Engineering and AI Science.
June 11-12, 2025, Oran, Algeria.
Feature Extraction and Machine Learning for
Classification Date Fruit
Ikram Kourtiche1, Mostefa Bendjima2, Mohammed El Amin Kourtiche3
1 2 3 Tahri Mohamed University, Mathematics and Computer Science Department Laboratory of TIT Bechar,
Algeria
1kourtiche.ikram@univ-bechar.dz,
2bendjima.mostefa@univ-bechar.dz,
3kourtiche.amin@univ-bechar.dz
Abstract: Dates are important in many parts of the world,
particularly in North Africa and the Middle East. As a highly
nutritious fruit with strong demand in both local and international
markets, the classification and quality control of dates play a
crucial role in enhancing their commercial value. This work
focuses on improving date fruit classification by applying data
augmentation techniques to enrich the original dataset and then
employing three pre-trained CNN models, ResNet50,
EfficientNetB0, and DenseNet201, for feature extraction. The
extracted features were then classified using traditional machine
learning algorithms: Support Vector Machine (SVM), K-Nearest
Neighbors (KNN), Logistic Regression (LR), and Random Forest
(RF). The best performance was achieved using ResNet50 as a
feature extractor with logistic regression for classification,
reaching an accuracy of 97.42%.
Keywords: Date fruit, classification, feature extraction, pre-
trained CNN, machine learning.
I. INTRODUCTION
In recent years, the rapid progress of artificial intelligence
(AI) has brought about significant transformations across a wide
range of sectors, including agriculture [1]. AI has become an
indispensable tool for addressing complex agricultural
challenges by offering innovative solutions that enhance both
efficiency and sustainability on a global scale [2]. Among these
advancements, deep learning has been instrumental in
revolutionizing various agricultural practices, including fruit
classification [3]. One fruit that has garnered increasing attention
in this context is the date, known for its high nutritional value,
rich in carbohydrates, minerals, and vitamins, and recognized
for its potential health benefits, such as reducing the risk of
cancer and cardiovascular diseases. Globally, date production is
substantial, with an estimated annual output of approximately
8.46 million tons [4].
In recent years, many studies have been published on the
classification of date fruits. A date fruit classification system was
developed in [5] to identify six date types, with features
extracted by CNN models from a dataset of 2246 images.
Among MobileNetV1, Inception, and ResNet,
MobileNetV1 had the highest accuracy (82.67%). In [6], transfer
learning was employed to classify images using the pre-trained
models MobileNetV2, VGG19, and ResNet50. The VGG19
model achieved the best classification accuracy (95%)
among these models.
Altaheri et al. [7] introduced a machine vision framework for
classifying date fruits according to their type, maturity, and harvest
readiness in a natural orchard setting. This framework leverages
deep convolutional neural networks (CNNs) and transfer learning
to achieve high classification accuracy, utilizing a dataset of 8000
images. Notably, the framework achieved a type classification
accuracy of 99.01%.
In [4], researchers evaluated various algorithms, including
Decision Tree, K-Nearest Neighbors (KNN), and Support Vector
Machines (SVM), for classifying seven date varieties. The neural
network model yielded the highest accuracy at 93.85%.
Alsirhani et al. [8] presented a deep transfer learning approach
for the classification of 27 distinct date varieties using a dataset
of 3228 images. By fine-tuning a DenseNet201 model, the
researchers attained a test accuracy of 95.21%.
The study in [9] investigated a comprehensive
dataset comprising 8,000 images of five distinct date fruit
varieties. The performance of four pre-trained deep learning
models, GoogleNet, ResNet-50, DenseNet, and AlexNet, was
evaluated on this dataset. The results indicate that ResNet-50
outperformed the other models, achieving an accuracy rate of
97.37%.
In our study, we propose a method for classifying date fruits
using feature extraction from three pre-trained convolutional
neural network models: ResNet50, DenseNet201, and
EfficientNetB0. The features extracted from each model are
subsequently classified using various machine learning
algorithms, including Support Vector Machine (SVM), K-
Nearest Neighbors (KNN), Logistic Regression (LR), and
Random Forest (RF).
The remainder of this paper is structured as follows: After
the introduction, we describe materials and methods used in this
study. The next section presents the experimental results,
followed by a discussion of the findings. Finally, the conclusion
is drawn in the last section.
II. MATERIAL AND METHODS
A. Pre-trained Models
1) ResNet-50: a residual neural network with 50 layers,
constructed by sequentially stacking residual blocks.
The architecture has 48 convolutional layers, one max-pooling
layer, and one average-pooling layer. ResNet-50 is a widely used
image classification model [10].
2) EfficientNet-B0: a convolutional neural network
optimized for high performance with fewer parameters. It uses
depthwise separable convolutions and squeeze-and-excitation
(SE) modules, gradually decreasing spatial resolution and
increasing the number of channels. The architecture balances
depth, width, and input resolution, achieving high accuracy at low
computational cost [10].
3) DenseNet201: a convolutional neural network with
direct feed-forward connections between layers, which mitigates
gradient degradation and overfitting in deep learning applications.
Its architecture enriches the inputs of each layer, reduces the
number of parameters, and improves performance. DenseNet201, the
201-layer version, uses this compact architecture to
yield models that are easy to train and highly
efficient [11].
B. Classification methods
1) Support Vector Machine (SVM): a well-established
traditional approach in machine learning, commonly used for
both classification and regression tasks. It works by
mapping the data features into a higher-dimensional space to
establish a boundary, or hyperplane, for classification. The SVM
identifies a linear discriminant function that maximizes the
margin between different classes of data. Support vectors,
which are the data points closest to the classification boundary,
play a crucial role in defining this boundary. SVM is well known
for its accuracy and versatility, making it a popular choice in
many applications [12].
2) Random Forest (RF): an ensemble method that combines the
predictions of many decision trees. The decision tree method is
extensively employed for categorizing large datasets and
identifying data that share common traits. It involves dividing
the data into smaller subsets iteratively, culminating in the
construction of a structured tree that includes both decision
nodes and leaf nodes, yielding the final classification
outcomes [12].
3) K-Nearest Neighbors (KNN): The operational principle of
the KNN classifier is direct and intuitive: it assigns categories to
samples based on the classes of their nearest neighbors. This
classification method, known as memory-based classification,
requires storing training samples in memory
for reference during analysis [13]. In this paper, the
parameter k is set to 9.
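The voting principle with k = 9 can be sketched as follows. The two-dimensional feature vectors and the Euclidean distance are assumptions chosen for illustration; in the study, the inputs are the much higher-dimensional features extracted by the pre-trained CNNs, and the function name is hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=9):
    """Predict the label of `query` by majority vote among its k nearest training samples."""
    # Indices of training samples, ordered by Euclidean distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D features: six samples per class, in two well-separated clusters.
train_X = [(0.1 * i, 0.0) for i in range(6)] + [(5.0 + 0.1 * i, 5.0) for i in range(6)]
train_y = ["Ajwa"] * 6 + ["Sokari"] * 6
label = knn_predict(train_X, train_y, (0.2, 0.1), k=9)
```

With k = 9, the query near the first cluster collects six votes for "Ajwa" and three for "Sokari", so the majority label is "Ajwa".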
4) Logistic Regression (LR): a commonly used statistical
method for modeling the probability of a binary outcome
based on one or more explanatory variables. Its primary goal
is to estimate the coefficients of a linear model that relates the
logarithm of the odds (log-odds) to the independent variables
[14].
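In symbols, the relationship between the log-odds and the explanatory variables can be written as below, where $p$ is the probability of the positive class, $\mathbf{x}$ is the feature vector, and $\beta_0$, $\boldsymbol{\beta}$ are the coefficients to be estimated (the symbol names are a notational assumption). For more than two classes, such as the nine date varieties considered here, a multinomial or one-vs-rest extension is commonly used; which variant was applied is not specified in the text.

```latex
\log\frac{p}{1-p} = \beta_0 + \boldsymbol{\beta}^\top \mathbf{x},
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x})}}
```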
C. Dataset
This dataset, referred to as the Saudi Arabian Dataset,
consists of 1658 images, each depicting one of nine date fruit
varieties native to Saudi Arabia: Ajwa, Galaxy, Medjool, Nabtat
Ali, Sokari, Rutab, Shaishe, Sugaey, and Meneifi, as shown in
Fig. 1. A controlled environment was constructed to take
pictures of the nine varieties. The imaging setup consists of a
mounted DSLR camera (Canon EOS 550D) with the flash
enabled and a ring light with a 48-centimeter diameter and 240 LED
bulbs set to 100% brightness. The ring light was used to eliminate
shadows by surrounding the date with light on all sides; the flash
on the camera provides a strong, sudden light to the center to
emphasize the fleshiness or flabbiness of the date [15].
Fig. 1 Samples of date fruit dataset images
D.
Data augmentation
A significant aspect of this study is the use of data
augmentation to enhance model performance. Data
augmentation is a crucial strategy in machine learning that
involves artificially increasing the size and diversity of a dataset
by applying various transformations to the existing data. In this
context, several augmentation techniques were employed,
including:
1)
Rescaling: The process of adjusting the size of
images.
2)
Random zoom: Modifies the image scale to simulate
varying distances.
3)
Flipping: Involves mirroring the images to create
variations.
4)
Width and height shifts: Slightly reposition the
images to account for different positions of the fruit in the frame.
5)
Random rotations: Rotate the images at various
angles.
These techniques are designed to improve the model's ability
to generalize by exposing it to a wider variety of data
representations. By augmenting the datasets in this manner, the
study aims not only to enhance classification accuracy but also
to ensure that the models can effectively recognize and
differentiate between various date varieties. After applying data
augmentation, the dataset comprises 3460 images of Saudi
Arabian date fruit.
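Three of the transformations listed above can be sketched on a tiny image stored as a list of pixel rows. This is a simplified illustration under stated assumptions: the rotation is fixed at 90 degrees and the shift at one pixel, whereas the study applies random rotations, shifts, and zooms, and the function names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror each row of the image (a list of rows of pixel values)."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift the content to the right, padding the left edge with `fill`."""
    width = len(image[0])
    return [([fill] * pixels + row)[:width] for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

# Toy 2 x 3 "image"; each number stands for one pixel intensity.
image = [[1, 2, 3],
         [4, 5, 6]]

# The original plus three transformed copies form the augmented set.
augmented = [image, horizontal_flip(image), shift_right(image, 1), rotate_90(image)]
```

Each transformed copy keeps the label of its source image, which is how augmentation enlarges the dataset without new photographs.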
After augmentation, the dataset was divided into two
subsets: 80% for training and 20% for testing.
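The 80/20 division can be sketched as a shuffled split over image identifiers. The fixed seed, the helper name `train_test_split`, and the use of integer identifiers in place of image files are assumptions for illustration; the paper does not describe how the split was drawn.

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# 3460 placeholder identifiers, matching the augmented dataset size.
image_ids = list(range(3460))
train_ids, test_ids = train_test_split(image_ids)
```

With 3460 images this yields 2768 training images and 692 test images.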
III.
EXPERIMENTAL RESULTS AND DISCUSSIONS
A.
Evaluation metrics used
In our study, we used several evaluation metrics to assess
the performance of our models.
Precision, recall, F1-score, and accuracy were computed from
the predicted classes using the following
quantities: the number of false negatives (FN), false positives
(FP), true negatives (TN), and true positives (TP). The
metrics are defined as follows:
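$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1)
$\text{Precision} = \dfrac{TP}{TP + FP}$ (2)
$\text{Recall} = \dfrac{TP}{TP + FN}$ (3)
$\text{F1-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (4)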
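The four metrics can be computed directly from the TP, TN, FP, and FN counts using their standard definitions. The sketch below covers the binary case with illustrative counts that are not results from this study; for the nine-class problem, per-class values would have to be averaged, and the averaging scheme is not stated in the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1-score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only (not results from this study).
accuracy, precision, recall, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```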
B.
Results
Our experiments were run from a computer with an Intel(R)
Core(TM) i5-6300U CPU and 8 GB of RAM, using Kaggle,
a cloud-based platform that enables users to write and execute
Python code directly in their web browsers. Kaggle is
particularly advantageous for machine learning, data analysis,
and deep learning tasks, as it offers GPU support for accelerated
computation. This environment facilitates efficient
experimentation and model training by providing access to
powerful resources and tools tailored for data science
applications.
The results of our study are summarized in Tables I, II, and
III. Table I indicates that, with ResNet50 features, the LR
classifier achieved the highest performance among the compared
algorithms, with a testing accuracy of 97.42%, a recall of
97.42%, an F1-score of 97.42%, and a precision of 97.44%. These
results highlight the effectiveness of feature extraction using
ResNet50. Table II presents the results after feature extraction
using EfficientNetB0, where the best performance was again
achieved with the LR classifier, reaching an accuracy
of 97.27%. Table III shows the results after feature
extraction using DenseNet201, where the LR classifier obtained
the highest accuracy of 96.84%.
Logistic Regression outperforms all other models across the
three feature extraction methods (ResNet-50, EfficientNetB0,
DenseNet201). Its strong performance is likely due to the
extracted features being well-structured and linearly separable.
LR remains a simple, efficient choice for this kind of
classification task.
TABLE I. PERFORMANCE METRICS FOR RESNET-50 EXTRACTED FEATURES
Models | Accuracy | Recall | Precision | F1-score
SVM    | 93.26%   | 93.26% | 93.33%    | 93.25%
LR     | 97.42%   | 97.42% | 97.44%    | 97.42%
KNN    | 85.51%   | 85.51% | 86.28%    | 85.61%
RF     | 89.81%   | 89.81% | 89.87%    | 89.79%
TABLE II. PERFORMANCE METRICS FOR EFFICIENTNETB0 EXTRACTED FEATURES
Models | Accuracy | Recall | Precision | F1-score
SVM    | 94.26%   | 94.26% | 94.37%    | 94.26%
LR     | 97.27%   | 97.27% | 97.32%    | 97.28%
KNN    | 88.24%   | 88.24% | 88.63%    | 88.29%
RF     | 90.67%   | 90.70% | 90.67%    | 90.63%
TABLE III. PERFORMANCE METRICS FOR DENSENET201 EXTRACTED FEATURES
Models | Accuracy | Recall | Precision | F1-score
SVM    | 94.12%   | 94.12% | 94.13%    | 94.11%
LR     | 96.84%   | 96.84% | 96.91%    | 96.85%
KNN    | 88.52%   | 88.52% | 89.11%    | 88.52%
RF     | 93.11%   | 93.11% | 93.24%    | 93.52%
Figures 2, 3, and 4 show the confusion matrices for the
best-performing method (LR) using features extracted with
ResNet50, EfficientNetB0, and DenseNet201, respectively.
Fig. 2 Confusion matrix for LR using ResNet-50
Fig. 3 Confusion matrix for LR using EfficientNetB0
Fig. 4 Confusion matrix for LR using DenseNet201
Our proposed approach is compared with several recent state-
of-the-art techniques in Table IV. It achieves the highest
accuracy among them while using a dataset with nine different
types of date fruit.
TABLE IV. COMPARISON WITH STATE-OF-THE-ART METHODS
Ref       | Year | Technique                                                                                              | Date types | Best accuracy
[5]       | 2021 | Various pre-trained models (MobileNet, Inception, and ResNet)                                          | 6          | MobileNetV1: 82.67%
[9]       | 2021 | GoogleNet, ResNet50, DenseNet, and AlexNet                                                             | 5          | ResNet50: 97.37%
[14]      | 2021 | Stacking model created by combining LR and ANN                                                         | 7          | 92.80%
[16]      | 2019 | Feature extraction + combination of several hidden layers                                              | 3          | 97.20%
[17]      | 2019 | VGG16                                                                                                  | 4          | 96.98%
Our study | -    | Feature extraction using ResNet50, DenseNet201, EfficientNetB0 and several machine learning algorithms | 9          | Feature extraction using ResNet50 + LR: 97.42%
IV.
CONCLUSION
The objective of this study was to develop a classification
system for date fruits by utilizing feature extraction from three
pre-trained convolutional neural network models: ResNet50,
EfficientNetB0, and DenseNet201. The extracted features were
subsequently classified using traditional machine learning
algorithms, including Support Vector Machine (SVM), Logistic
Regression (LR), Random Forest (RF), and K-Nearest
Neighbors (KNN). This research aims to support and improve
agricultural practices related to date fruit classification.
For future work, we plan to apply this approach to other
agricultural products, with the aim of improving classification
accuracy.
References
[1] S. Sukkasem, W. Jitsakul, and P. Meesad, "Fruit
Classification with Deep Transfer Learning using Image
Processing," in 2023 7th International Conference on
Information Technology (InCIT), Chiang Rai, Thailand: IEEE,
Nov. 2023, pp. 464-469. doi:
10.1109/InCIT60207.2023.10413036.
[2] S. Meghwanshi, "ARTIFICIAL INTELLIGENCE IN
AGRICULTURE: A REVIEW," Open Access, vol. 06,
no. 03.
[3] H. S. Gill and B. S. Khehra, "An integrated approach using
CNN-RNN-LSTM for classification of fruit images,"
Materials Today: Proceedings, vol. 51, pp. 591-595, 2022,
doi: 10.1016/j.matpr.2021.06.016.
[4] Ö. Özaltin, "Date Fruit
Classification by Using Image Features Based on Machine
Learning Algorithms," Research in Agricultural Sciences,
vol. 55, no. 1, pp. 26-35, Jan. 2024, doi:
10.5152/AUAF.2024.23171.
[5] Md. A. Khayer, Md. S. Hasan, and A. Sattar, "Arabian Date
Classification using CNN Algorithm with Various Pre-
Trained Models," in 2021 Third International Conference
on Intelligent Communication Technologies and Virtual
Mobile Networks (ICICV), Tirunelveli, India: IEEE, Feb.
2021, pp. 1431-1436. doi:
10.1109/ICICV50876.2021.9388413.
[6] H. Bichri, A. Chergui, and M. Hain, "Image Classification
with Transfer Learning Using a Custom Dataset:
Comparative Study," Procedia Computer Science, vol.
220, pp. 48-54, 2023, doi: 10.1016/j.procs.2023.03.009.
[7] H. Altaheri, M. Alsulaiman, and G. Muhammad, "Date Fruit
Classification for Robotic Harvesting in a Natural
Environment Using Deep Learning," IEEE Access, vol. 7,
pp. 117115-117133, 2019, doi:
10.1109/ACCESS.2019.2936536.
[8] A. Alsirhani, M. H. Siddiqi, A. M. Mostafa, M. Ezz, and A.
A. Mahmoud, "A Novel Classification Model of
Date Fruit Dataset Using Deep Transfer
Learning," Electronics, vol. 12, no. 3, p. 665, Jan. 2023, doi:
10.3390/electronics12030665.
[9] A. Al-Sabaawi, R. I. Hasan, M. A. Fadhel, O. Al-Shamma, and
L. Alzubaidi, "Employment of Pre-trained Deep Learning
Models for Date Classification: A Comparative Study," in
Intelligent Systems Design and Applications, A.
Abraham, V. Piuri, N. Gandhi,
P. Siarry, A. Kaklauskas, and A. Madureira, Eds.,
Advances in Intelligent Systems and Computing, vol.
1351. Cham: Springer International Publishing, 2021, pp.
181-189. doi: 10.1007/978-3-030-71187-0_17.
[10] N. Ahmed, M. Rahman, and F. Ishrak, "Comparative
Performance Analysis of Transformer-Based Pre-Trained
Models for Detecting Keratoconus Disease."
[11] S. Dümen, E. Kavalcı Yılmaz, K. Adem, and E. Avaroglu,
"Performance of vision transformer and swin
transformer models for lemon quality classification in fruit
juice factories," Eur Food Res Technol, vol. 250, no. 9, pp.
2291-2302, Sept. 2024, doi: 10.1007/s00217-024-04537-5.
[12] T. S. Xian and R. Ngadiran, "Plant Diseases Classification
using Machine Learning," J. Phys.: Conf. Ser., vol. 1962, no.
1, p. 012024, July 2021, doi: 10.1088/1742-6596/1962/1/012024.
[13] H. Y. Bayram, H. Bingol, and B. Alatas, "Hybrid Deep Model
for Automated Detection of Tomato Leaf Diseases," TS,
vol. 39, no. 5, pp. 1781-1787, Nov. 2022, doi:
10.18280/ts.390537.
[14] M. Koklu, R. Kursun, Y. S. Taspinar, and I. Cinar,
"Classification of Date Fruits into Genetic Varieties Using
Image Analysis," Mathematical Problems in Engineering,
vol. 2021, pp. 1-13, Nov. 2021, doi: 10.1155/2021/4793293.
[15] W. Alhamdan and J. M. Howe, "Classification of date fruits in
a controlled environment using convolutional neural
networks," in Advanced Machine Learning Technologies
and Applications, vol. 9 (1), Springer, 2021,
pp. 154-163. doi: 10.1007/978-3-030-69717-4_16.
[16] A. Magsi, J. Ahmed Mahar, and S. H. Danwar,
"Date Fruit Recognition
using Feature Extraction Techniques and Deep
Convolutional Neural Network," Indian Journal of Science
and Technology, vol. 12, no. 32, pp. 1-12, Aug. 2019, doi:
10.17485/ijst/2019/v12i32/146441.
[17] A. Nasiri, A. Taheri-Garavand, and Y.-D. Zhang, "Image-
based deep learning automated sorting of date fruit,"
Postharvest Biology and Technology, vol. 153, pp. 133-141,
July 2019, doi:
10.1016/j.postharvbio.2019.04.003.
Collaborative business process: A formal
verification and validation
1st Hanane Ouaar
Computer Science Department, Biskra University
LINFI Laboratory
Biskra, Algeria
hanane.ouaar@univ-biskra.dz
Abstract: Formal validation and verification of business
processes (BP) form the basis of the current work. The goals
are the design, specification, and implementation of a
simulation application for automobile assembly lines. The case
study was selected because it offers several perspectives on a
number of company operations, including the administrator's
perspective, which permits component setting, account
administration, and system configuration. From the technical
team's perspective, identifying damage, completing on-site
maintenance, publishing the currently tracked measures, and
modifying reports to understand the state of the existing system
are all important. The supply chain and robotic arms are the
components of the system's business processes identified during
the analysis phase. UML diagrams are used in the design phase.
The specification phase uses the UPPAAL software tool to develop
a timed automaton in order to formally validate the system
behavior through CTL, system synchronization, business process
simulation, and validation of the majority of properties.
Suitable software tools for creating the simulation system are
used throughout the implementation phase.
Index Terms: Business processes, assembly-line cars, robotic
arms, supply chain application, UML, UPPAAL, CTL, simulation.
I.
INTRODUCTION
Business process management (BPM) [3] is a discipline for
managing the lifecycle of business processes (BP) [1], from the
modeling stage to process enactment and improvement, while
accounting for the many stakeholders involved.
Both the computer science and business administration
communities have recently paid close attention to BPM.
BPM is a well-established field that drives business success
by means of effective and efficient business processes.
Capability frameworks, which outline and group the
capability areas pertinent to implementing or orienting
processes in businesses, are a standard way to structure BPM.
In addition, by facilitating better
coordination between technology and human resources,
BPM seeks to help businesses become more effective.
In terms of company operations, it improves visibility and
streamlines procedures [2]. As a result, businesses that
embrace BPM techniques and technology get a quick return
on investment and improve the efficiency of their current
systems [3]. Moreover, BP provide a means of coordinating
interactions between workers and organizations in a structured
way. However, the dynamic nature of the modern business
environment means that some BP should be externalized,
i.e., the organization should accept new BP from outside or let
local BP move beyond its boundary [1]. The challenge is
therefore to provide flexibility and to offer external process
support at the same time. Moreover, current BPM suffers from
limitations in optimization due to the lack of good monitoring
methods, although controlling the internal and external BP
involved is what achieves both
the company's business strategy and its global objectives [3].
Since its features are best suited to our modeling, the
assembly line [4] has been selected as a case study of a
modernized information system. Additionally, because of the
dynamic nature of its business processes and their various
qualities, it lends itself to a synchronized simulation
application. The work is organized in four phases. During the
analysis phase, the public and private business processes of the
automobile assembly line are identified and their constituent
parts are described. During the design phase, we concentrate on
leveraging UML models to describe business process management
graphically in a simple way; the car assembly line is used as a
case study, with different kinds of UML diagrams (static
behavior and dynamic behavior) used to identify and develop its
business processes [5]. During the specification phase, we use
UPPAAL [6] as the software tool to evaluate and validate our
system as a network of timed automata. Model checking [7] with
CTL (Computation Tree Logic) [8] is used to formally prove the
properties of the modeled system. Lastly, the implementation
phase delivers services that are easy to handle and accessible
to all users through various reporting and configuration
stages, and presents the application as a simulation system of
the business processes.
II.
OVERVIEW
Several paradigms have been used for the formalization and
verification of BP models, such as colored Petri nets, Pi-
calculus, and timed automata. Some related works that used
these paradigms are summarized below.
The authors of [10] use the Petri net model to describe BP
together with their resource consumption in order to verify BP
properties in cloud computing. Their aim is to verify the
efficiency of the initial resource provisioning between
different BP services and the partial elasticity of BP in cloud
computing based on the initial allocation of resources. They
suggest a verification methodology based on a formal model to
verify resource consumption properties and select services with
low resource consumption, in order to reduce the cost of
elasticity of BP resources in cloud computing.
To support business process reengineering (BPR) efforts,
the authors of [11] proposed a framework based on high-level
Petri nets. This framework is used to model and analyze business
processes. The use of high-level Petri nets provides advanced
analysis techniques and sophisticated software tools.
The authors of [12] developed a model based on Colored Petri
Nets (CPN), the Interactive Business Process Fusion (IBPF) net,
which is adept at identifying logical vulnerabilities during the
design phase. However, the analysis methods for the IBPF net
still need innovation. To address this issue, they
used dynamic slicing techniques to analyze the IBPF net as
a method for revealing logical vulnerabilities. They obtain
backward slices, partial forward slices, and bidirectional slices
through slicing algorithms; these three types
of slices are then merged to form the final dynamic slice. This
technique, which involves a more targeted analysis than
examining the entire IBPF net, simplifies the analysis process
and prevents state space explosion. The results of this research
are of value
in enhancing system reliability, reducing maintenance costs,
and providing analysis techniques in the field of e-business
security.
The authors of [13] investigate how to leverage Model Learning
(ML) algorithms for the automated discovery of deterministic
finite automata (DFAs) from event logs. DFAs can be used as a
fundamental building block to support not only the development
of process analysis techniques but also the implementation of
instruments to support other phases of the Business Process
Management (BPM) lifecycle, such as business process
design and enactment. The quality of the discovered DFAs
is assessed with customized definitions of fitness, precision,
and generalization, and a standard notion of DFA simplicity.
Finally, they use these metrics to benchmark ML algorithms
against real-life and synthetically generated datasets, with
the aim of studying their performance and investigating their
suitability for the development of BPM tools.
The authors of [14] propose a novel semantic-based e-business
contract model, Simple Natural Contract (SimNC), to represent a
universal contract created by a Supervised Sentence Contract
(SSC), which is input via a Semantic Input Method
(SIM) with strict grammar from a human-understandable
natural language contract. Then, the SSC is analyzed through
Machine Natural Language (MNL) to enhance contract
semantic understanding by enabling case grammar for
parties using different languages. In doing so, SimNC analyzes
various deontic components and combines them with the
operational aspects of a legal contract to achieve a common
and better understanding between hard code and natural
language. In addition, they apply SimNC to a Network
of Timed Automata (NTA) to support automation, which
builds a formal model including temporal constraints and
then translates it into an executable SimNC-NTA model. This
work aims to provide a bridge between natural language
contracts and e-business contracts, making them universal
and intelligible.
In [15], Pi-calculus is chosen as the modelling and
analysis means for cross-organizational business processes.
Furthermore, on the basis of Pi-calculus, the deadlock
verification method of process is proposed, and the formal
descriptions of several typical reduction rules are presented.
Finally, a case study is presented, and the result shows that the
proposed method can achieve the deadlock detection of large-
scale and complicated cross-organizational business processes.
III.
CASE STUDY: CAR ASSEMBLY LINE
A car assembly line is selected as the case study. This part
introduces the system components, defines them, and models
their business processes using UML. UPPAAL is then
used for the specification and formal verification of this
system: the synchronization of its timed automata is
elaborated, all possible paths are defined, and CTL is used to
check whether the properties of the timed automata hold.
Although this study takes a long time, it brings many benefits.
This part also defines all the actions and components that are
needed, taking into account the possible sources of risk, the
robotics, and the safety system that protects each stage.
A.
Analysis
A car assembly line is a manufacturing process, often
called progressive assembly, in which parts (usually
interchangeable parts) are added in sequence as the
semi-finished assembly moves from workstation to workstation
until the final assembly is produced.
However, these systems are considered highly critical in terms
of time (synchronization), risk (robots, automatic arms), and
cost (expensive maintenance). For these reasons, the behavior of
the whole system needs to be simulated before the real, physical
system is put in place.
1)
Car assembly line definition: A car assembly line
contains two principal business processes (BP), robotic arms
and the supply chain [4]:
- Robotic arms: machines that are programmed to execute
a specific task or job quickly, efficiently, and extremely
accurately. Generally motor-driven, they are most often
used for the rapid, consistent performance of heavy
and/or highly repetitive procedures over extended periods
and are especially valued in the industrial production,
manufacturing, machining, and assembly sectors.
- Supply chain: operates on three levels: strategic, tactical,
and operational. While the strategic level is generally
about improving network resources such as network
design, location, and facility count determination, tactical
decisions deal with the mid-term, including production levels
in all factories, assembly policy, inventory levels, and lot
sizes.
2)
Motivations: A car assembly line has been chosen as
a case study of a modernized information system, and this
choice is backed by the following motivations:
- For the first time, an automobile assembly application has
been built with a view to increasing production.
- Faults are easy to locate quickly, without wasting time.
- Its business processes are easy to distinguish, which
leads to a better understanding of how our modeling will proceed.
- The work is clear, which makes it easy to synchronize the
business process components according to time.
- Its characteristics are the most suitable for our
modeling.
B.
Conception
This part provides the system modeling using UML diagrams.
Fig. 1 shows the use case diagram of a car assembly line
with a chain; it represents the functionalities that the
different actors can perform, namely the technician and his
relationship with the robotic arms and the supply chain.
Fig. 1. Use case diagram
The roles of the main actors and the use cases associated
with each actor are presented in the following table:
TABLE I
ACTOR AND USE CASES DESCRIPTION
Formal specifications
This part studies the system specification by means of a
transition system (timed automaton).
1)
Formal specification software tool: UPPAAL [4] is
an integrated tool environment for the modeling, validation, and
verification of real-time systems modeled as networks of
timed automata, extended with data types (bounded integers,
arrays, etc.). It was jointly developed by the universities of
Uppsala (Sweden) and Aalborg (Denmark). It allows the
analysis of networks of timed automata that communicate through
binary synchronization channels and broadcast channels. The
automata are extended with integer variables, clock arrays,
urgency, etc. Transitions manipulate two kinds of variables:
clocks, which evolve synchronously over time, and bounded
discrete variables. A state of the automaton may carry
a condition on the clocks, called an invariant, which must be
satisfied for as long as the automaton stays in this state. A
transition of the automaton is labeled with:
- A guard, which expresses a condition on the values of the
variables (true by default). This condition should generally
be compatible with the invariant of the source state of the
transition, and it must be satisfied for the transition to be taken.
- A synchronization of the form c! or c? on a channel c; the
absence of synchronization indicates an internal action of the
automaton.
- A reset of some clocks and an update of certain integer variables.
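The interplay of invariants, guards, and resets described above can be sketched in a few lines of Python. This is a simplified illustration, not the UPPAAL semantics in full: it uses a single clock `x`, checks the invariant only at the end of a delay (sufficient for upper-bound invariants such as `x <= 4`), and the location names "working" and "idle" are hypothetical.

```python
class TimedAutomaton:
    """Minimal timed automaton with one clock `x` (illustrative only)."""

    def __init__(self, initial, invariants):
        self.location = initial
        self.invariants = invariants      # location -> predicate over clocks
        self.clocks = {"x": 0.0}

    def _invariant_holds(self, location, clocks):
        return self.invariants.get(location, lambda c: True)(clocks)

    def delay(self, d):
        """Let d time units pass; refused if the invariant would be violated."""
        advanced = {name: value + d for name, value in self.clocks.items()}
        if not self._invariant_holds(self.location, advanced):
            return False
        self.clocks = advanced
        return True

    def fire(self, source, target, guard=lambda c: True, resets=()):
        """Take an edge if we are in `source`, the guard holds, and the
        target invariant is satisfied after the clock resets."""
        if self.location != source or not guard(self.clocks):
            return False
        updated = dict(self.clocks, **{name: 0.0 for name in resets})
        if not self._invariant_holds(target, updated):
            return False
        self.location, self.clocks = target, updated
        return True

# Hypothetical example: the arm may stay in "working" for at most 4 time units.
arm = TimedAutomaton("working", {"working": lambda c: c["x"] <= 4})
too_long = arm.delay(5)     # refused: would violate the invariant x <= 4
waited = arm.delay(3)       # accepted: x is now 3
early = arm.fire("working", "idle", guard=lambda c: c["x"] >= 4)  # guard fails
arm.delay(1)                # x is now 4
moved = arm.fire("working", "idle", guard=lambda c: c["x"] >= 4, resets=("x",))
```

The invariant bounds how long the automaton may wait, the guard decides when the edge becomes available, and the reset restarts the clock for the next location.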
2)
Formal Description: The system modeled in this study
is a simplified assembly line system. It
includes the following three synchronized processes:
- Assembly line: this system contains two subsystems, the
robotic arms and the supply chain, and synchronizes them.
- Robotic arm: this system starts from the required iron
plates and, in subsequent steps, assembles the form of the
car from these iron plates.
- Supply chain: this system moves from one step to the next.
Declaration of variables and of the assembly line system in
UPPAAL: The UPPAAL tool is made up of three main parts:
- A graphical editor, where timed processes can be described.
- A graphical simulator, which gives a view of the
behavior of the system.
- A verifier, which checks the different
properties.
The editor itself is made up of two parts:
Declaration: contains integer variables, clocks, synchronization
channels, and constants.
    chan movebras, prenves, tachves, rutilise, finbras, stopbras, stopchain,
         movechain, finchain, removechain, removebras;
    clock x;
    bool mrc;
System declaration: contains the processes.
    // Place template instantiations here.
    brasd = bras();
    chemad = chem();
    asemblylin = asemblyline();
    // List one or more processes to be composed into a system.
    system asemblylin, brasd, chemad;
3)
Business Process Actions:
- movebras: start of the robotic arm movement.
- prenves: load the required installation tools.
- tachves: install the required tools.
- rutilise: restart of the robotic arm movement.
- finbras: end of the robotic arm movement.
- stopbras: stop of the arm movement for a short time.
- stopchain: stop of the chain movement for a short time.
- movechain: start of the chain movement.
- finchain: end of the chain movement.
- removechain: restart of the chain movement.
- removebras: restart of the arm movement.
- chnmove: movement counter of the supply chain.
- position: movement counter of the robotic arm.
? : the action is synchronized with another one; this operator
means that the subsystem has to wait for another subsystem
to trigger the action.
! : this operator means that the action is triggered by this
part of the system.
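The meaning of the `!` and `?` operators can be sketched as a binary handshake: a step is possible only when one process offers `c!` and another offers `c?` on the same channel, and both then move together. In the sketch below, the state names (`a0`, `a1`, `s0`, `s1`) and the choice of the assembly line as the sender of `movebras` are assumptions for illustration, not details taken from the model.

```python
def enabled_syncs(processes):
    """List (channel, sender, receiver) triples where one process offers
    `channel!` and a different process offers `channel?` in its current state."""
    pairs = []
    for s_name, s in processes.items():
        for label in s["edges"].get(s["state"], {}):
            if not label.endswith("!"):
                continue
            channel = label[:-1]
            for r_name, r in processes.items():
                if r_name != s_name and channel + "?" in r["edges"].get(r["state"], {}):
                    pairs.append((channel, s_name, r_name))
    return pairs

def step(processes, channel, sender, receiver):
    """Fire the `!` edge and the `?` edge together, as one atomic step."""
    for name, mark in ((sender, "!"), (receiver, "?")):
        p = processes[name]
        p["state"] = p["edges"][p["state"]][channel + mark]

# Hypothetical fragment: the assembly line triggers the robotic arm.
processes = {
    "asemblylin": {"state": "a0", "edges": {"a0": {"movebras!": "a1"}}},
    "brasd": {"state": "s0", "edges": {"s0": {"movebras?": "s1"}}},
}
offers = enabled_syncs(processes)
step(processes, *offers[0])
```

If the receiver were not in a state offering `movebras?`, the sender would have no enabled step and would have to wait, which is the blocking behavior the `?` operator describes.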
4)
Formal model specification: An assembly line is made
up of three processes that synchronize with each other:
the assembly line, the robotic arms, and the supply chain.
They are modeled as finite-state automata as follows:
Business process assembly line: this process is the
assembly line system; it contains 10 states and is
synchronized with the two other subsystems, robotic arm and
supply chain (Fig. 2).
Fig. 2. Business process models of assembly line
Business process robotic arms: this process is the
robotic arms system; it contains 6 states and is
synchronized with the assembly line subsystem (Fig. 3).
Fig. 3. Business process models of robotic arm
Business process supply chain: this process is the supply
chain system; it contains 6 states and is synchronized with the
assembly line subsystem (Fig. 4).
Fig. 4. Business process models of supply chain
Business Process Synchronization: the three processes
(assembly line, robot arms, and supply chain) synchronize with
one another. For example, in the synchronization between "robot
arms" and "supply chain", "stop-bras" sends an action and
"delete-bras" receives this action in order to check whether it
has been carried out.
Business Process Guards: guards express conditions
on the clock variables and other variables that must be met.
Formally, a guard is a combination of time constraints
and constraints on integer variables. For example, in the
"chain" subsystem, there is a guard (vair-true) between states
e1 and e2.
Business Process Reset operation: resetting a clock or
variable on a transition re-initializes its value. For example,
in the "bras" subsystem, on the transition between s1 and s2,
after the 4 minutes (in the "chain" process), the clock is reset
to zero in order to measure how long the arm valve stays closed
before the system restarts.
5)
Verification and validation of the formal system model:
The purpose of verification is to ensure that a program
satisfies a set of properties. Model checking is an automatic
formal verification technique that requires the behavior of the
system to be modeled formally. The temporal
logic CTL serves as the property
specification language. Once the system is described as a
transition system and the required properties are specified
in the temporal logic, a model checking algorithm
automatically answers the question, "Does the system satisfy the
required properties?". We have written the properties of the
assembly line formal model in the CTL temporal logic, as shown
in Fig. 5.
Using the model checking algorithm as a formal verification
technique to prove the safety of this formal specification
against the CTL properties of the assembly line system, the
satisfaction of the requested properties is proved.
Fig. 5. Formal specification of assembly line CTL
On the other hand, UPPAAL is used as a simulator to
validate the behavior of this system, in order to reveal
problems such as infinite loops or blocking; the simulation
confirms the validity of the system. Formal methods involve
the application of mathematical techniques to design and
implement software. A formal specification is expressed in
a language whose syntax and semantics are formally defined.
Model-based specification uses set theory, functions, and
logic to develop abstract models of the system.
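The question "does the system satisfy the required properties?" can be illustrated for the simplest class of properties, those that must hold in every reachable state (written `A[] p` in the UPPAAL query language). The sketch below explores an untimed transition system exhaustively; the three-state abstraction of the arm and the chain is hypothetical and much smaller than the timed model verified in this work.

```python
from collections import deque

def reachable_states(initial, successors):
    """Breadth-first exploration of all states reachable from `initial`."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in successors.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def check_always(initial, successors, prop):
    """'A[] prop': prop must hold in every reachable state."""
    return all(prop(s) for s in reachable_states(initial, successors))

def deadlock_free(initial, successors):
    """'A[] not deadlock': every reachable state has at least one successor."""
    return all(bool(successors.get(s)) for s in reachable_states(initial, successors))

# Hypothetical untimed abstraction: (arm state, chain state) pairs.
successors = {
    ("idle", "moving"): [("idle", "stopped")],
    ("idle", "stopped"): [("working", "stopped")],
    ("working", "stopped"): [("idle", "moving")],
}
initial = ("idle", "moving")
never_both_active = check_always(
    initial, successors,
    lambda s: not (s[0] == "working" and s[1] == "moving"))
no_deadlock = deadlock_free(initial, successors)
```

A real model checker additionally handles clocks and richer CTL operators, but the principle of checking a property over the full reachable state space is the same.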
D.
Implementation
This part is dedicated to the implementation of the code, using
the Java programming language [9] and the development
environments used to build our application for simulating the
car assembly line system. Using the information from the
modeling, it covers the different steps of code generation,
focusing on the transition from the semi-formal models (UML)
and formal paradigms (CTL) already developed towards
programming, and presents the different scenarios of this
simulation system through multiple graphical user interfaces
(GUI).
In order to allow the manager to operate all of the system's
settings, Fig. 6 presents the parameter-setting interface,
Fig. 7 presents an interface with a line chart of the factory
production rate, and Fig. 8 presents a report interface for the
factory production rate.
Fig. 6. Interface of parameter setting
Through the use of formal specification and validation
in this system, several benefits obtained before
implementation have been observed:
- A bug-free system can be designed before it is built in
reality.
- If there are errors, the defect can be found in a short time.
- All components and elements (methods, attributes, conditions,
etc.) are selected in a well-defined way.
- The specification takes a long time, but this study enables
us to build a system in a short time.
- Services are distributed, tasks are clarified, and
interfaces are specified.
Fig. 7. Interface line chart of factory production rate
Fig. 8. Interface report of factory production rate
IV.
CONCLUSION
The aim of this work was to simulate and formally define a
critical synchronous system as a collection of business
processes. Because it offers a set of suitable internal and
external business management processes for creating a system
that ensures the best quality and is most suitable for vehicle
assembly, we applied the current contributions to a case
pertaining to an automobile assembly and manufacturing agency.
Three life cycle engineering phases are followed in this work.
First, independently of the programming language, the design
phase makes use of semi-formal models like UML, which combine
structured description and behavior, to understand the problems
and to represent and model objects using the following
diagrams: a use case diagram that shows our system's
functionality from various perspectives
(technician/administrator); a class diagram for
the database coding phase; and finally, the sequence diagram
and the activity diagram, each of which depicts the dynamics
of our business processes, whether private or public. Second,
the specification step makes use of the UPPAAL tool, which
enables the creation, validation, verification, and simulation
of synchronized timed finite-state automata in the following
ways: formal description of the supply chain, robotic arm, and
assembly line as synchronized timed automata; declaration in
UPPAAL of the variables, business process actions, and assembly
line system; formal verification of the properties of the
modeled system using CTL model checking; analysis
of the benefits of the formal specification; and validation of
the formal system model (simulation).
Third, in order to create the best simulation application for a
set of public business processes of the car assembly line,
the implementation phase of our work makes use
of several programming languages and guarantees logical
consistency with the previously created diagrams and
requirements. In subsequent work, we intend to enhance this
project by adding more features, creating
an Ubuntu version, and making this application widely
accessible online.
that illustrates the static structure of our system and aids in
REFERENCES
[1] Weske, M., ‘Business Process Management: Concepts, Languages, and
Architecture’, third edition, Springer-Verlag GmbH Germany, pp. 3, 2019.
[2] Hammer, M. Introduction. In Jan vom Brocke and Michael Rosemann
(2nd ed.), “What is business process management?” In Springer-Verlag
Berlin Heidelberg (Ed.). Handbook on Business Process Management,
Methods, and Information Systems, Cambridge, USA, pp. 6, 3, 2015.
[3] Rosemann, M., ‘An Exploration into Future Business Process
Management Capabilities in View of Digitalization’, Georgi Dimov Kerpedzhiev.
[4] Assembly-line website: (https://www.inboundlogistics.com/articles/assembly-line/),
visited on 22/12/2024.
[5] UML website: (https://www.uml.org/), visited on 03/12/2024.
[6] UPPAAL website: (https://uppaal.org/), visited on 20/12/2024.
[7] Stefan, S., “Model Checking Concepts,” PPT course, ENSIIE, 2024.
[8] Massimo, B., Laura, B., Fabio, M., “Full Characterisation of Extended
CTL*”, Università di Napoli Federico II, 31st International Symposium
on Temporal Representation and Reasoning (TIME 2024).
[9] Java website: (https://www.java.com/), visited on 22/12/2024.
[10] Mohammedn, N. L., Nabil, H., Ramdane, M., “Resources consumption
analysis of business process services in cloud computing using Petri
Net”, Journal of King Saud University Computer and Information
Sciences, pp. 408.
[11] van der Aalst W.M.P., van Hee K.M., “Business process redesign: A
Petri-net-based approach”, Computers in Industry, Volume 29, Issues 1–2,
July 1996, pp. 15.
[12] Wangyang, Y., Jie, F., Lu, L., Xiaojun, Z., and Yumeng, C., “Enhancing
security in e-business processes: Utilizing dynamic slicing of Colored
Petri Nets for logical vulnerability detection”, Future Generation
Computer Systems, Volume 158, September 2024, pp. 210.
[13] Simone, A., Francesco, C., Fabrizio, M.M., Andrea, M., Fabio, P.,
“Process mining meets model learning: Discovering deterministic finite state
automata from event logs for business process analysis”, Information
Systems, Volume 114, March 2023, pp. 1.
[14] Peng, Q., Quanyi, H., Menglin, C., “Towards machine-readable
semantic-based E-business contract representations using Network of
Timed Automata (NTA)”, Future Generation Computer Systems, Volume
158, September 2024, pp. 457.
[15] Xin, Y., Xinghua, B., Chao, Z., “The reduction and deadlock detection
of cross-organizational business process based on Pi-calculus”, Procedia
Engineering, Volume 15, 2011, pp. 3487.
IDEAS: National Conference of Innovation on Data Engineering and AI Science.
June 11-12, 2025, Oran, Algeria.
Predicting Forest Fires in Algeria: A New Approach
Houda EL BOUHISSI, Naima ILLOUL
University of Bejaia, Faculty of Exact Sciences, LIMED Laboratory, 06000, Bejaia, Algeria
Abstract: Forest fires have emerged as a major concern,
drawing international attention, especially in Algeria. They
are increasingly recognized by the global community as one of
the most critical security challenges of our time. This study
examines the current protective measures in place to combat
these fires and evaluates their effectiveness in preserving
the country’s environment from devastating damage.
Numerous forecasting methods exist, with those leveraging
artificial intelligence, particularly machine learning and
related technologies, being the most widely used. These
AI-driven techniques have led to the development of
adaptive and reliable systems across various domains, especially
in predictive modeling. In this work, we apply such methods
to forecast forest fires. The aim of this paper is to introduce
a novel hybrid approach that combines machine learning
with bioinspired algorithms to enhance forest fire prediction.
Experimental results show that integrating a bioinspired
algorithm improves the performance of the machine learning
model, raising accuracy from 85% to 87% in our experiments.
Keywords: machine learning, Kaggle, datasets, forest fire
prediction, logistic regression.
I. INTRODUCTION
Forest fires are among the world’s most dangerous natural
disasters. They cause catastrophic losses to forest ecosystems
and pose a serious threat to human safety and property.
Forest fires can cause devastating damage to ecosystems,
animals, and human habitats. They destroy vast areas of
forest, resulting in the loss of biodiversity and natural
habitats. Many animals are killed or displaced, and
endangered species are under even greater threat. Algeria is
one of the countries most affected by these disasters each year.
The considerable risk associated with these events has
led to significant concern among stakeholders, who are
questioning the effectiveness of protection measures against
these powerful fires and their ability to safeguard the
country’s environment.
Forest fires also disrupt local economies and damage homes and
infrastructure. Communities often suffer emotionally
and financially from the loss of property and livelihoods.
Recovery from such disasters takes years and requires
significant resources. Preventive measures and awareness are
crucial to reducing the frequency and severity of forest fires.
Despite the efforts made by the protection services to prevent
them, forest fires remain a major risk for the country’s
environment and the safety of its population. The damage
and danger left behind by these fires worry officials and
associations in the country, who are trying to find
immediate solutions by providing all the necessary equipment.
Experience over the years shows that, despite the immediate
intervention of the protection services, fires still cause
significant damage, and the country remains imperiled.
For this reason, building a forest fire prediction system
is a promising way to prevent risks and reduce damage.
The aim of this paper is to propose a hybrid approach
to predict forest fires in Algeria using artificial
intelligence techniques. Central to the study is a prediction
system that classifies the possibility of forest fires into two
categories (fire and non-fire). We present this hybrid
technique together with its results. Our objective is to
implement an efficient prediction system that combines
supervised machine learning with bioinspired algorithms.
The rest of the paper is organized as follows. Section II
offers an overview of the most significant approaches to
forest fire prediction. In Section III, we present our
classification approach in detail. Section IV presents an
empirical study of the proposed approach to assess its
performance and efficiency. Finally, Section V concludes the
paper and outlines opportunities for future work.
II. RELATED WORKS
In the context of data analysis, some methodologies employ
machine learning algorithms to make predictions, while others
use artificial neural networks and deep learning to improve
the precision of predictions.
In the following, we present the main works.
The authors in [1] present a predictive model based on
a decision tree for forest fire prediction in Algeria.
The data was collected from two regions of northern Algeria:
Sidi-Bel-Abbes and Bejaia. Three meteorological attributes
that influence fire occurrence are used, namely temperature,
relative humidity, and wind speed. The results show that the
decision tree is suitable for this purpose, since it performs
well and can be translated into a rule-based form.
Another approach is proposed in [2]. The authors use
historical fire occurrence data from the Fire Information for
Resource Management System. Meteorological and
topographic variables are then derived and processed to
create high-resolution maps. These maps serve
as an effective decision-support tool for analyzing fire
behavior.
The proposal of [3] consists of creating a system that
integrates weather data. An exploratory analysis was
conducted, followed by preprocessing aimed at eliminating
noisy data and converting categorical variables into
numerical ones, thereby enhancing the clarity of the dataset.
The techniques employed for prediction include Random Forest,
Decision Trees, Support Vector Regression, and Naive Bayes.
The authors in [4] proposed an approach using logistic
regression to predict forest fire risk in the Lijiang region.
This approach makes it possible to assess the influence
of various factors on fire risk, such as
topography (altitude, slope, orientation), vegetation, and
weather conditions (precipitation, temperature, wind,
humidity).
In addition, the authors of [5] propose a parallel SVM method
for reliable forest fire prediction. The data used consists
of weather data from the Indian region. This type of solution
can help detect fires before they destroy the whole forest
and simplifies the prediction of these fires.
An interesting approach is proposed in [6]: a fire prediction
system, Agni, that uses satellite images. By integrating
artificial intelligence and supervised learning of neural
networks with satellite remote sensing technology, Agni
optimizes the use of satellite images for forecasting
high-risk fire areas. The model has demonstrated consistent
performance through extensive evaluations.
Taken together, these papers describe forest fire prediction
methods in detail and report promising results.
The reviewed works present a variety of approaches to
forest fire prediction, ranging from simple decision trees to
advanced AI-based systems. Some methods focus on
using basic meteorological data, offering easy
interpretation but limited accuracy due to the
exclusion of other important factors like vegetation or
topography.
Others incorporate a wider range of variables and produce
high-resolution risk maps, which are valuable for planning
but may lack real-time responsiveness.
Different studies emphasize data preprocessing and model
comparison, yet they often overlook a clear analysis of
performance differences.
While logistic regression allows for understanding factor
influence, it may not capture complex patterns effectively.
More recent approaches integrate satellite imagery and AI,
showing promising results but requiring significant
computational resources.
Overall, the works complement each other, but shared
challenges remain, such as ensuring model generalization,
balancing complexity with usability, and achieving timely
predictions.
The method used in this study combines logistic regression (LR)
for prediction with Particle Swarm Optimization (PSO) for
feature selection.
We aim to implement an efficient and useful prediction
system based on machine learning. Due to the importance of the
data features, we use the logistic regression algorithm.
Next, we present our approach in detail.
III. PROPOSED APPROACH
The methodology proposed in this study aims to predict
forest fires in Algeria using a hybrid approach that
combines LR and PSO.
The system classifies forest regions as either “fire” or
“non-fire” risk, based on meteorological conditions.
The system architecture is presented in Fig. 1 and
involves four steps.
The first step involves data collection.
The second step concerns data processing and includes
several phases (cleaning, transformation, etc.).
The third step is feature selection using PSO.
Finally, the last step is the prediction process.
In the following, we describe these steps in detail.
Fig. 1. System architecture.
Data Collection:
The process begins with gathering relevant data. In this
research, meteorological and environmental data were
collected from two forested regions in northern Algeria:
Bejaia and Sidi-Bel-Abbes [1].
These regions were selected due to their historical
vulnerability to forest fires, providing valuable and relevant
data for the study.
Data Preprocessing:
An initial analysis was performed to understand the structure
and characteristics of the dataset. This included examining
statistical distributions, identifying missing values, and
detecting potential outliers.
A key component of this step is the creation of a heat
map showing the correlation coefficients between variables,
which helped to reveal strong relationships between certain
features and the target variable (fire occurrence).
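The correlation heat map described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `correlation_heatmap`, the column names, and the synthetic data are assumptions standing in for the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def correlation_heatmap(df, target):
    """Return the Pearson correlation matrix, a feature ranking, and the figure."""
    corr = df.select_dtypes("number").corr()
    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax, label="Pearson r")
    fig.tight_layout()
    # Rank features by the absolute value of their correlation with the target.
    ranking = corr[target].drop(target).abs().sort_values(ascending=False)
    return corr, ranking, fig


# Synthetic stand-in data (hypothetical values, not the Algerian dataset).
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(32, 4, n)
rh = rng.normal(60, 15, n)
ws = rng.normal(15, 3, n)
fire = ((temp - 0.1 * rh + rng.normal(0, 2, n)) > 26).astype(int)
demo = pd.DataFrame({"Temperature": temp, "RH": rh, "Ws": ws, "fire": fire})

corr, ranking, fig = correlation_heatmap(demo, "fire")
```

The `ranking` series gives a quick textual view of which features correlate most strongly with the target, which complements the visual heat map.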
Before training the model, the data went through several
preprocessing phases:
- Data Cleaning: managing missing values, correcting
inconsistencies, and filtering out noisy or irrelevant data.
- Data Transformation: converting categorical variables into
numerical formats (e.g., using one-hot encoding), normalizing
or standardizing continuous features, and performing
dimensionality reduction if needed.
- Splitting the Dataset: the processed dataset was then divided
into two subsets. Training Set: used to train the model by
allowing it to learn patterns from historical data. Test
Set: used to evaluate the model's performance on
unseen data, ensuring its ability to generalize well.
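The preprocessing phases above can be sketched with pandas and scikit-learn. This is a hedged illustration under stated assumptions: the helper name `preprocess_and_split`, the median imputation, the 80/20 stratified split, and the synthetic columns are all choices made for the example; the paper does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_and_split(df, target, categorical, test_size=0.2, seed=42):
    """Clean, encode, split, and standardize a tabular dataset."""
    # Cleaning: drop duplicate rows and fill missing numeric values (median).
    df = df.drop_duplicates().copy()
    numeric = [c for c in df.select_dtypes("number").columns if c != target]
    df[numeric] = df[numeric].astype(float)
    df[numeric] = df[numeric].fillna(df[numeric].median())
    # Transformation: one-hot encode the categorical columns.
    df = pd.get_dummies(df, columns=categorical, dtype=float)
    X, y = df.drop(columns=[target]), df[target]
    # Splitting: stratify so both subsets keep the class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    X_train, X_test = X_train.copy(), X_test.copy()
    # Standardize continuous features using training-set statistics only.
    scaler = StandardScaler().fit(X_train[numeric])
    X_train[numeric] = scaler.transform(X_train[numeric])
    X_test[numeric] = scaler.transform(X_test[numeric])
    return X_train, X_test, y_train, y_test


# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "region": rng.choice(["Bejaia", "Sidi-Bel-Abbes"], n),
    "Temperature": rng.normal(32, 4, n),
    "RH": rng.normal(60, 15, n),
    "fire": rng.integers(0, 2, n),
})
demo.loc[[3, 17], "Temperature"] = np.nan  # simulate missing values

X_train, X_test, y_train, y_test = preprocess_and_split(demo, "fire", ["region"])
```

Fitting the scaler on the training subset only keeps test-set statistics out of the model, which is one common way to avoid leakage when estimating generalization.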
Feature Selection:
Particle Swarm Optimization (PSO) is a bio-inspired
optimization algorithm used to find optimal solutions in
complex search spaces [7]. When we apply PSO to the forest
fire dataset [1], PSO can be used to select the most relevant
features (e.g., temperature, humidity, wind, rain) that contribute
significantly to predicting fire occurrences or burned areas
(figure 2).
Fig. 2. How PSO works [8].
Each particle in the swarm represents a candidate feature
subset, encoded as a binary vector indicating which features are
included. The fitness of each particle is evaluated by training
the classifier on the selected features and measuring its
predictive performance. Each particle then updates its position
based on its own best-found solution and the best-known global
solution in the swarm. This cooperative behaviour allows PSO to
explore the feature space effectively and reduces the risk of
becoming trapped in local optima.
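The paper does not spell out the update rule it uses. As a sketch, assuming the widely used binary PSO variant with an inertia weight, the velocity and position updates for feature j of particle i can be written as below. All symbols here are assumptions introduced for illustration: w is the inertia weight, c_1 and c_2 are acceleration coefficients, r_1, r_2, and u_{ij} are uniform random numbers in [0, 1], p_{ij} is the particle's personal best, and g_j is the global best.

```latex
\begin{aligned}
v_{ij}^{t+1} &= w\,v_{ij}^{t}
  + c_1 r_1 \left(p_{ij} - x_{ij}^{t}\right)
  + c_2 r_2 \left(g_{j} - x_{ij}^{t}\right), \\
x_{ij}^{t+1} &=
  \begin{cases}
    1 & \text{if } u_{ij} < \dfrac{1}{1 + e^{-v_{ij}^{t+1}}}, \\
    0 & \text{otherwise.}
  \end{cases}
\end{aligned}
```

Here x_{ij} = 1 means that feature j is included in the subset proposed by particle i, so the sigmoid of the velocity acts as the probability of selecting that feature.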
By iterating over generations, PSO converges toward a
near-optimal feature subset that maximizes predictive performance
while minimizing feature count. In forest fire prediction, this
results in simpler, faster, and more interpretable models. It helps
in identifying the environmental variables most critical for early
fire detection or damage estimation.
Redundant or irrelevant features (e.g., noise variables or
highly correlated inputs) tend to be excluded during the
process. PSO can thus contribute to better generalization, lower
computation cost, and enhanced decision-making in forest fire
management. This makes it a valuable tool in environmental
data analysis and risk forecasting systems.
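The feature-selection loop described in this section can be sketched as a small binary PSO wrapped around logistic regression. This is an illustrative sketch, not the authors' implementation: the swarm size, iteration count, coefficients `w`, `c1`, `c2`, the penalty weight `alpha`, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def fitness(mask, X, y, alpha=0.01):
    """Cross-validated LR accuracy minus a small penalty on the feature count."""
    if not mask.any():
        return 0.0  # an empty subset cannot be used for prediction
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, X[:, mask], y, cv=3).mean()
    return acc - alpha * mask.sum() / mask.size


def binary_pso(X, y, n_particles=10, n_iter=15, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = rng.random((n_particles, d)) < 0.5          # binary feature masks
    vel = rng.uniform(-1, 1, (n_particles, d))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, y) for p in pos])
    g = pbest_fit.argmax()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, d))
        x = pos.astype(float)
        # Pull each particle toward its personal best and the global best.
        vel = (w * vel
               + c1 * r1 * (pbest.astype(float) - x)
               + c2 * r2 * (gbest.astype(float) - x))
        vel = np.clip(vel, -6, 6)
        # Sigmoid of the velocity gives the probability of selecting a feature.
        pos = rng.random((n_particles, d)) < 1.0 / (1.0 + np.exp(-vel))
        fit = np.array([fitness(p, X, y) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = pbest_fit.argmax()
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest, gbest_fit


# Synthetic stand-in data (hypothetical, not the Algerian dataset).
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)
mask, score = binary_pso(X, y)
```

The penalty term in `fitness` is one simple way to express the trade-off between predictive performance and feature count; other weightings are equally plausible.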
Prediction:
LR is applied to the training data to build the predictive
model. After training, the model is evaluated on the test set, and
performance metrics such as accuracy, precision, recall, and
F1-score are calculated to assess its effectiveness.
Our approach is based on logistic regression, a supervised
classification algorithm that is effective for binary outcome
prediction.
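The training and evaluation step can be sketched with scikit-learn as follows. The data here is synthetic and the split ratio is an assumption, so the resulting scores say nothing about the results reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same number of rows as the dataset (244).
X, y = make_classification(n_samples=244, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit logistic regression on the training set and predict on the test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The F1-score is the harmonic mean of precision and recall, so it only rises when both improve together.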
IV. EXPERIMENT AND EVALUATION
To validate the effectiveness of our predictive model, we
performed a series of experiments on a meteorological dataset
gathered from the regions of Bejaia and Sidi-Bel-Abbes in
Algeria [1]. This dataset is a structured collection of
meteorological and fire-related data compiled to facilitate the
prediction of forest fires in Algeria. It encompasses daily
observations from June to September 2012, focusing on two
regions: Bejaia in the northeast and Sidi-Bel-Abbes in the
northwest. Below is a detailed description of the dataset:
- Total Instances: 244
  - Bejaia: 122 instances
  - Sidi-Bel-Abbes: 122 instances
- Time Frame: June to September 2012
- Class Distribution:
  - Fire Occurrences: 138 instances
  - No Fire Occurrences: 106 instances
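As a quick sanity check on these counts, the short sketch below computes the class proportions and the accuracy that a trivial classifier would reach by always predicting the majority class. The variable names are illustrative; only the counts come from the dataset description.

```python
# Counts taken from the dataset description above.
n_fire, n_no_fire = 138, 106
total = n_fire + n_no_fire

fire_share = n_fire / total
# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(n_fire, n_no_fire) / total

print(f"total instances: {total}")
print(f"fire share: {fire_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

This baseline is a useful reference point when reading accuracy figures on a moderately imbalanced dataset.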
The dataset comprises 12 attributes (Table I), including
meteorological variables, Fire Weather Index (FWI)
components, and a target class label:
TABLE I
DATASET ATTRIBUTES [1]
No. | Attribute | Description
1 | Date | Observation date in DD/MM/YYYY format
2 | Temperature | Temperature at noon in degrees Celsius
3 | Relative Humidity (RH) | Percentage of humidity
4 | Wind Speed (Ws) | Wind speed in km/h
5 | Rain | Total daily rainfall in mm
6 | Fine Fuel Moisture Code (FFMC) | Represents the moisture content of litter and fine fuels
7 | Duff Moisture Code (DMC) | Indicates the moisture content of loosely compacted organic layers
8 | Drought Code (DC) | Reflects the moisture content of deep, compact organic layers
9 | Initial Spread Index (ISI) | Combines wind and FFMC to estimate the rate of fire spread
10 | Buildup Index (BUI) | Combines DMC and DC to represent the total amount of fuel available
11 | Fire Weather Index (FWI) | Indicates the potential fire intensity
12 | Classes | Binary classification indicating fire occurrence: 'fire' or 'not fire'
The experiment was performed in a Python environment using
libraries such as Scikit-Learn, Pandas, and Matplotlib. After
pre-processing, the data was split into training and test sets,
the main features were selected, and the LR model was then
applied for prediction.
We ran experiments using LR alone and LR combined with PSO,
and compare the two approaches below.
Both models use the Algerian Forest Fires Dataset [1],
which includes meteorological observations and Fire
Weather Index components from the Bejaia and
Sidi-Bel-Abbes regions.
TABLE II
EXPERIMENT RESULTS [1]
Metric | LR (%) | LR + PSO (%)
Accuracy | 85.00 | 87.00
Precision | 83.00 | 85.00
Recall | 86.00 | 89.00
F1-Score | 84.00 | 87.00
The results of applying PSO to LR for forest fire
prediction show improvements across all evaluation
metrics:
- Accuracy increased from 85% to 87%, indicating that
the optimized model makes fewer overall classification
errors.
- Precision increased from 83% to 85%, meaning
the model is better at minimizing false positives,
which is crucial for avoiding unnecessary fire alerts.
- Recall improved from 86% to 89%, showing that the
model detects more actual fire events, reducing
the risk of missing dangerous situations.
- The F1-score increased from 84% to 87%,
demonstrating a better balance between detecting
fires and maintaining prediction reliability.
These gains suggest that PSO selects the most relevant
environmental features, such as temperature, wind, and
humidity, while discarding noisy or redundant data.
As a result, the LR model becomes more focused, interpretable,
and robust. The reduced feature set also lowers computational
costs, making the model faster.
V. CONCLUSION
In this paper, we propose a model for forecasting forest
fires in Algeria. To this end, we have used the logistic
regression algorithm as the underlying framework. The
present study focuses on the regions of Bejaia and
Sidi-Bel-Abbes.
The methodology is based on meteorological data accessible
via the Kaggle platform. The gathered information was
incorporated into a comprehensive plan that included data
collection, analysis, model construction, and forecasting.
Logistic regression was chosen for several reasons: it is easy
to implement, its results are clear and straightforward to
interpret, and it achieved good precision in the
classification process.
Experimental results show that the proposed approach can
accurately differentiate between fire-prone and
non-fire-prone conditions based on meteorological
characteristics. This predictive capability is crucial for the
rapid deployment of preventive measures and firefighting
resources.
Additionally, incorporating meteorological data into
existing early warning systems is highly relevant, as it
supports the effective management of risks associated with
climatic phenomena.
The system is effective even in the absence of substantial
resources. Moreover, it is particularly advantageous because
it can integrate with pre-existing early warning systems, even
in contexts where resources are limited.
In future work, we will consider other hybrid approaches based
on deep learning and bioinspired algorithms to improve
accuracy.
REFERENCES
[1] F. Abid and N. Izeboudjen, “Predicting forest fire in Algeria using
data mining techniques: Case study of the decision tree algorithm,” in
Proc. Int. Conf. Adv. Intell. Syst. Sustain. Dev.,
pp. 363–370, Springer, 2019.
[2] I. Elkhrachy et al., “Sentinel-1 remote sensing data and hydrologic
engineering centres river analysis system two-dimensional integration
for flash flood detection and modelling in New Cairo City, Egypt,”
J. Flood Risk Manag., vol. 14, no. 2, p. e12692, 2021.
[3] T. Preeti, S. Kanakaraddi, A. Beelagi, S. Malagi, and A. Sudi, “Forest
fire prediction using machine learning techniques,” in Proc. Int.
Conf. Intell. Technol. (CONIT), pp. 1–6, IEEE, 2021.
[4] L. Si et al., “Study on forest fire danger prediction in plateau
mountainous forest area,” Nat. Hazards Res., vol. 2, no. 1, pp. 25–32, 2022.
[5] K. R. Singh, K. P. Neethu, K. Madhurekaa, A. Harita, and P. Mohan,
“Parallel SVM model for forest fire prediction,” Soft Comput. Lett.,
vol. 3, p. 100014, 2021.
[6] B. Zheng et al., “Increasing forest fire emissions despite the decline
in global burned area,” Sci. Adv., vol. 7, no. 39, p. eabh2646,
2021.
[7] H. El Bouhissi, A. Ziane, L. Rahmani, M. Medbal, and M. Kostiuk,
“RF-PSO: An optimized approach for diabetes prediction,” in ICST,
pp. 227–238, 2023.
[8] Q. Nizamani, A. A. Hashmani, Z. H. Leghari, Z. A. Memon, H. M. Munir,
T. Novak, and M. Jasinski, “Nature-inspired swarm intelligence
algorithms for optimal distributed generation allocation: A
comprehensive review for minimizing power losses in distribution
networks,” Alexandria Engineering Journal, vol. 105, pp. 692–723, 2024.
[9] R. Bekka, S. Kherbouche, and H. El Bouhissi, “Distraction detection
to predict vehicle crashes: A deep learning approach,”
Computación y Sistemas, vol. 26, no. 1, pp. 373–387, 2022.
Deep Learning-Based Classification of Knee
Osteoarthritis Using Gaussian Noise Augmentation
and Knowledge Distillation
1st Khadidja Messaoudene
LIMOSE Laboratory
University M’Hamed Bougara of Boumerdes
Boumerdes, Algeria
k.messaoudene@univ-boumerdes.dz
2nd Khaled Harrar
LIST Laboratory
University M’Hamed Bougara of Boumerdes
Boumerdes, Algeria
khaled.harrar@univ-boumerdes.dz
Abstract—Knee osteoarthritis (KOA) is a degenerative joint
disease characterized by cartilage deterioration, leading to pain,
stiffness, and impaired joint function. Accurate detection and
grading are crucial for early intervention, but challenges such as
limited data, redundant features, and suboptimal classification
performance hinder reliable diagnostic tools. This study pro-
poses an effective pipeline for classifying KOA into Kellgren-
Lawrence (KL) grades 0 (no OA) and 2 (moderate OA) using
data augmentation, deep feature extraction, and refined feature
selection. The method was tested on 688 knee radiographs
from the Osteoarthritis Initiative (OAI), with regions of interest
(ROIs) extracted and augmented using Gaussian noise. Deep
features were obtained via DenseNet-201, followed by knowledge
distillation for feature selection, and classification was performed
using a fine Gaussian Support Vector Machine (GSVM) with
5-fold cross-validation. The pipeline achieved 94.5% accuracy
and a 96% AUC, whereas omitting feature selection reduced
accuracy to 82%, and excluding augmentation lowered it to
88%, underscoring their importance. The integration of Gaussian
noise augmentation, DenseNet-201, and knowledge distillation
significantly enhanced classification performance, demonstrating
strong potential for improving automated diagnostic systems and
supporting early KOA detection and clinical decision-making.
Index Terms—knee OsteoArthritis, X-ray images, Knowledge
distillation, DenseNet-201, GSVM.
I. INTRODUCTION
Knee Osteoarthritis (KOA) is a prevalent degenerative joint
disease characterized by the gradual deterioration of articular
cartilage, leading to pain, stiffness, and reduced mobility.
It is the most common form of arthritis, affecting millions
of individuals worldwide, particularly those over the age of
50 [1]. Epidemiological studies indicate that the incidence
of KOA is increasing, influenced by factors such as aging
populations, obesity, and joint injuries [2].
Radiographic imaging plays a crucial role in the diagnosis
and evaluation of KOA. This imaging modality is essential
for assessing structural changes in the knee joint, such as
cartilage loss, bone marrow lesions, osteophyte formation,
and joint space narrowing. The severity of KOA is typi-
cally classified using grading systems such as the Kellgren-
Lawrence (KL) scale [3], which categorizes the disease into
stages ranging from 0 (no radiographic features of OA) to
4 (severe OA with extensive joint damage). Accurate staging
is vital for determining appropriate treatment strategies and
monitoring disease progression. However, manual assessment
of OA severity through imaging is often subjective and
time-consuming, leading to variability in diagnosis and staging.
To address these challenges, there is a growing necessity for
the development and implementation of automatic detection
and classification methods through advanced image processing
techniques. Deep learning [4] has emerged as a transformative
approach in medical imaging, offering enhanced accuracy and
efficiency in disease detection and classification. By leveraging
large datasets and advanced neural networks, deep learning
models can identify subtle patterns and features in medical
images that are often imperceptible to the human eye. This
capability is particularly beneficial for KOA diagnosis, where
early detection and precise classification are paramount for
effective treatment and improved patient outcomes.
Several recent studies have explored automated classifica-
tion approaches for knee osteoarthritis (OA) diagnosis using
various feature extraction techniques and classifiers. Janvier
et al. [5] employed fractal analysis coupled with logistic
regression, achieving an accuracy of 73%. In 2018, Riad
et al. [6] applied the Dual-Tree Complex Wavelet Transform
(DTCWT) with an SVM-RBF classifier, reporting a higher
accuracy of 80.38%. Brahim et al. [7] utilized Power Spec-
tral Density (PSD) features alongside logistic regression and
obtained an accuracy of 78.92%. More recently, Ribas et
al. [8] implemented a convolutional neural network (CNN)-
based approach, achieving an accuracy of 81.69% on the
same dataset. These methods demonstrate the evolution from
traditional texture-based techniques to deep learning models,
with CNNs showing promising improvements in classification
performance.
II. MATERIALS AND METHODS
A. Dataset
The dataset used in our experiment was obtained from
the publicly accessible OAI [9]. The dataset comprises 688
radiographs of the knee, specifically focusing on the medial
ROI of the tibia. These radiographs have been categorized
using the Kellgren and Lawrence rating system (KL0, KL2).
We compared grade KL0 (no OA) with grade KL2 (mild OA).
The figure below shows the ROI used in our work.
Fig. 1. ROI used
B. Methods
This section outlines the workflow of our framework for
KOA classification, as illustrated in Figure 2. The pipeline begins
with Gaussian noise-based augmentation to mitigate class
imbalance in the dataset of 688 knee radiographs (evenly split
between KL grades 0 and 2) from the OAI dataset. Following
preprocessing and ROI extraction to focus on anatomically
relevant regions, the method employs DenseNet-201 for deep
feature extraction, leveraging its dense connectivity pattern
for comprehensive feature representation. These features are
then refined via knowledge distillation (KD) to prioritize the
most discriminative characteristics while reducing redundancy.
Finally, the optimized feature set is classified using a fine
Gaussian Support Vector Machine (GSVM), selected for its
effectiveness with high-dimensional medical data.
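The final classification stage, a Gaussian-kernel SVM evaluated with 5-fold cross-validation, can be sketched with scikit-learn. This is a hedged illustration: the feature matrix is synthetic, and the narrow kernel width (`kernel_scale`, set here to the square root of the feature count divided by 4) is an assumed stand-in for a "fine" Gaussian setting, not a value stated in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the selected deep features (hypothetical sizes).
X, y = make_classification(n_samples=200, n_features=32, n_informative=8,
                           random_state=0)

# Assumed narrow kernel width; a smaller scale gives a "finer" decision boundary.
kernel_scale = np.sqrt(X.shape[1]) / 4
gamma = 1.0 / kernel_scale ** 2

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1.0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", round(float(scores.mean()), 3))
```

Placing the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into the model.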
1) Data augmentation: A major challenge in KOA detec-
tion is the limited availability of annotated medical imaging
data, which can restrict the performance of diagnostic models.
To address this limitation while ensuring the preservation of
critical pathological features, we employ Gaussian noise-based
data augmentation. This technique enhances dataset diversity
by introducing controlled variations that mimic real-world
imaging noise characteristics while maintaining the structural
integrity of radiographic findings [10].
Gaussian noise data augmentation (GNDA) adds Gaussian
noise to the original dataset. This noise is modelled as a
Gaussian distribution with a mean (µ) of zero and a given
variance (σ²), which preserves the distribution of the original
data while introducing variability.
GNDA can be represented by the following equation:
$$ \tilde{X} = X + \epsilon \tag{1} $$
where:
$X$ represents the original data sample.
$\epsilon$ denotes the Gaussian noise, a random variable drawn
from $N(0, \sigma^2)$.
$\tilde{X}$ is the resultant augmented data point.
The Gaussian distribution, defined by the probability density
function (PDF), is given as:
$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{2} $$
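The augmentation step in Eq. (1) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the toy 2×2 image, the noise level `sigma=0.01`, and the number of noisy copies are all assumptions made for the example.

```python
import random

def add_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of a 2-D image: each pixel gets eps ~ N(0, sigma^2)."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]

def augment(images, copies, sigma, seed=0):
    """Keep every original image and append `copies` noisy variants of it."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    augmented = []
    for image in images:
        augmented.append(image)
        for _ in range(copies):
            augmented.append(add_gaussian_noise(image, sigma, rng))
    return augmented

# Hypothetical 2x2 "radiograph" with intensities in [0, 1] (illustrative only)
roi = [[0.10, 0.20], [0.30, 0.40]]
dataset = augment([roi], copies=3, sigma=0.01)
```

Because the noise has zero mean and a small variance, each augmented sample stays close to its source image, which is the property the text relies on to preserve radiographic structure.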
2) Feature extraction: DenseNet201 [11] is a convolutional
neural network known for its dense connectivity, where each
layer receives inputs from all previous layers, enhancing
feature reuse and gradient flow. It has 201 layers organized
into dense blocks and transition layers, enabling efficient and
rich feature extraction from images. Typically, input images
are resized to 224×224 pixels, and the model outputs
high-dimensional feature maps, which can be used for various
classification tasks. DenseNet201’s architecture reduces the
vanishing gradient problem and improves learning efficiency,
making it effective for extracting detailed and discriminative
features in applications like medical image analysis and object
recognition.
3) Feature selection: Feature-based KD involves transferring
internal representations from a teacher model to a
student model [12], allowing the student to learn the intricate
structures and relationships embedded in the teacher’s feature
maps (Figure 3). This method offers a more comprehensive
knowledge transfer compared to merely replicating output
probabilities.
Fig. 3. The generic teacher-student framework for KD
The loss function in feature-based KD is designed to align
the intermediate feature representations of the teacher and
student networks, typically formulated as:
Fig. 2. The proposed methods
$$ L_{\text{feature}} = \frac{1}{N} \sum_{i=1}^{N} \left\| F_T^{i} - F_S^{i} \right\|_2^2 \tag{3} $$
Where:
$L_{\text{feature}}$ represents the feature-based KD loss.
$N$ denotes the number of feature layers or map points
included in the distillation process.
$F_T^{i}$ and $F_S^{i}$ correspond to the $i$-th feature maps of the teacher
and student networks, respectively.
$\|\cdot\|_2^2$ indicates the squared Euclidean (L2) norm, measuring
the difference between the teacher's and student's
feature maps.
The total KD loss combines the feature-based distillation
loss with the task loss (typically cross-entropy loss), and is
expressed as:
$$ L_{KD} = \alpha L_{CE} + (1 - \alpha) L_{\text{feature}} \tag{4} $$
Here, α is a hyperparameter that determines the weighting
between the cross-entropy task loss and the feature-based
distillation loss.
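As a worked illustration of Eqs. (3) and (4), the sketch below evaluates both losses on tiny hand-made feature vectors. The function names, the toy feature values, the cross-entropy value of 0.5, and α = 0.5 are assumptions chosen for the example, not values from the paper.

```python
def feature_kd_loss(teacher_feats, student_feats):
    """Eq. (3): mean over N feature maps of the squared L2 distance."""
    n = len(teacher_feats)
    total = 0.0
    for f_t, f_s in zip(teacher_feats, student_feats):
        total += sum((t - s) ** 2 for t, s in zip(f_t, f_s))
    return total / n

def total_kd_loss(ce_loss, feat_loss, alpha):
    """Eq. (4): alpha * L_CE + (1 - alpha) * L_feature."""
    return alpha * ce_loss + (1.0 - alpha) * feat_loss

# Two flattened feature maps per network (hypothetical values)
teacher = [[1.0, 2.0], [0.0, 1.0]]
student = [[1.0, 0.0], [1.0, 1.0]]

l_feature = feature_kd_loss(teacher, student)  # (4 + 1) / 2 = 2.5
l_kd = total_kd_loss(ce_loss=0.5, feat_loss=l_feature, alpha=0.5)  # 0.25 + 1.25 = 1.5
```

With α close to 1 the student is driven mainly by the classification loss; with α close to 0 it is driven mainly by matching the teacher's features.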
4) Classification: After the feature selection process, in which
the most informative DenseNet features were identified
through knowledge distillation (KD), a Gaussian SVM classifier
was employed for the binary classification of KOA
into KL grades 0 and 2. The Gaussian SVM, which uses
a radial basis function (RBF) kernel, was chosen for its
ability to model complex nonlinear decision boundaries in
the high-dimensional feature space derived from DenseNet.
This approach enabled the classifier to discriminate
between subtle structural and textural variations in knee joint
regions, as encoded by the distilled feature representations.
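To make the role of the RBF kernel concrete, the following sketch evaluates the decision function of a kernel SVM on two-dimensional toy points. It is only an illustration: the support vectors, dual coefficients, bias, and `gamma` are hand-picked assumptions, whereas a trained model would learn them from the distilled DenseNet features.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel SVM decision value: sum_i coef_i * K(sv_i, x) + bias."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

def predict(x, support_vectors, dual_coefs, bias, gamma):
    """Map the sign of the decision value to the two KL grades."""
    value = svm_decision(x, support_vectors, dual_coefs, bias, gamma)
    return "KL2" if value > 0 else "KL0"

# Hypothetical model: one support vector per class (illustrative only)
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
dual_coefs = [-1.0, 1.0]  # the sign encodes the class of each support vector
bias, gamma = 0.0, 0.5

near_first = predict([0.1, 0.0], support_vectors, dual_coefs, bias, gamma)
near_second = predict([2.0, 1.9], support_vectors, dual_coefs, bias, gamma)
```

Each support vector influences the decision only in its neighbourhood, and `gamma` controls how quickly that influence decays with distance, which is what allows a nonlinear boundary.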
III. RESULTS AND DISCUSSION
The bar chart in Figure 4 compares the classification accuracy
of three configurations of the DenseNet201-based model:
without GNDA (Gaussian noise data augmentation), without
KD (knowledge distillation), and the full model
combining GNDA + DenseNet201 + KD. When GNDA is
excluded, the accuracy drops to 82.6%, indicating that Gaussian
noise augmentation contributes substantially to
model generalization and robustness. When knowledge distillation
is removed, accuracy is 88%, suggesting that it also
provides a valuable performance gain, though a smaller one than
GNDA in this context (6.5 versus 11.9 percentage points).
The best performance is achieved by
the full combination GNDA + DenseNet201 + KD with an
accuracy of 94.5%, demonstrating that integrating both
techniques leads to the most effective learning
and model performance.
Figure 5 displays the Receiver Operating Characteristic
(ROC) curve, a common tool for evaluating the performance
of a binary classifier. The true positive rate (sensitivity) is
plotted against the false positive rate at various threshold
settings. The blue curve represents the ROC, and the light
blue shaded area under it corresponds to the Area Under
the Curve (AUC), which in this case is 0.99. An AUC of
0.99 indicates excellent classifier performance, suggesting that
the model has a very high ability to distinguish between the
two classes. Additionally, a highlighted point on the curve
[Bar chart: accuracy (%) by method. Without GNDA: 82.6; without KD: 88; GNDA+DenseNet201+KD: 94.5]
Fig. 4. Performance comparison of different methods.
at coordinates (0.07, 0.96) represents the current operating
point of the classifier, meaning it achieves a true positive
rate of 96% with only a 7% false positive rate (93% specificity).
This balance between sensitivity and specificity indicates that
the model is effective for the classification task.
Fig. 5. The ROC curve
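The quantities read off the ROC curve can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the confusion counts are hypothetical values chosen so that they yield the reported operating point (0.07, 0.96), and the three-point curve is a coarse stand-in for the full ROC curve, which is not available here.

```python
def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    return tp / (tp + fn), fp / (fp + tn)

def auc_trapezoid(points):
    """Area under a ROC curve given (fpr, tpr) points, via the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical counts (100 positives, 100 negatives) matching the operating point
tpr, fpr = rates(tp=96, fp=7, tn=93, fn=4)
specificity = 1.0 - fpr

# Coarse curve through the single operating point only
coarse_auc = auc_trapezoid([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])
```

The single-point approximation equals the mean of sensitivity and specificity, about 0.945 here. The reported AUC of 0.99 is higher because the full curve is traced over many thresholds rather than one.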
Table I summarizes the performance of various classification
methods on a dataset of 688 samples. Traditional approaches
like fractal analysis with logistic regression [5] achieved
limited accuracy (AUC 0.73). Improvements were seen with
DTCWT + SVM-RBF [6] and PSD + logistic regression
[7], reaching accuracies of 0.8038 and 0.7892, respectively.
Ribas et al. [8] later introduced a CNN model with a slight
performance gain (accuracy 0.8169). The proposed method,
combining GNDA, DenseNet201, knowledge distillation, and
Gaussian SVM, outperforms all previous approaches with an
accuracy of 0.945 (AUC 0.99), demonstrating the power of deep
learning and optimized classification.
TABLE I
COMPARISON OF METHODS WITH THE PROPOSED APPROACH.
Authors            | Year | Methods           | Classifier | Data | Acc
Janvier et al. [5] | 2017 | Fractal analysis  | LR         | 688  | 0.73
Riad et al. [6]    | 2018 | DTCWT             | SVM-RBF    | 688  | 0.8038
Brahim et al. [7]  | 2019 | PSD               | LR         | 688  | 0.7892
Ribas et al. [8]   | 2022 | CNN               | -          | 688  | 0.8169
Proposed           | 2025 | GNDA+DenseNet201  | GSVM       | 688  | 0.945
IV. CONCLUSION
This study presented an effective KOA classification
pipeline combining data augmentation, DenseNet-201 feature
extraction, and knowledge distillation-based feature selection,
achieving 94.5% accuracy in distinguishing KL grades 0 and
2. The results highlight the importance of feature selection
and augmentation, as their exclusion significantly reduced
performance. The method shows promise for improving auto-
mated KOA diagnosis, supporting early detection and clinical
decision-making. Future work could expand to multi-class
grading and larger datasets for broader validation.
REFERENCES
[1] Centers for Disease Control and Prevention (CDC),
“Osteoarthritis (OA),” 2020. [Online]. Available:
https://www.cdc.gov/arthritis/basics/osteoarthritis.htm
[2] T. Neogi, “The epidemiology and impact of pain in osteoarthritis,”
Osteoarthritis Cartilage, vol. 21, no. 9, pp. 1145–1153, 2013.
[3] H. Kellgren and J. S. Lawrence, “Radiological assessment of
osteoarthritis,” Ann. Rheum. Dis., vol. 16, pp. 494–502, 1957.
[4] H. Greenspan, B. van Ginneken, and R. M. Summers, “Deep learning
in medical imaging: Overview and future promise of an exciting new
technique,” IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1153–1159,
2016.
[5] T. Janvier, R. Jennane, A. Valery, K. Harrar, M. Delplanque, C. Lelong,
D. Loeuille, H. Toumi, and E. Lespessailles, “Subchondral tibial bone
texture analysis predicts knee osteoarthritis progression: Data from the
osteoarthritis initiative,” Osteoarthritis Cartilage, vol. 25, no. 2, pp.
259–266, 2017.
[6] R. Riad, R. Jennane, A. Brahim, T. Janvier, H. Toumi, and
E. Lespessailles, “Texture analysis using complex wavelet decomposition for knee
osteoarthritis detection: Data from the osteoarthritis initiative,” Comput.
Electr. Eng., vol. 68, pp. 181–191, 2018.
[7] A. Brahim, R. Riad, and R. Jennane, “Knee osteoarthritis detection using
power spectral density: Data from the osteoarthritis initiative,” in
Computer Analysis of Images and Patterns: 18th International Conference,
CAIP 2019, Salerno, Italy, September 3–5, 2019, Proceedings, Part II,
vol. 18, pp. 480–487, Springer International Publishing, 2019.
[8] L. Ribas, T. Riad, R. Jennane, and O. Brun, “A complex network based
approach for knee osteoarthritis detection: Data from the osteoarthritis
initiative,” Biomed. Signal Process. Control, vol. 71, p. 103133, 2022.
[9] G. Lester, “The osteoarthritis initiative: A NIH public–private
partnership,” HSS J., vol. 8, no. 1, pp. 62–63, 2012.
[10] H. X. Dou, X. S. Lu, C. Wang, H. Z. Shen, Y. W. Zhuo, and L. J.
Deng, “PatchMask: A data augmentation strategy with Gaussian noise
in hyperspectral images,” Remote Sens., vol. 14, no. 24, p. 6308, 2022.
[11] J. Zhou, X. Gu, H. Gong, X. Yang, Q. Sun, L. Guo, and Y. Pan,
“Intelligent classification of maize straw types from UAV remote sensing
images using DenseNet201 deep transfer learning algorithm,” Ecol.
Indic., vol. 166, p. 112331, 2024.
[12] M. Huang, Y. You, Z. Chen, Y. Qian, and K. Yu, “Knowledge distillation
for sequence model,” in Proc. Interspeech, pp. 3703–3707, Sep. 2018.
IDEAS: National Conference of Innovation on Data Engineering and AI Science.
June 17-18, 2025, Oran, Algeria.
Lattice Constant Prediction in Cubic Perovskites
via Machine Learning Tools
Nadjeh Zeghdoud #1, Faiçal Djani *1
¹Department of Materials Science, University of Biskra, Biskra, Algeria
1 Nadjeh.zeghdoud@univ-biskra.dz
1 f.djani@univ-biskra.dz
Abstract: An adopted dataset of cubic perovskites was screened
using machine learning for lattice constant prediction. The models
were built using support vector regression (SVR) and random
forest (RF) algorithms. On both the training and test sets,
support vector regression gave lower prediction errors than
random forest. The SVR correlation coefficient between the predicted
and experimental lattice constant reached as high as 0.998 for the
training data set and 0.9884 for the testing data. These findings
shed light on the power of machine learning to expedite the
discovery of structural properties in perovskite materials.
Keywords: Machine Learning, Perovskite, lattice constant,
Random Forest, Support Vector Regression.
I. INTRODUCTION
In the past decade, cubic perovskites have attracted a lot of
interest because of their unique and flexible optical,
mechanical, and electrical characteristics. These advantages
make them attractive options for a number of applications,
including solar cells, light-emitting diodes, actuators, and laser
cooling devices. The bandgap structure, structural stability,
and consequently the performance of the material are all
significantly impacted by the lattice constant a [1].
Determining lattice constant using traditional methods can
be a lengthy, costly, and labour-intensive process. Because of
this, there has been a shift to using machine learning in
conjunction with existing datasets to predict material
properties [2]. In this work, we propose two predictive models
to predict the lattice constant of cubic perovskites using
random forest and support vector regression algorithms on an
adopted dataset of cubic perovskites. The models will be
evaluated against each other regarding their applicability to
materials science.
II. LITERATURE REVIEW
The authors of [3] studied 132 ideal perovskite
compounds and developed a model that predicts lattice constants by
establishing a computational link between the tolerance factor
(𝑡) and the summed radii of the B and X ions. They
achieved an average prediction error of 0.63% and a highly
correlated model (R² = 0.995) to aid in substrate design for
epitaxial semiconductors, while emphasising the
increased stability that ionic geometry brings to a lattice.
Reference [4] presents an empirical approach for estimating
cubic perovskite lattice constants using the product of the
average ionic radius (𝑟ₐᵥ) and valence electrons for ABX₃. The
authors noted a linear relationship between the
lattice constant and 𝑟ₐᵥ, with different valence electron
products leading to different linear relationships. Their empirical
tool achieved better agreement with
experimental work than previous models.
In addition, the lattice constants of ABX₃ cubic perovskites,
including oxides and halides, were predicted using a Gaussian
Process Regression (GPR) model [5]. The quantity of valence
electrons and the ionic radii of the constituent ions were
among the descriptors considered for the model. Strong
predictive accuracy was attained by the model (R² = 0.9987,
RMSE = 0.02104 Å).
Machine learning was also applied in [6], where four
methods were tested on a dataset of 122 ABX₃ compounds using a
combination of features such as density, atomic number,
electronegativity, and ionic radii. The GPR model performed
best, with R² = 1.00 for training and R² = 0.99 for testing.
This shows how machine learning may be used to replace
traditional prediction techniques.
III. MATERIALS AND METHOD
A. Data preparation
The dataset was obtained from [3] and contains the lattice
constants of 132 ABX₃ cubic perovskites. The dataset was
divided into two subsets in this work: the training dataset,
which included 106 perovskites, and the testing dataset, which
included 26 perovskites.
B. Feature Generation
The lattice constant was modelled using the following
descriptors [3]:
• The ionic radii: rA, rB and rX are the ionic
radii of cation A, cation B and anion X, respectively.
• t: the tolerance factor
• 2(rB+rX)
• 2(rB+rX)-a
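The descriptors listed above can be computed directly from the ionic radii. The sketch below assumes the standard Goldschmidt definition of the tolerance factor, t = (rA + rX) / (√2 (rB + rX)), which the text does not spell out. The SrTiO₃ radii and lattice constant are approximate, illustrative values and are not taken from the dataset of [3].

```python
import math

def descriptors(r_a, r_b, r_x, a):
    """Return the six descriptors for one ABX3 compound (all lengths in angstroms)."""
    # Goldschmidt tolerance factor (assumed definition)
    t = (r_a + r_x) / (math.sqrt(2.0) * (r_b + r_x))
    bx2 = 2.0 * (r_b + r_x)
    return {
        "rA": r_a,
        "rB": r_b,
        "rX": r_x,
        "t": t,
        "2(rB+rX)": bx2,
        "2(rB+rX)-a": bx2 - a,  # a is the lattice constant
    }

# Illustrative values for SrTiO3 (approximate radii and lattice constant)
features = descriptors(r_a=1.44, r_b=0.605, r_x=1.40, a=3.905)
```

For an ideal cubic perovskite t is close to 1, and 2(rB+rX) is the lattice constant expected if the B and X ions touch along the cell edge, so the last descriptor measures the deviation from that hard-sphere picture.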
C. Computational Software
Weka 3.8.6 was used in this work to build the
machine learning models. The general procedure was as
follows:
Fig. 1 The general machine learning process with Weka software
IV. RESULTS
A. Feature selection
Eliminating irrelevant features is essential to
improving the prediction. In this work,
the wrapper method was used as a selection tool, with
support vector regression (SVR) as the base regressor. As seen in
Figure 2, the 10-fold cross-validation test shows that
the tolerance factor and 2(rB+rX) were each selected in
100% of the folds and therefore form the optimal feature subset.
[Histogram: vertical axis "Number of folds" (0 to 100); horizontal axis "number of features" with categories rA(Å), rB(Å), rX(Å), t, 2(rB+rX), 2(rB+rX)-a]
Fig. 2 Histogram representing the selected features
B. Model selection
First, the support vector regression (SVR) and random
forest (RF) machine learning models were evaluated using a
range of criteria in order to determine which was the best
regression model. The performance metrics (MAE, RMSE,
and correlation coefficient R) are reported in Table I.
TABLE I
COMPARISON OF THE MODELS' PERFORMANCE
Model Type                      | Testing                  | Training
                                | R      | RMSE   | MAE    | R      | RMSE   | MAE
Support Vector Regression (SVR) | 0.9884 | 0.0812 | 0.0685 | 0.998  | 0.065  | 0.061
Random Forest (RF)              | 0.9242 | 0.2215 | 0.19   | 0.9983 | 0.1447 | 0.1086
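For reference, the three metrics in Table I can be computed as in the sketch below. The standard definitions of MAE, RMSE, and the Pearson correlation coefficient are assumed, and the lattice constants in the example are made-up numbers rather than values from the dataset.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical lattice constants in angstroms (illustrative only)
actual = [3.90, 4.00, 4.10, 4.20]
predicted = [3.95, 3.95, 4.15, 4.15]

scores = (pearson_r(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

R measures only how well the predictions follow the trend of the actual values, whereas RMSE and MAE measure the size of the errors, so two models can have similar R values and clearly different errors.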
Next, the two models were re-examined through a
comparison of actual and predicted values. Sample outputs
from both models on the testing set are presented in
Tables II and III for a closer examination of prediction
accuracy.
TABLE II
RF MODEL: ACTUAL AND PREDICTED LATTICE CONSTANT
FOR PEROVSKITE COMPOUNDS
TABLE III
SVR MODEL: ACTUAL AND PREDICTED LATTICE CONSTANT
FOR PEROVSKITE COMPOUNDS
Then, scatter plots of predicted versus actual values were
created to visualize the performance differences between the
models. The scatter plots are shown in Figs. 3, 4, 5 and 6.
Fig. 3 Scatter plot of actual versus predicted lattice constants (Å) for SVR
model on the training dataset.
Fig. 4 Scatter plot of actual versus predicted lattice constants (Å) for SVR
model on the testing dataset.
Fig. 5 Scatter plot of actual versus predicted lattice constants (Å) for RF
model on the training dataset.
Fig. 6 Scatter plot of actual versus predicted lattice constants (Å) for RF
model on the testing dataset.
C. Hyperparameter tuning
The results in the last section utilized hyperparameters that
were manually selected through iterative evaluation to achieve
acceptable performance from the model. The hyperparameters
that we manually selected were:
• For the random forest (RF) model, the number of trees
(numIterations) was set to 400 and the minimum
variance for a split to 0.001; all other
hyperparameters, including the maximum tree depth and the
maximum number of features, were left at their default values.
• The support vector regression (SVR) model was run
with the RBF kernel. The parameter C was set to 73.0, the
epsilon parameter ε to 0.001, and the
kernel parameter gamma (G) to 0.1; all
other parameters were left at their default
values in the Weka configuration.
Automated hyperparameter tuning may further
increase accuracy, but these parameter values were a fair
starting point for comparison purposes.
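To show what the SVR parameters C and ε control, the sketch below evaluates the ε-insensitive loss used in the standard SVR formulation. The values of C and ε are those reported above; the function names and the two lattice constants are hypothetical, and the Weka implementation may differ in its details.

```python
def epsilon_insensitive(actual, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)

def svr_penalty(actuals, predictions, c, epsilon):
    """Data-fit term of the SVR objective: C times the summed losses."""
    return c * sum(epsilon_insensitive(y, p, epsilon)
                   for y, p in zip(actuals, predictions))

C, EPSILON = 73.0, 0.001  # values reported in the text

# Hypothetical lattice constants in angstroms: the first error (0.0005) lies
# inside the tube, the second (0.011) exceeds it by about 0.010.
actuals = [3.905, 4.000]
predictions = [3.9055, 4.011]

penalty = svr_penalty(actuals, predictions, C, EPSILON)  # about 73 * 0.010
```

A small ε makes the model sensitive to even small errors, while a large C penalizes those errors heavily relative to the regularization term. This is one reason an automated search over C, ε, and gamma could change the results.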
V. DISCUSSION
Both models showed excellent fitting abilities, with R
values close to 1 in training. The SVR model showed an R of
0.998 with an RMSE of 0.065 and an MAE of 0.061, indicating a
very good fit to the training data with very low error.
The RF model also recorded a high R of 0.9983, but its
RMSE of 0.1447 and MAE of 0.1086 were considerably
higher than those of SVR. RF therefore fitted
the training data with a high correlation but with
larger errors, indicating lower accuracy.
In the testing phase, the SVR model performed better
compared to the RF model. The SVR gave a very high R
value of 0.9884, indicating strong correlation between actual
and predicted values. It also gave RMSE of 0.0812 and MAE
of 0.0685, both of which were very low and reflected high
accuracy of predictions and low average error.
Conversely, the Random Forest model got a lower R value
of 0.9242. Its RMSE increased to 0.2215 and MAE to 0.19,
which are both considerably higher than SVR's. This indicates
that RF is less accurate, and makes greater average errors
when predicting the target variable on unseen data.
These numerical findings are supported by the scatter plots
(Figs. 3 to 6) and by Tables II and III. The scatter plots show that the SVR
model provides more accurate predictions of lattice constants
on unseen data (Fig. 4), with most points clustering near the
diagonal, which suggests low prediction error, as shown in
Table III. The RF model, on the other hand, shows a greater spread,
with several points deviating significantly from
the y = x line, indicating greater prediction error than the SVR
model, as shown in Fig. 6 and Table II.
VI. CONCLUSION
This study demonstrated the use of
machine learning algorithms, as run in Weka, to predict the
lattice constants of cubic perovskites. Comparing the two
algorithms showed clearly that
support vector regression performed better during testing,
with a higher correlation coefficient, lower RMSE, and more stable
predictions, as confirmed by both the performance tables and
the actual versus predicted scatter plots. These findings have several
implications for materials science, especially for perovskite
materials:
• Predicting the lattice constant can help develop new
materials faster and more efficiently, since it
eliminates many candidates without requiring
experimental synthesis of each one.
• The lattice parameter affects electronic,
magnetic, thermal, and mechanical properties, so
predicting it accurately is important for
understanding and tuning these relationships.
• Computational prediction is an effective way to
reduce the number of physical experiments, which
reduces the resources needed for synthesis
and characterization.
• Predictive models will aid in designing crystals with
specific unit cell dimensions to fit potential
applications such as catalysis.
• The SVR results support
confidence in using machine learning models for
predictive material design in practice.
REFERENCES
[1] Y. Zhang and X. Xu, “Machine learning lattice constants for cubic
perovskite compounds,” ChemistrySelect, vol. 5, no. 32, pp. 9999–10009,
2020.
[2] A. S. Verma and V. K. Jindal, “Lattice constant of cubic perovskites,”
J. Alloys Compd., vol. 485, no. 1-2, pp. 514–518, 2009.
[3] L. Q. Jiang, J. K. Guo, H. B. Liu, M. Zhu, X. Zhou, P. Wu, and C. H.
Li, “Prediction of lattice constant in cubic perovskites,” J. Phys. Chem.
Solids, vol. 67, no. 7, pp. 1531–1536, 2006.
[4] A. S. Verma and V. K. Jindal, “Lattice constant of cubic perovskites,”
J. Alloys Compd., vol. 485, no. 1-2, pp. 514–518, 2009.
[5] Y. Zhang and X. Xu, “Modeling of lattice parameters of cubic
perovskite oxides and halides,” Heliyon, vol. 7, e07601, 2021.
[6] A. Alfares, Y. A. Sha’aban, and A. Alhumoud, “Machine learning-
driven predictions of lattice constants in ABX₃ perovskite materials,”
Eng. Appl. Artif. Intell., vol. 141, no. 2, p. 109747, 2024.