
The majority of the data generated today is unstructured, which presents unique challenges
in terms of storage, processing, and analysis. Unlike structured data, which is neatly organized
in databases and easily accessible through simple queries, unstructured data lacks a consistent
format, making it difficult to store in traditional relational databases. It therefore calls for flexible
storage solutions, such as NoSQL databases [66] or distributed file systems, that can handle varied
data formats and scales. Processing unstructured data is complex, often necessitating advanced
techniques like natural language processing, machine learning, and computer vision to extract
meaningful insights. The analysis of unstructured data also involves extensive preprocessing to
clean and transform the data into a usable format.
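As a rough illustration of the preprocessing step described above, the following minimal sketch cleans a piece of raw text and turns it into word counts that downstream analysis could consume. The function name `preprocess`, the sample string, and the cleaning rules (dropping HTML-like tags, lowercasing, keeping alphanumeric runs) are assumptions chosen for illustration only; real pipelines typically need far more elaborate handling.

```python
import re
from collections import Counter


def preprocess(text):
    """Hypothetical cleaning step: strip HTML-like tags, lowercase, tokenize."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop anything that looks like a tag
    text = text.lower()                    # normalize case
    return re.findall(r"[a-z0-9]+", text)  # keep alphanumeric runs as tokens


# Illustrative raw input, e.g. a scraped product review (made-up sample).
raw = "<p>Great product!!! Shipping was SLOW, but the product works.</p>"

tokens = preprocess(raw)
counts = Counter(tokens)  # a simple structured view of the cleaned text
```

After this step the free-form text has become a list of tokens and a frequency table, that is, a representation that conventional tools can store and analyze.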
The distinction between structured and unstructured data is crucial because it influences how
data can be managed and utilized. Structured data is straightforward to query and analyze using
traditional database management tools and SQL queries. Unstructured data, by contrast,
must first be processed with techniques such as natural language processing (NLP) [31] and machine
learning algorithms [44] before it yields comparable insights.
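To make the contrast concrete, the sketch below shows how little work a structured data set needs before it can be analyzed: once the records sit in a table with a fixed schema, a single SQL query produces an aggregate. The table name `orders`, its columns, and the sample rows are hypothetical and serve only as an example; the code uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3

# In-memory relational database with a fixed, known schema (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# One declarative query answers the question "how much did each customer spend?"
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
```

No cleaning, tokenization, or model is involved here; the schema already tells the database how to interpret every value, which is exactly what unstructured data lacks.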
2.2 Big Data
Over the last few years, Big Data, generated from diverse sources like online transactions, emails,
videos, and social networking interactions, has been the "center of modern science and business."
These data sets grow massively, making them difficult to manage and analyze using traditional
database software tools [55].
One issue arising from this rapid expansion was the absence of a clear definition of Big Data.
According to Big Data: A Review by Sagiroglu et al. [55], Big Data is a term for massive
data sets with large, varied, and complex structures that present storage, analysis, and visualization
challenges for further processing or results. Big Data is characterized by three main components,
Variety, Velocity, and Volume, and requires revolutionary steps forward from traditional
data analysis, as illustrated in Figure 2.1.
•Variety encompasses diverse data types and sources, both structured and unstructured,
such as emails, images, videos, PDFs, audio files, log files, sensor data, and social media
content from platforms like Facebook, Twitter, LinkedIn, and YouTube. This diversity of
data forms poses significant storage, mining, and analysis challenges.
•Volume refers to the enormous quantities of data generated from various sources, including
individual and organizational activities, social media interactions, and machines.
Data that used to be primarily user-generated now comes from a much broader
array of sources.
•Velocity describes the speed at which data is generated, processed, and made available for
use. It encompasses the rapid flow of data from sources like business processes, machines,