VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF Free Download

1 / 100
0 views100 pages

VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF Free Download

VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF free Download. Think more deeply and widely.

alardalen University
School of Innovation, Design and Engineering
aster˚as, Sweden
Thesis for the Degree of Bachelors of Science in Computer Science -
15.0 credits
VERTICAL SCALING FOR BIG DATA
ANALYTICS AND PROCESSING - A
CASE STUDY
Mathias Asplin
man18002@student.mdu.se
Examiner: Abu Naser Masud
alardalen University, aster˚as, Sweden
Supervisor(s): Masoud Daneshtalab
alardalen University, aster˚as, Sweden
06/02/2025
Mathias Asplin Big data analytics and processing
Abstract
Big data is often used as a buzzword but despite this common usage of the term it is still
poorly understood. This paper explores new methods of handling big data based on technological
advancements in big data tools, cloud computing capabilities and file storage. This paper focuses on
new tools such as Polars, Duckdb and Daft to determine how mature these tools are and whether
they hold their promises. By using a busy environment consisting of a 4 core machine with 16GB
of RAM, Duckdb and Daft were able to analyze and process datasets ranging from a few million
rows to over a billion rows. While Polars fell behind as the data size grew, it proved formidable
for data engineering tasks consisting of data ingestion, wrangling and I/O. With the ability of
these tools to perform out of core computations on datasets many times larger than RAM and
cloud vendors offering machines with thousands of cores and tens of terabytes of RAM on a single
machine, analyzing and processing big data has never been more simplistic. Reduction in code
complexity, environment complexity, compute costs and lowering the skill ceiling for data analytics
and engineering problems are a few advantages when compared to distributed big data tools such as
Apache Hadoop and Spark.
i
Mathias Asplin Big data analytics and processing
Table of Contents
1. Introduction 1
1.1 Abriefhistoryofdata.................................. 1
1.2 Thetoolsneeded ..................................... 1
1.3 Bigdatainpractice ................................... 2
1.4 ProblemFormulation................................... 3
1.5 Scopeandlimitations .................................. 3
2. Background 5
2.1 Whatisbigdata? .................................... 5
2.2 Doesbigdatamatter? .................................. 5
2.3 Tools for big data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Distributed data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Howtousedata ..................................... 7
2.6 ToolsforETLprocesses ................................. 8
2.6.1 MicrosoftExcel.................................. 8
2.6.2 Relationaldatabases............................... 8
2.6.3 Python ...................................... 8
2.6.4 Commonly used Python libraries . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Dataframes ........................................ 9
2.8 Cloudcomputingconcepts................................ 9
2.8.1 Objectstorage .................................. 9
2.8.2 Compute ..................................... 9
2.8.3 Serverless ..................................... 10
2.9 Datastorage ....................................... 10
2.9.1 OLTPdatabases ................................. 10
2.9.2 OLAPdatabases ................................. 10
2.9.3 Datalakehouse.................................. 10
2.10 New tools for data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.10.1 Dask........................................ 10
2.10.2 DuckDB...................................... 11
2.10.3 Polars ....................................... 11
2.10.4 Daft ........................................ 11
2.10.5 ApacheParquet ................................. 11
2.10.6 ApacheArrow .................................. 11
2.11 Eager and Lazy evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3. Related Work 13
3.1 Databases......................................... 13
3.2 Research using vertical scaling tools . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Bigdatamindset..................................... 13
4. Method 14
5. Ethical and Societal Considerations 15
6. The experiments 16
6.1 Overview ......................................... 16
6.2 Dataingestion ...................................... 16
6.3 Datawrangling...................................... 17
6.4 DataI/O ......................................... 17
6.5 Data exploration and analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6.6 Detailsabouttests .................................... 18
ii
Mathias Asplin Big data analytics and processing
7. Results 20
7.1 DataIngestion ...................................... 20
7.1.1 Pandas ...................................... 20
7.1.2 Polars ....................................... 20
7.1.3 Duckdb ...................................... 20
7.1.4 Daft ........................................ 20
7.1.5 Dask........................................ 20
7.2 DataWrangling...................................... 20
7.2.1 Pandas ...................................... 20
7.2.2 Polars ....................................... 20
7.2.3 Duckdb ...................................... 21
7.2.4 Daft ........................................ 21
7.2.5 Dask........................................ 21
7.3 DataIO.......................................... 21
7.3.1 Pandas ...................................... 21
7.3.2 Polars ....................................... 21
7.3.3 Duckdb ...................................... 22
7.3.4 Daft ........................................ 22
7.3.5 Dask........................................ 22
7.3.6 Files........................................ 22
7.4 Data Ingestion, Wrangling and IO summary . . . . . . . . . . . . . . . . . . . . . . 23
7.4.1 Generaloverview................................. 23
7.4.2 CPUutilization.................................. 23
7.5 Data Exploration/Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.5.1 Smalldataset................................... 24
7.5.2 Mediumdataset ................................. 25
7.5.3 Largedataset................................... 26
7.5.4 Verylargedataset ................................ 26
7.5.5 Manylesdataset ................................ 26
7.6 DataAnalyticssummary................................. 27
7.6.1 Generaloverview................................. 27
7.6.2 CPUutilization.................................. 27
7.6.3 Walltime difference between datasets . . . . . . . . . . . . . . . . . . . . . . 28
8. Discussion 36
8.1 Benchmarking....................................... 36
8.1.1 Overview ..................................... 36
8.1.2 Continous improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2 Practicalconsiderations ................................. 36
8.3 Practicallimitations ................................... 37
8.4 Costs and relation to cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.5 New perspectives on big data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.6 Business considerations and dynamic data storage . . . . . . . . . . . . . . . . . . 39
8.7 Bigdatainuniversities.................................. 39
9. Conclusions 40
9.1 Generalconclusions.................................... 40
9.2 Pandasevolution ..................................... 40
9.3 Polars ........................................... 40
9.4 DuckdbandDaft..................................... 40
9.5 Paired with cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
9.6 FutureWork ....................................... 40
References 45
Appendix A Appendex Python Imports 45
iii
Mathias Asplin Big data analytics and processing
Appendix B Appendix Pandas Code 46
Appendix C Appendix Polars In Memory Code 51
Appendix D Appendix Polars Out Of Core Code 56
Appendix E Appendix Duckdb Code 61
Appendix F Appendix Duckdb View Code 69
Appendix G Appendix Daft In Memory Code 76
Appendix H Appendix Daft Out Of Core Code 80
Appendix I Appendix Dask Code 85
Appendix J Appendix Graphs Code 90
iv
Mathias Asplin Big data analytics and processing
List of Figures
1 Printout of a Pandas dataframe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Simplified architecture of the Parquet file both in memory and in storage . . . . . 11
3 Data Ingestion walltime average over 5 runs . . . . . . . . . . . . . . . . . . . . . . 21
4 Data Wrangling walltime average over 5 runs . . . . . . . . . . . . . . . . . . . . . 22
5 Filesizesinmegabytes.................................. 23
6 Data IO walltime average over 5 runs . . . . . . . . . . . . . . . . . . . . . . . . . 24
7 Data summary for ingestion, wrangling and I/O . . . . . . . . . . . . . . . . . . . 25
8 Data summary best performance for ingestion, wrangling and I/O . . . . . . . . . 26
9 Data summary of CPU utilization for ingestion, wrangling and I/O as a factor of
CPU time divided by walltime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
10 Data summary of best CPU utilization performance for ingestion, wrangling and I/O 28
11 Data Analytics small dataset walltime average over 5 runs . . . . . . . . . . . . . . 29
12 Data Analytics medium dataset walltime for one run . . . . . . . . . . . . . . . . . 29
13 Data Analytics large dataset walltime for one run . . . . . . . . . . . . . . . . . . . 30
14 Data Analytics very large dataset walltime for one run . . . . . . . . . . . . . . . . 30
15 Data Analytics many files dataset walltime for one run . . . . . . . . . . . . . . . . 31
16 Data Analytics summary for all runs, compared between datasets . . . . . . . . . . 31
17 Data Analytics best performance summary for all runs, compared between datasets 32
18 Data Analytics summary of CPU utilization as a factor of CPU time divided by
walltime for each dataset size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
19 Data Analytics summary of best CPU utilization performance between the tools for
eachdatasetsize ..................................... 33
20 Data Analytics time increase as a factor when going from a small dataset to a
mediumdataset...................................... 33
21 Data Analytics time increase as a factor when going from a medium dataset to a
largedataset ....................................... 34
22 Data Analytics time increase as a factor when going from a large dataset to a very
largedataset ....................................... 34
23 Data Analytics time increase as a factor when going from a large dataset to a very
largedatasetinmanyles................................ 35
v
Mathias Asplin Big data analytics and processing
1. Introduction
“The world’s most valuable resource is no longer oil, but data.”
This is the title of an Economist article from 2017 [1]. Data secretly runs the world. It is
everywhere and has never been as ubiquitous as now. Knowing how to use and handle data is
becoming increasingly difficult as the volumes grow. The term big data is often thrown around
in adjacency with subjects such as data analytics, data science, artificial intelligence and machine
learning, but what does it mean and how does it differ from traditional data? This is a question
that is constantly being asked by companies all around the world and they still do not agree with
each other on a precise definition.
1.1 A brief history of data
When computers were first invented, the philosophy around data boiled down to keeping as little
data as possible and not wasting any bits in the system. With the advances in computer hardware,
both software developers and companies noticed how they were not bound to such restrictions
anymore and by collecting data they could also analyze it.
The concept of analyzing data has been around for thousands of years. It was proven valuable
in gaining a competitive advantage or to improve decision making. Romans even used data to
predict the likelihood of events [2]. While the data analysis was historically done with pen and
paper, there was a change in the technology medium as computers became more sophisticated.
This technology developed out of sync however, while computers increased in performance, the
ability to store data increased even more. This is the fundamental change that led to the concepts
of big data and can be summarized to data that is too big to process in a single machine either
due to time to process or it simply not fitting into the machine’s main memory.
Data in the modern era comes from a variety of sources ranging from web based user content,
sensors, mobile devices, internet of things and shared through various mediums between companies.
The ability to process and analyze this data has become a serious concern for companies as only
6% state that they experience no problems with big data technology, according to BARC [3].
MongoDB, a NoSQL database provider, states that the overall amount of data that is stored
doubles every two years. In addition to this, some of the biggest challenges includes data quality,
data security, data processing and hiring skills big data engineers.
They write,
The applications for Big Data are practically limitless if organizations can find enough
people with the skills to implement them. Not many people are actually trained in Big
Data, and businesses face a major shortage of experienced and certified data scientists
and data analysts.
1.2 The tools needed
While very large technology companies such as Google, Meta, Apple, Microsoft and Amazon have
the resources build their own tools to solve these problems, it is not as simple for the vast majority
of companies to do so.
Free and open source tools were used a lot but they came with their own problems. Hadoop,
which works on a distributed systems with their MapReduce model, lacked support for a general
programming model. Even as early as 2012, 6 years after initial release, companies experienced
problems in regards to performance and resource management, among other things [4].
In order to solve the problems with Hadoop, Spark was created and shared the same open
source environment as Hadoop. Spark was built to use the distributed file storage of Hadoop,
1
Mathias Asplin Big data analytics and processing
HDFS, but processed the data differently. While Spark did not share the same core issues as
Hadoop, it had problems of its own which came down to how distributed systems work in general.
Skewing, where one node got more data than others, spilling data from primary memory to disk
and shuffling the data between nodes [5]. These operations are unavoidable in a distributed system
but can be mitigated by skilled big data engineers.
Possible ways to avoid these issues includes the following,
Buy services from cloud providers.
Buy services from third parties.
Buy super computers.
While these options work, each of them carry their own problem. Buying services from cloud
providers or third parties like Snowflake and Databricks can be expensive. They are prone to
the same risks of distributed computing as Spark as they often use distributed systems in the
background [6]. The companies also charge per query and data recovered which can be money
spilled due to errors in queries or unexpected bugs. Even if the user sometimes have the ability to
choose the amount of compute instead of a serverless configuration, developers and analysts alike
might not understand or care to tweak it for their individual problems and thus opts for more
compute than they need. Super computers is not an option companies chooses due to the heavy
initial cost as well as the cost of upkeep [7]. Super computers are also not scalable once the data
eventually reaches the limits.
1.3 Big data in practice
So, how would a list of requirements for big data analytics and processing look like?
Cost effective.
Scalable.
Performant.
Fault tolerant.
Flexible.
Easy to setup.
Easy to use.
While the solutions above solve some of these points, there is no solution that encompass all of
them.
According to a study made by Fivetran, an ETL service provider, the average Snowflake cus-
tomer spends $300,000 per year, of which 90% are queries and of those queries, over 50% are data
processing queries. The study further concludes that the median query scans about 100MB of data
and 0.1% of all queries scans above 300GB of data [8].
Distributed solutions seemed to be the defacto standard when dealing with big data, mainly
due to Hadoop’s popularity and the lack of other open source alternatives. With advancements
in cloud computing and new open source tools emerging, is it time to reevaluate the premise that
drives big data analytics and processing, namely the distributed model?
The big benefits of working with a vertical instead of horizontal scaling model are plenty.
Data skewing and data shuffling are no longer issues. With solid tools and frameworks the code
2
Mathias Asplin Big data analytics and processing
complexity decreases and the bar for working with big data is decreased drastically. The cost is
also decreased due to not needing multiple machines. Modern cloud providers offers machines with
tens of terrabytes of primary memory and hundreds of cores [9].
Additional benefits includes the ability to use laptops or cheap hardware to conduct complex
analytics should the data be big enough to fit in storage but not in the primary memory.
Distributed analytics and processing will always have its place, for example for I/O heavy
computations or to add multiple GPUs. What vertical scaling offers applies to the majority but
not all cases.
This paper will explore the tools and concepts which lies the foundation of a new way of thinking
regarding the ability to analyze and process large amounts of data. While the experiments will be
conducted on tabular data, the methods can equally be applied to any other type of data processing
such as image processing. It will also touch on the advancements of cloud technologies that enables
single machine data processing to scale.
1.4 Problem Formulation
This paper is a case study which primarily focuses on the tools and frameworks for big data
analytics and processing, mainly looking at tools for tabular datasets. It will evaluate tools in
terms of speed, memory stability and ability to perform computations on data outside of the
primary memory.
The question this paper aims to answer is which tools exist, how can they be used for big
data analytics and processing as well as what the average user should expect from them. The
benchmarks are done primarily as a way to illustrate their strengths, limitations and evolutions
rather than to compare them in terms of speed. Multiple tools can be used in conjunction to solve
complex problems.
Some of the core principles of vertical scaling includes.
Cost effective data analytics and processing.
Possibility of vertical scalability.
Performant tools and frameworks.
Fault tolerant data analytics and processing.
Reduction in complexity regarding setup and usage.
Cost effective data analytics and processing is important to keep costs low in any environment,
whether it is a company environment, research environment or for personal usage. Vertical scal-
ability is important to adjust for both small and large data loads. It is important to ensure the
tools and frameworks actually utilizes all of the computers resources such as all of the cores and
I/O capabilities fully. For complex analytics it is necessary to ensure if a fault happens it happens
gracefully or the ability to reduce the sources or faults in general. Lastly, it is important to ensure
a reduction in technical complexity compared to a distributed setup, both in terms of the compute
environment and the usage of the compute environment.
1.5 Scope and limitations
1. This case study will not go into the extreme cases nor go into areas where analytics is not
used, for example distributed request handling of a large website split across several servers.
3
Mathias Asplin Big data analytics and processing
2. Even though some of the tools used in the benchmarking can use GPUs for processing, this
will not be considered due to the GPUs themselves requires in memory processing.
3. Only the primary means of writing queries will be tested, so if a tool has an SQL interface
that is not the primary means of using the tool, it will not be considered.
4. This case study will not use custom functions.
5. This case study will also not consider all of the options, configurations or settings. It will
have an ”out of the box” approach.
4
Mathias Asplin Big data analytics and processing
2. Background
2.1 What is big data?
This is a question that is constantly being asked by companies all around the world and they still
do not agree with each other on a precise definition.
Michael Chen at Oracle, a software company known among other things for their database
writes that
“Big data refers to extremely large and complex data sets that cannot be easily managed
or analyzed with traditional data processing tools, particularly spreadsheets.” [10]
They further continues by clarifying the characteristics of big data, namely the five V’s, volume,
velocity, variety, veracity and value.
Informatica, a software company that focuses on data engineering tools, writes
“Big data refers to the 21st-century phenomenon of exponential growth of business data,
and the challenges that come with it, including holistic collection, storage, management,
and analysis of all the data that a business owns or uses.” [11]
They further continues by clarifying the characteristics of big data, the four V’s, volume,
velocity, variety and veracity.
Google also has a different take on big data, they write as follows
“Big data refers to extremely large and diverse collections of structured, unstruc-
tured, and semi-structured data that continues to grow exponentially over time. These
datasets are so huge and complex in volume, velocity, and variety, that traditional data
management systems cannot store, process, and analyze them.” [12]
While these three different takes on what big data is does have similarities, it still leaves
questions unanswered. Is my data big data? Do I need to use big data tools? How many V’s do I
need to consider? Three like Google, four like Informatica or five like Oracle?
2.2 Does big data matter?
Some companies might even question the need for analysing and handling big data at all. Why
bother with costly infrastructure and staff to manage an analytics platform? According to Google,
companies in general perform better when they’re able to analyze their data. [12]
“58% of companies that make data-based decisions are more likely to beat revenue targets
than those that don’t.”
“Organizations with advanced insights-driven business capabilities are 2.8x more likely to
report double-digit year-over-year growth.”
“Data-driven organizations generate, on average, more than 30% growth per year.”
“We don’t have better algorithms. We have more data.”
5
Mathias Asplin Big data analytics and processing
This is a quote from Peter Norvig, Google’s Research Director [13]. His quote refers to how more
data is more effective in training Machine Learning models rather than fine tuning their models,
however this can also be applied to regular statistics and other forms of statistical modeling. For
companies, being data driven is becoming a necessity and with big data it is becoming a challenge.
For the small companies it can be about costs. Staff, data analytics platforms and services to
handle the data. For large companies it can be about data complexity, integration with legacy
systems and processing costs. These are merely the business aspects of problems with big data.
While big data poses a challenge, research also shows that companies who can handle their big
data also benefit from it. In the end, the decision to become data driven is entirely individual,
but in a world where data has been hailed as “the gold/oil of the 21th century”, the question of
whether the companies not willing to adapt will suffer because of it is becoming a yes.
2.3 Tools for big data processing
Tools for big data can be considered a subject of its own where the ecosystem consists of thousands
of different variants, each with their strengths and weaknesses as well as overlaps. It is therefore
unreasonable that anyone knows all of them. This paper will focus specifically on the processing
tools and only bring up a few popular tools and services today without explaining how they work
in detail.
Apache Hadoop launched in 2006. It quickly became a defacto standard for big data processing
due to its ability to distribute the data with its Hadoop Distributed File System (HDFS) and
workload via its MapReduce paradigm on multiple machines, thereby leveraging a combined effort
to solve the processing problems [14].
Apache Spark release in 2013 as an improvement of the MapReduce paradigm of Hadoop, which
required a linear flow which was not suitable for all processing demands. Spark relies on the HDFS
solution of Hadoop and is also designed primarily for distributed tasks [15], [16].
There are plenty of companies that offers data analytics as a service, most notably Snowflake,
Databricks, Amazon AWS, Microsoft Azure and Google Cloud. What all of these services and
companies have in common is that their solutions are based on a distributed architecture to leverage
the scalability of machine clusters [8]. While it is possible to run these tools and services on
single machines, the overhead voids the benefits. When using the services for data analytics and
processing, the question of cost comes into the picture.
6
Mathias Asplin Big data analytics and processing
2.4 Distributed data processing
Hadoop has been popular in the big data industry for two primary reasons, one, its HDFS which en-
ables storage of data across multiple machines and two, its MapReduce programming model which
enables processing across these machines. While the technology has constantly been improved
upon, the philosophy still remains the same, to split the data and processing across multiple
worker nodes which enables them to work together. This has many advantages.
Horizontal scalability. If the processing is slow, simply add more worker nodes.
Vertical scalability. Nodes can use multi-core processing.
Batched processing. It does not matter if the data fits into the main memory or not.
Persistent storage across multiple devices as a logical unit. Should one worker node fail, there
is redundancy on other machines.
While there are advantages, there are also disadvantages.
Environment complexity. Setting up and configuring the main node and worker nodes is a
challenge by itself.
Programming complexity. If data needs to be transferred between nodes often there is a
reduced performance gain. Bugs can be extremely difficult to solve.
Cost. This comes in many shapes and forms. The cost of setting up a cluster. The cost of
maintaining a cluster. Network costs. Storage costs, including HDFS redundancy storage
cost.
Optimization difficulties. Even with competent staff it can be difficult to optimize the uti-
lization of the cluster. If it is not in use it drains money, but if it’s shut down there’s a cost
of getting it back up.
This technology has been improved upon during the past decade but the disadvantages are
still there. These distributed solutions requires the scale to be truly above the capacity of a single
machine in order to be efficient and necessary. The overhead of of these systems when used for a
single node only helps to drain the resources.
2.5 How to use data
While distributed data processing explains how to handle big data, it does not explain the use
cases for big data, or data in general. While the many components needed for systems to handle
big data is out of scope for this paper, knowing what Extract, Load and Transform, commonly
referred to as ETL, actually means is crucial for understanding why data engineering is not a trivial
task as it may seem at first glance.
Extract is the process of taking data from one or multiple sources, sometimes from other local
databases, other times from APIs of other systems. Load is the process of loading it into a program
or database. Finally, Transform is the process of changing its shape to suit the needs. This is what
this paper will be focusing on as it often is a bottleneck in many data processing systems. Some
examples of work can be found below.
Cleaning data which includes removing unnecessary columns, renaming columns, removing
or fixing bad values, changing data types, renaming words and a lot more.
Sorting data is also necessary, often alphabetically or date wise.
Grouping data, perhaps the same product or product group is a core feature.
Data aggregations such as summations, means or standard deviations are also often necessary.
7
Mathias Asplin Big data analytics and processing
Combining data through joins or concatenations.
Custom functions.
After the ETL process is done it can be used for other purposes, including dashboards for
daily management, queried by data scientists or analysts to use in data modeling or statistics, sent
or sold to customers, used to train or update Artificial Intelligence models and more. The ETL
process can happen multiple times before data becomes consumption ready and it can even be ran
several times a day if necessary.
2.6 Tools for ETL processes
While Hadoop and Spark can be used for ETL it is not a necessity. Unlike preparing large datasets
for a Machine Learning model to train, ETL can be both small and large in scope, usually a mix
of the two. This means most ETL processes today does not need a large cluster to be processed,
it can require a small cluster or a single node. While there are “low/no code” tools and services,
these will not be considered due to their inflexibility.
2.6.1 Microsoft Excel
Excel is a widely known tool for many purposes. A typical use is data handling, with its tabular
format it is easy both to handle and visualise data. Many ETL processes uses Excel as a processor,
even automated ETL processes due to its ability to include scripts. However, this environment
is not ideal for handling large amounts of data. It has a limit of around 1 million rows and 16
thousand columns and is not flexible enough for all use cases [17].
2.6.2 Relational databases
Relational databases is a widely used database structure similar to the tabular format of Excel,
although it is Excel that has taken inspiration from relational databases than the other way around.
Its tabular format allows for multiple benefits such as minimal storage and not storing the same
data multiple times. Structured Query Language, SQL, was designed to perform operations on
these databases, from joining multiple tables to ingesting or extracting data.
It is a powerful system capable of planning a query in advance in order to process the query
as fast as possible, including using multiple cores. If the user wants to remove all entries before a
date, the query engine will simply not read that data, for example. This is another common way to
perform operations on an ETL pipeline however it is not flexible as it needs to run on a database
and it cannot perform the same type of operations as a general purpose programming language.
2.6.3 Python
Python has become a popular programming language for data analytics and processing due to
its rich environment of libraries and ease of use because of its clean syntax and code being inter-
preted rather than compiled. Whether the use case is building web applications, making charts
or building Machine Learning models, it has libraries that supports each case in depth with very
few lines of code. While it is popular it has downsides such as its Global Interpreter Lock (GIL)
mechanism, locking the program to a single thread of execution. In addition to this, the language
is considered slow in its execution due to its inability to let the compiler to perform optimizations
before execution [18], [19].
2.6.4 Commonly used Python libraries
Due to Python’s limitations in performance, libraries are often created in other, more performant
languages. One such example is NumPy which is written in Python and C. Its core focus is
numerical mathematical functions, capable of building multidimensional arrays and using a variety
of different mathematical functions, focusing on bridging the performant C programming language
8
Mathias Asplin Big data analytics and processing
to the simple syntax of Python as well as to have a cohesive library of widely used functions and
operations [20].
A second library that was built on top of the NumPy library is Pandas. Pandas stands for
Panel Data, an econometrics term for time series data. Pandas is able to structure data into
datasets called dataframes, resembling tables. It differs in its underlying architecture however, as
it stores data column wise in memory, making it an effective choice for processing data column
wise, which most data operations requires. It is designed to effectively manipulate data whether
it’s filtering, combining or using custom functions. It is integrated into other libraries such as
Matplotlib, making it easy to turn the dataset into charts for visualization. Pandas is also built in
high performant languages such as C and Cython, a performant version of Python [19], [21].
2.7 Dataframes
Dataframes are 2D tables organized as rows and columns as figure 1. They process data column-
wise and stores the data as columns rather than rows. This makes column-wise operations faster.
Dataframes usually supports common operations for data management and analytics such as se-
lecting subsets by filtering, sorting, aggregations, group bys, joining, replacing and pivoting.
Figure 1: Printout of a Pandas dataframe
2.8 Cloud computing concepts
The topic of cloud computing is vast. Cloud computing is widely used for data storage, analytics
and processing. The three biggest cloud service providers consist of Amazon AWS, Microsoft
Azure and Google Cloud. In general, these providers share the same core services such as renting
computers and storage. This paper will use AWS terminology, but do note both Azure and Google
Cloud have similar services [22].
2.8.1 Object storage
The first fundamental concept is the object storage, called S3 by AWS, sort for Simple Storage
Service. It can be described as an abstract way of storing files. AWS sorts files into buckets and
each bucket can contain an unlimited amount of files of any type, up to a size of 5TB per file. The
best way to imagine it is an infinitely sized secondary storage device like SSD or HDD. Files can
be divided into folders much like a regular secondary storage device. Depending on usage, data
stored in an object storage can be referred to as a data lake [23], [24].
2.8.2 Compute
The second fundamental concept is called EC2 by AWS, short for Elastic Compute Cloud. It
is a computer renting service of different sizes. The smallest contains a single core and 0.5GB
9
Mathias Asplin Big data analytics and processing
of memory while the largest has thousands of cores and tens of terabytes of memory. The main
idea behind this service is being able to rent a computer for any time needed, whether its for a
few minutes or for years. It supports anything from bare metal instances to any flavor of Linux,
Windows or MacOS through ISO images which can be described as a snapshot of an operating
system. Any computer rented only contains the software provided by the ISO image and the act
of adding more software or loading data into the computer is referred to as provisioning [25].
2.8.3 Serverless
Serverless refers to the ability to run code without renting a specific EC2 instance. It automatically
provisions the computer and runs the code when necessary, this can be as a part of an ETL or as
an API service [26].
2.9 Data storage
It is important to clear up common misconceptions and differences between the main types of data
storage options and how they are used.
2.9.1 OLTP databases
OLTP databases are the most common type of database, used for transactional processing of
all kinds. The distinctive difference is both how it is used and how it works internally. OLTP
databases are primarily used for adding rows, updating rows or changing data in cells. Because of
their use case, the data is stored row-oriented in memory and on the storage medium. Microsoft
SQL server, MySQL, PostgreSQL and MongoDB are typically row-oriented [27].
2.9.2 OLAP databases
OLAP databases, often referred to as datawarehouses, are designed for analytical workloads. These
differs from OLTP workloads as they typically only involve reads and they also work on columns
rather than rows. An example of this is having a column with total sales and querying it for
the sum of all sales. These store data column-wise both in memory and on the storage medium.
Examples of databases with OLAP capabilities are AWS Redshift, Azure Synapse Analytics and
SAP HANA [27].
2.9.3 Data lakehouse
While databases, both OLTP and OLAP, requires a computer to be running at all times, data
lakehouses leverages object storages to store data and only uses a computer to run queries. This
enables a separation of storage and compute which drastically reduces costs as storage is cheaper
and the queries are not run at all times. They can be queried with SQL much like a regular
database, but they work differently internally. Examples of managed data lakehouses are Snowflake,
Databricks and AWS S3 Tables. An example of data lakehouse technology is Apache Iceberg [28].
2.10 New tools for data processing
While Pandas (released 2008) and NumPy (released 2006 as NumPy, 1995 as Numeric) has been
around for decades and are widely adopted, to varying degrees, new libraries and tools have emerged
only in the past few years.
2.10.1 Dask
Released in 2015, Dask aimed to provide a parallel and distributed environment for Python pro-
grams. Its dataframes are build on top of Pandas to parallelize and distribute computations. It is
primarily designed as a distributed environment similar to Apache Spark. Dask works lazily and
only evaluates the data when needed and then returns a Python, NumPy or Pandas object. Dask
Arrays works with NumPy objects, Dask Bags with Python lists and Dask dataframe with Pan-
das. It supports real time eager function evaluation as well as batched lazy function evaluations. It
10
Mathias Asplin Big data analytics and processing
has support for Machine Learning through scikit-learn wrappers, enabling distributed and parallel
PyTorch, Tensorflow and Keras applications [29]–[31].
2.10.2 DuckDB
Released in 2019, DuckDB is a column oriented relational database management system focusing
on online analytical processing. It is designed for data science and data engineering workloads
and promises ACID compliance similar to a regular relational database. It prides itself in being
incredibly flexible and portable, able to run almost anywhere, including Windows, Linux and
Mac, both x86 and ARM CPU architectures, web browsers, mobile phones and inside programs
themselves [32]–[34].
2.10.3 Polars
Started in 2020, Polars promises fast data processing through vectorization, parallelization as well
as lazy evaluation and out of core processing. It is written in Rust and promises no external
dependencies as well as full control of API, memory and execution of the code. Much like pandas
it is a dataframe oriented library rather than a database or optimization library [35], [36].
2.10.4 Daft
Released in 2022, Daft refers to itself the ”Unified Engine for Data Engineering, Analytics &
ML/AI”. Like Polars, it is written in Rust. It is designed for data loads of megabytes to petabytes.
It is both parallel and distributed. Daft only works lazily and its use cases range from data en-
gineering and analytics to machine learning. Unlike Polars, it is multimodal and enables image
processing for machine learning applications. Unlike Dask, which can run on multiple types of clus-
ters, including Hadoop, Daft only works on Ray, which is a distributed data processing framework
designed for large scale training of machine learning models [37].
2.10.5 Apache Parquet
Released in 2013, it is a column-oriented data file format that is designed for fast read and writes.
It promises high performance compression and encoding schemes to handle complex data in bulk.
It also stores metadata for better analytical processing [38].
Figure 2: Simplified architecture of the Parquet file both in memory and in storage
2.10.6 Apache Arrow
Open source project started in 2016, the goal being to standardize data structures in dataframe
implementations, improve interoperability with other data processing systems as well as promote
11
Mathias Asplin Big data analytics and processing
collaboration and code reusage across libraries and programming languages. It also works to ad-
dress issues of limited data types, excessive memory consumption, poor performance for operations
with strings and to accommodate larger than memory datasets [39].
2.11 Eager and Lazy evaluation
A short description is that eager evaluation executes the instructions as soon as they can and lazy
evaluation executes when needed [40]. Both concepts have advantages and disadvantages. For
eager evaluation, it can be faster on smaller datasets because it does not need the added overhead
that lazy operations do and it releases resources as soon as an expression is evaluated. For lazy
evaluation, the expression tree can be optimized to perform fewer computations before execution
and resources like memory and I/O operations can be optimized or shared for the entire expression
tree.
12
Mathias Asplin Big data analytics and processing
3. Related Work
3.1 Databases
While there has been plenty of research put into database systems, data warehouses and query
optimizations, these advancements have yet to consider new technology such as Apache Arrow and
Apache Parquet as well as the database engines and dataframe tools which uses them [41]. While
there are some papers that focuses on Duckdb, these are made by businesses that seek to market
their product based on the Duckdb engine rather than exploring data analytics and processing
tools [42]–[45].
3.2 Research using vertical scaling tools
For big data processing and analytics distributed solutions still dominate, however there are papers
where the authors have opted to use Polars or Duckdb instead of Hadoop and Spark. Both [46]
and [47] were able to leverage Polars and Duckdb not only to enable big data processing but also
to achieve better performance than distributed solutions.
3.3 Big data mindset
While some research papers includes single node big data tools, a good example of the current
mindset regarding big data can be found in [48] from 2022 and it also perfectly illustrates the
benefits of vertical scaling.
“With the development of data sensing and Internet of Things technologies, the scale
of data continues to increase to reach TB, PB, and even higher levels, which can no
longer be completed by a single computer. This puts forward higher requirements for
the computing power and timeliness of the computer to process big data. Big data
processing involves the configuration of the storage system, the division of computing
tasks, the distribution of computing load, the data migration between computers, and
the data security when the computer or network fails. The situation is much more
complicated.”
13
Mathias Asplin Big data analytics and processing
4. Method
This paper will mostly explore the emerging tools mentioned in background in a test environment
using an empirical experiment model. The test environment will be ran in a Jupyter notebook
environment on a busy system which has resources dedicated to other uses. It is meant to represent
a typical analytics exploration environment professionals uses instead of a server that is dedicated
to big data processing. This sets the bar low for most use cases. The primary concern is to test
whether these tools can do what they claim rather than how fast they can do it. While speed is
important and will be tested, it is extremely difficult to benchmark in a fair way since the tools
specialize on different things. It is easy to understand how Dask, a framework for distributed
processing, should not be compared in speed to libraries optimized for single node processing.
Nevertheless it is interesting to measure how this difference in performance can be leveraged for
big data that can be run on a laptop or similar small scale instances.
This paper will also explore the underlying file formats such as comma separated values (csv)
versus the parquet format. While both are used, it is important to measure their strengths and
weaknesses and take note of when a file format should be favored over the other. It will also explore
the associated systems such as cloud computing but this will be done as an overview rather than
actually testing them. It is also important to note that while this paper explores vertical scaling,
it will not actually use cloud computing to vertically scale the system for testing, instead all of the
tests are done on a local Windows 10 environment with 4 cores and 16GB of memory. Because of
the busy environment setup, these cores and memory cannot be utilized fully as unknown amount
of CPU and 10GB of memory is used by other processes.
It is further important to emphasize that these tools have a lot of cutomizable options and
exploring them all would be difficult within the scope of this paper. For example Polars has a SQL
interface, a streaming mode and can be ran on the GPU. Even though GPU acceleration would
increase the performance and there might be a difference in performance between the Python
interface and SQL interface, as well as a performance increase for handling larger than memory
data. These settings are not obvious enough and only the advanced users would be aware of them.
Polars also recommends compiling Polars from the source with a Rust compiler for maximum
performance and using bigidx for rows exceeding 232 4.2B rows. This paper therefore takes an
”out of the box” approach and only makes small changes such as the compression format of the
parquet files to ”snappy”, as the default differs between the different tools.
These tools also has different implementations which are subtle but noticeable in comparisons.
An example is when sorting values, Polars defaults to NULL values being placed first, while Pandas
places them last. This means in order to get the ten largest values, another operation is required for
Polars in order to drop the NULL values. Tools to address these issues exist, both Ibis and Narwhals
aims to standardize these operations and make them consistent between dataframe libraries and
SQL engines [49], [50]. At first glance it might seem strange not to use Ibis to standardize the
code but there exists multiple problems, namely that the Ibis API does not support Pandas, Dask
or Daft and its API for Polars and Duckdb is limited.
The data collected from the experiments will be analyzed using Python/Pandas and the visu-
alization libraries Seaborn/Matplotlib. It will be transformed and processed with Pandas.
14
Mathias Asplin Big data analytics and processing
5. Ethical and Societal Considerations
While the data is sourced from Kaggle under the Open Data Commons Open Database License,
a license intended to allow users to freely share, modify, and use this Database while maintaining
this same freedom for others, there are other aspects to consider that goes beyond the sourcing
and usage of data [51].
With the emergence of AI systems based on data, there has been and will always be good and
bad actors. A bottleneck to perfect these AI systems lie in the ability to handle the vast amounts of
data. This has historically been done via expensive or large scale systems which requires resources
and expertise to use and maintain. This paper explores tools and ideas which drastically reduces
the complexity of these big data systems to be handled by non-experts without training beyond
the usage of libraries similar to Pandas or SQL databases.
Even with the lowered ceiling for handling data, it is far fetched to propose the idea that this
will lead to more bad actors. Handling big data requires data in the first place and this is something
expensive to gather and store. The tools is not everything either, knowing how to handle data is
also a key aspect, from mathematical foundations to knowledge of how computers work. The bar
is lowered, but this is for those who are already familiar with tools such as Pandas.
On the positive aspects, the methods in this paper could drastically bring down costs and
computational wastefulness. Some of the authors of the tools brought up in this paper defines big
data as data that goes beyond their tools, which easily ranges up to the three digit terabytes or
petabytes. This is rare for companies, even large companies, to actually handle in an analytics
environment. This together with lowering the ceiling for handling big data for professionals enables
analytics that was previously impossible or difficult to do. Because of the cost, a single or small
team of data scientists either could not or required big data skills and systems in order to perform
analytics on datasets that would not fit into memory.
15
Mathias Asplin Big data analytics and processing
6. The experiments
6.1 Overview
One of the important questions of this paper is to answer whether the tools can perform the big
data analytics and processing at all, rather than how fast they are. This is why this paper does
not compare the results to distributed solutions such as Hadoop or Spark. Any vertical system
should in theory perform better than any distributed system should they be equal in resources.
This paper has therefore chosen the dataframe tools described in the backgrounds section. It
will explore whether the most widely used dataframe tool Pandas has advanced enough to be able
to perform big data tasks or if it still limited in its range. It will also introduce newer tools such
as Dask, Polars, Duckdb and Daft. These tools provide a multitude of settings, configurations and
options. While this paper does not explore each of them in detail, it will explore the difference
between eager and lazy execution as this is key to the ideas of out of core computing and streaming
queries.
The dataset used comes from the user Mcfurland on Kaggle, an open source competition plat-
form for data science and machine learning [51]. It is 7.36GB of multiple csv files containing
millions of songs, artists, labels and audio features from Beatport, an online music store, as well as
Spotify, a music streaming service. Only the Spotify portion of the dataset was used however as
the dataset was disjoint between the different sources. The Spotify dataset contains 5.77M tracks
and 4.68M audio features.
The algorithms used explores four aspects, with the last part being the largest one. The
algorithms are chosen in such a way to represent realistic usage of the tools, rather than making
the operations overly difficult or complex. Any complex workflow can be divided into subsections
regardless, so testing if these tools can handle multiple consecutive group by and joins does not
make much sense.
Data ingestion
Data wrangling
Data I/O
Data exploration/analytics
The versions of each library is listed as.
dask 2024.11.0
duckdb 1.1.3
getdaft 0.3.11
numpy 2.1.3
pandas 2.2.3
polars 1.12.0
pyarrow 18.0.0
6.2 Data ingestion
Data ingestion is divided into two parts. The first part reads the raw csv files and the second
part joins the tables together into one big table. For Pandas, different options are tested, both
the newer PyArrow engine and datatype backend, as well as the older types of the Python and C
engines and NumPy backend.
16
Mathias Asplin Big data analytics and processing
6.3 Data wrangling
Data wrangling is divided into four parts, each cleaning the dataset. The first part changes the
duration column from miliseconds into seconds. The second part turns the explicit column from
string into boolean. The third part fills missing datetime values and turns them into datetime
objects. The fourth part finds all artist names for tracks and combines them into the original
dataset, this is to make it easier to associate a track with all of its artists.
6.4 Data I/O
Data I/O is divided into four parts, except for Polars which has a rust implementation for their
parquet writer, which is tested separately. The first part writes to csv file. The second part reads
from csv file. The third part writes to parquet, the fourth part reads from parquet. For Pandas,
only the fastest engine will be considered for this part.
6.5 Data exploration and analytics
Data exploration/analytics is divided into 18 parts. For Pandas, both the NumPy backend and
the PyArrow backends will be tested. This part will be tested on the original dataset, which
uncompressed is about 1.7GB and 7.7M rows. For the larger datasets, the original dataset till be
duplicated row wise. It will be tested on a medium dataset, which is 5x the original size, then on a
large dataset which is 30x the original size. Finally it will be tested on a very large dataset which
is 150x the original size. There is also a set of multiple files equivalent in size of the very large
dataset. The largest dataset is above 250GB uncompressed and has more than 1 billion rows of
data with 24 columns.
1. Find top 10 popular tracks except the artists above with tracks between 2010 and 2020 that
does not contain profanity, is more than 3 minutes long, is good at dancing to or has good
energy, with a higher tempo than mean, that is not performed live.
Operations: This query is primarily about filtering. It filters a list of artists with a NOT
OR statement, a range of dates, NOT explicit, greater than or equal to integers and
floats, aggregations of means and less than or equals to. It also sorts the values in
descending order, providing the top 10 results.
2. Find the artists who has made the most tracks.
Operations: This query includes a group by, get size and sort values.
3. Find the artist that has made the most popular songs on average.
Operations: This query includes a group by, mean aggregate and sort values.
4. Find the artist who on average makes the most danceable tracks.
Operations: This query includes a group by, mean aggregate and sort values.
5. Find the artist that makes the longest tracks on average.
Operations: This query includes a group by, mean aggregate and sort values.
6. Find the artist who makes the happiest tracks on average.
Operations: This query includes a group by, mean aggregate and sort values.
7. Find the artist that collaborates with most record labels on average.
Operations: This query includes a group by, unique, size and sort values.
8. Find the artist that has the most tracks with profanity.
Operations: This query includes a group by, size and sort values.
17
Mathias Asplin Big data analytics and processing
9. Find the year with most popular tracks on average.
Operations: This query includes a group by, mean aggregate and sort values.
10. Find the artist with the most popular tracks on average per year.
Operations: This query includes a group by on two columns, mean aggregate, sort values
and unique.
11. Find the artist with the most track credits.
Operations: This query varied but in basic form it included unique, count char in string,
adding and sort values.
12. Find the artist most active throughout the years.
Operations: This query included group by, max aggregate, min aggregate, subtraction, sort
values.
13. Find the album type distribution.
Operations: This query varied but in basic form it included group by, size, sum and division.
14. Find the album type that is the most popular on average.
Operations: This query includes group by and mean aggregate.
15. Find the mean of different measurements.
Operations: This query includes the mean of multiple columns.
16. Find the median of different measurements.
Operations: This query includes the median of multiple columns.
17. Find the standard deviation of different measurements.
Operations: This query includes the standard deviation of multiple columns.
18. Find out what makes a track popular.
Operations: This query includes a group by, mean aggregate on multiple columns and sort
values.
6.6 Details about tests
The tests for all parts will be run five times for the regular dataset sizes. The time to run the
medium, large, very large and many files datasets is considerably higher, so they will only be run
once. The tests will measure if the operations works or if there’s a limit in terms of memory,
essentially if the operation crashes or it is clear the tool is at its limits due to taking too long time,
they will not run for the larger datasets. Each operation will be timed with the walltime in seconds
as this is the most representative way of evaluating the performance for the whole system. For the
eager evaluation each part will be timed individually and for the lazy evaluation two of the four
parts will be timed as a whole, namely the data ingestion and data wrangling parts.
The libraries will be tested for both their in memory capabilities and out of memory capabilities.
While Dask and Daft are both only lazy, dask reduces the result to a Pandas dataframe while Daft
has its own format. Therefore Daft is also tested by trying to force it to be eager, essentially
evaluating the parts one at a time rather than all together. Pandas is eager only and will be
treated such. While Duckdb is lazy, when creating a table it converts the data into its own format.
This is how a database is expected to work, however in this case the only usage that is interesting
is the data analytics and processing. Therefore tables will be considered as the normal Duckdb
option and the Duckdb Views as the out of core version. These are in principle not the same as
18
Mathias Asplin Big data analytics and processing
eager versus lazy evaluation, but it is the closest way to practically evaluate them versus other
tools. Duckdb Views has limitations however, mainly related to data processing as it does not
support dropping columns. Lastly, the Daft library does not support writing single files, instead
it uses unique UUIDs for each file and the only option is to name the folder these files are written
to. Daft itself determines how many files it shall write.
Generally, each library has its strengths and weaknesses. For example neither Dask, Daft nor
Apache Spark supports calculating a median because they are made for distributed environments.
If there is an operation that a library does not support, it will simply be ignored or worked around
without using Python code.
19
Mathias Asplin Big data analytics and processing
7. Results
7.1 Data Ingestion
7.1.1 Pandas
Pandas stood out in options for both reading csv files and merging them together, having three
engines for reading/writing csv files, namely the Python implementation, the C implementation
and the PyArrow implementation. It also supports two backends, NumPy and PyArrow. PyArrow
seemed to be ran in parallel by default, which can be seen by its low ingestion time compared
to the Python and C alternatives according to figure 3. While the Python engine is the most
configurable, it also seems to be the slowest by a large margin, about 50x slower than the PyArrow
implementation and 3x slower than the C implementation. The C engine was about 15x slower than
the PyArrow engine which can not be fully explained by it simply being parallel as the environment
only has 4 cores. It seems like it also handles data better in other ways which explains why it was
over 7x faster when joining the tables. Overall the Pandas PyArrow implementations were on par
with the other libraries, both for reading the files and for merging them.
7.1.2 Polars
Polars seemed to fair better when combining the two operations of reading and merging the files,
almost a 2x improvement. This can be explained by its internal query optimization which it uses
for lazy evaluations but not for eager ones.
7.1.3 Duckdb
Duckdb evaluates everything lazily but it was not as fast as Polars, being even slower than Polars
eager evaluation. This can partially be explained by Duckdb’s internal mechanisms that converts
the data into its own formats, which possibly requires extra steps.
7.1.4 Daft
Daft did not seem to outperform neither Polars or Duckdb and trying to force it to evaluate
the reading and the merging separately did not matter for the performance, possibly due to not
optimizing the query plan as well as Polars.
7.1.5 Dask
Dask used the PyArrow implementation which was overall about 3x slower than Pandas. This
is both surprising and not at the same time. Dask was made for distributed computing so the
overhead might require more on a single machine, but it also is able to carry out the executions in
parallel. A possible explanation is that it simply did not carry out any executions in parallel while
having a lot of overhead.
7.2 Data Wrangling
7.2.1 Pandas
Pandas seemed to struggle a lot with making a column that listed all artists for the tracks. The
PyArrow version seemed to be slower than the NumPy version according to 4, which could be
because of a bad implementation of the formats or something internal with Pandas. For the other
operations it kept up to pace with the other libraries.
7.2.2 Polars
Polars did not seem to benefit a lot from the query planning for the data wrangling. This is to be
expected as it is no guarantee of a speedup. Overall it was slightly faster than the eager evaluation
and faster than the other libraries.
20
Mathias Asplin Big data analytics and processing
Figure 3: Data Ingestion walltime average over 5 runs
7.2.3 Duckdb
Duckdb already seemed to interpret the string column of explicit as a boolean so that operation
could not be tested. It performed better than the other libraries, except for Polars.
7.2.4 Daft
Daft seemed to do better when doing all of the operations at once. Even though it was slower than
Polars, it is clear the query plan massively improved the execution time as it executed all of the
operations faster than the first one.
7.2.5 Dask
Dask once again required a longer time to execute the query than Pandas, which is surprising
because all of the operations are embarrassingly parallel.
7.3 Data IO
7.3.1 Pandas
Pandas C engine csv reader and NumPy data type backend was slower than the other libraries
according to 6. However the PyArrow read and write implementations were on par with the other
libraries.
7.3.2 Polars
Polars has a rust implementation of the parquet reader and writer, but this had mixed results,
being slower to read but faster to write for the eager mode and the opposite for the lazy mode.
This could be explained by the lazy mode requiring to read and then write the result back to file
however, since the source was the file itself. There is a roughly 3x speed improvement when using
the rust implementation.
21
Mathias Asplin Big data analytics and processing
Figure 4: Data Wrangling walltime average over 5 runs
7.3.3 Duckdb
Duckdb did not stand out in many ways, being average all around both when it came to reading
and writing the two file formats.
7.3.4 Daft
Daft stood out when it was reading the parquet files, being more than 10x faster than the other
libraries. This is probably not true and Daft is simply reading the metadata from the parquet files
or only reading a part of the data.
7.3.5 Dask
Dask was again slower than Pandas, but this time not by a lot. The parquet reader and writer was
on par with the other libraries, but like with Pandas, this is probably attributed to the PyArrow
format.
7.3.6 Files
From figure 5the file sizes of each of the libraries seemed to be the same, with some slight variance.
The csv files were 1.7GB ±50MB except for Daft. For the parquet files they varied more. The
Duckdb and Dask files were 725MB, the Polars (Rust) was 623MB, the Pandas, Polars (PyArrow)
and Daft files were 515MB. All of the files had the same compression format, snappy. Overall
it seemed like the parquet files were about 3x smaller than the csv files when using the snappy
compression.
22
Mathias Asplin Big data analytics and processing
Figure 5: File sizes in megabytes
7.4 Data Ingestion, Wrangling and IO summary
7.4.1 General overview
The overview provided in figure 7shows how some libraries stand out more. Together with the
display of the best performance in figure 8it is clear that Polars outperform the other libraries
in a general case for the common tasks of input/output, cleaning data and sorting it. The only
exception is Daft being faster with reading and writing parquet files but this is not necessarily the
case as previously stated.
Even though Dask and Daft are distributed libraries, Daft seems to outperform Dask for these
benchmarks in a single node environment. Both Duckdb and Daft seem to be generally slower
than Polars but only by a few factors, rather than the 100x difference between Polars and Dask
for the Data Wrangling tasks for example.
7.4.2 CPU utilization
When looking at an overview of the CPU utilization in figure 9and the best performance in figure
10 it gives another angle of how these libraries uses resources. High CPU utilization can both be
a good or a bad thing, but in this context, the mere ability to fully use the CPUs available is a
good sign. While both Polars and Duckdb has good CPU utilization, Polars is able to optimize the
queries to make them run faster or alternatively, the Duckdb implementation of this benchmark is
not optimal.
23
Mathias Asplin Big data analytics and processing
Figure 6: Data IO walltime average over 5 runs
7.5 Data Exploration/Analytics
7.5.1 Small dataset
The small dataset in figure 11 is the smallest tested and the only dataset to entirely fit into memory.
It contained roughly 7.7M rows and common for all datasets, 24 columns. For the eager dataframes
it was first loaded into memory. For the lazy dataframes it was sourced from a parquet file. Daft in
memory and out of core solutions would work the same, which is why only one version is present.
At first glance it is clear that Dask is considerably slower than the other libraries. Theoretically
this should not persist due to Dask being able to parallelize the computations. Being strictly worse
than Pandas, no matter the implementation, seems odd. There are possibilities of misconfigurations
but they should be this evident.
Duckdb was faster than Polars this time around, for almost all tasks. This could be evident by
the fact that Duckdb is not made for data engineering tasks, to clean data, order it around and
merge files, but is instead built as an OLAP which these analytics tasks represent much better.
Polars on the other hand is made for both, which is why it is not as well optimized as Duckdb but
is also able to outperform Duckdb with the data engineering tasks.
While the in memory Duckdb implementation was faster than the out of core one, Polars was
the opposite with the out of core implementation being faster than the in memory version for
most tasks. Notably only the mean, median and standard deviation aggregations were slower.
This could be explained by Polars being able to leverage the metadata from parquet files more
effectively whereas Duckdb might have similar indexing mechanisms internally.
Daft could not perform task 16 due to median not being implemented and Daft seemed to
struggle with task 7 too, which was considerably slower than the Pandas implementation. The
task itself was not unusual either, it had simple operations such as unique, size and sort values.
It is probably the unique operation Daft struggles with. As it is a distributed framework, it has
24
Mathias Asplin Big data analytics and processing
Figure 7: Data summary for ingestion, wrangling and I/O
to optimize for a distributed solution which has poor performance on a single node. For the other
operations Daft seemed to keep up with Polars.
7.5.2 Medium dataset
The medium dataset in figure 12 had roughly 38.5M rows and in its compressed form required
3.6GB, which would be roughly 11GB in a csv format. This dataset does not fit into memory as
it would require more than 20GB of RAM. Because of this, Pandas can not handle the data or
perform any analytics on it.
Polars seems to struggle with the in memory implementation. While Polars can swap to disk,
it seems to be an expensive operation. There are some tasks it keeps up with Duckdb and Daft
on but in general it is slower. The out of core implementation seems to be slightly better than
Daft on this scale, with a few tasks that takes a long time to complete such as task 1 and task 11.
Overall it seems like the out of core implementation performs better than the in memory one at
this scale, but it highly depends on the task.
Duckdb manages to be performant on this scale but the in memory implementation struggles
with task 11 and the out of core implementation shows sign of struggling but manages to complete
the query as second fastest.
Daft can not complete task 7 which is no surprise considering it struggled on the small dataset.
Other than that it seemed to keep up pace with Polars and was about 2-3x slower than Duckdb
overall.
Dask is starting to perform better, even outperforming Polars for some tasks, while struggling
for others. Task 7 requires over 24 minutes to complete, compared to Duckdb which only requires
roughly 5 seconds.
25
Mathias Asplin Big data analytics and processing
Figure 8: Data summary best performance for ingestion, wrangling and I/O
7.5.3 Large dataset
The large dataset in figure 13 had roughly 192.5M rows and required 18.1GB in its compressed
format. This is roughly 54GB in a csv format.
Both Polars in memory implementation and Dask can not run the queries at this scale. Polars
out of core implementation is able to run the queries but much like the Duckdb in memory im-
plementation, they struggle with them. Both Polars and Duckdb are unable to run task 11 and
Polars out of core is unable to run task 1. Daft is able to run the queries but it struggles on task
1. It is the only library that can run task 11 at this scale too.
7.5.4 Very large dataset
The very large dataset in figure 14 had roughly 1.15B rows and required 108.8GB to store in its
compressed form. This is roughly 326GB in a csv format.
Only the Duckdb out of core implementation and Daft are able to run queries. Duckdb is
unable to run tasks 11 and 16 while Daft is unable to run tasks 1, 7, 11 and 16. Both frameworks
seems to be on par with eachother and the only big difference is on task 8 where Daft is very slow
compared to Duckdb.
7.5.5 Many files dataset
The very large dataset split up into 150 files in figure 15 had the same amount of rows and required
the same amount of space as the very large file.
Both Duckdb and Daft were able to run all of the queries in relative terms about the same
time. The only major difference is task 8 for Daft which is now about 15x faster.
26
Mathias Asplin Big data analytics and processing
Figure 9: Data summary of CPU utilization for ingestion, wrangling and I/O as a factor of CPU
time divided by walltime
7.6 Data Analytics summary
7.6.1 General overview
When looking at the overall summary between the different tools in figure 16 it is much clearer
that Duckdb is overall faster until the dataset becomes very large, at which point Daft catches up.
Daft had a query when sourcing from a single very large file that it struggled with, which makes
it seem like multiple files is a lot faster, but it is only that single query that stands out.
Even though it seems like the medium dataset is faster than the small dataset for some tools,
this is not the case as Daft did not manage to complete one of the tasks which is present in the
small dataset. The results are calculated between the tools rather than between the file sizes.
When looking at the individual performance summary in figure 17 it is very clear that Duckdb
is fastest in the majority of queries and it is only when the data size becomes very large that Daft
becomes faster. Relatively speaking, they are very similar in speed however.
7.6.2 CPU utilization
In the figure 18 for the CPU utilization there is a clear trend of both Polars and Duckdb spending
less time utilizing the CPU as the datasets grow larger. Strangely enough, both Dask and Daft
seem to increase their CPU utilization, but as the dataset becomes very large Daft once again
spends less with utilizing the CPU.
When observing figure 19 it is clear that Duckdb is able to utilize the CPU better than other
tools. While this did not lead to better performance than Polars for the data ingestion, wrangling
and I/O tests, it seemed to contribute to the better performance for analytics. It is only when the
datasets become large and very large that Daft seem to utilize the CPUs better, but like Polars
27
Mathias Asplin Big data analytics and processing
Figure 10: Data summary of best CPU utilization performance for ingestion, wrangling and I/O
and Duckdb, this did not lead to a performance increase over Duckdb for the large dataset, only
for the very large dataset.
7.6.3 Walltime difference between datasets
When observing the figures 20,21,22,23 there is a clear trend of a non-linear growth as the
dataset becomes larger. There are some exceptions that can be explained by the tools or files
holding metadata. If the increase is high it is a good sign that the tool is struggling to handle the
dataset and queries. Not every tool follows the same pattern. Dask seems to be faster at analyzing
the medium dataset than the small dataset. The small dataset was loaded into memory as it fits
in memory so Dask might perform better if lazily loaded from parquet files instead. Daft seems to
have the slowest growth, indicating an ability to handle larger datasets. However, Daft is also the
library that is unable to complete the most queries.
28
Mathias Asplin Big data analytics and processing
Figure 11: Data Analytics small dataset walltime average over 5 runs
Figure 12: Data Analytics medium dataset walltime for one run
29
Mathias Asplin Big data analytics and processing
Figure 13: Data Analytics large dataset walltime for one run
Figure 14: Data Analytics very large dataset walltime for one run
30
Mathias Asplin Big data analytics and processing
Figure 15: Data Analytics many files dataset walltime for one run
Figure 16: Data Analytics summary for all runs, compared between datasets
31
Mathias Asplin Big data analytics and processing
Figure 17: Data Analytics best performance summary for all runs, compared between datasets
Figure 18: Data Analytics summary of CPU utilization as a factor of CPU time divided by walltime
for each dataset size
32
Mathias Asplin Big data analytics and processing
Figure 19: Data Analytics summary of best CPU utilization performance between the tools for
each dataset size
Figure 20: Data Analytics time increase as a factor when going from a small dataset to a medium
dataset. A growth of 5x means it is linear.
33
Mathias Asplin Big data analytics and processing
Figure 21: Data Analytics time increase as a factor when going from a medium dataset to a large
dataset. A growth of 5x means it is linear.
Figure 22: Data Analytics time increase as a factor when going from a large dataset to a very large
dataset. A growth of 6x means it is linear.
34
Mathias Asplin Big data analytics and processing
Figure 23: Data Analytics time increase as a factor when going from a large dataset to a very large
dataset in many files. A growth of 6x means it is linear.
35
Mathias Asplin Big data analytics and processing
8. Discussion
8.1 Benchmarking
8.1.1 Overview
This paper only scratched the surface of the possibilities of these tools in an environment that is
realistic for data exploration rather than automated pipelines. As such, the results are limited in
scope and resources. While these tools can scale, that also depends on the environment. The tools
can handle data that is larger than memory and while some queries begin to struggle regardless of
tool, the dataset has to be many times the size of RAM before degradation is noticeable.
A quick summary of the results would boil down to Polars being seemingly better at data
engineering tasks while Duckdb and Daft are able to scale the analytics better This simply scratches
the surface of the how these tools can be used. Parquet is a major component not only in storing
the data in a compressed format so it requires less space but it also provides metadata that allows
these tools to more effectively query the data.
The tools also have many different settings, options and configurations to use but without
expertise it is difficult to fully utilize them. Polars would probably be more performant if the
streaming mode is used and while it is one configuration, it is not entirely obvious if the lazy
mode does not already use streaming mode. Duckdb converts the data into its own internal format
by default when creating a table, but when creating a view it uses the format of the files. Even
though this difference is not obvious, it was required to even measure the out of core performance
of Duckdb because unlike Polars, it is not ambiguous.
8.1.2 Continous improvements
It is important to emphasize that these tools are constantly being improved. They are still very
new and Polars is public about their roadmap to improve the tool while Duckdb, Daft and Dask
either do not have one or it is not public [52]. These improvements can be big, such as GPU
support, a better streaming engine or a better parquet reader. They can also be small, such as
supporting more API operations or more options for the API. As these tools mature they will
become more integrated into their environments and become more stable.
8.2 Practical considerations
Speed and ability to perform out of core computing are two of the most important aspects when
working with big data but they are not the only aspects. A part of the results has shown that
Pandas, while created in 2008, also improves constantly with the integration of the PyArrow
backend and PyArrow parquet readers in 2022 which shows massive speed improvements with
I/O operations compared to NumPy [53]. In fact, reading csv files proved to be fastest with the
PyArrow backend in Pandas, barely being faster than the Polars version. If the use case is ingesting
files, joining them and writing to files it seems like a good option, especially considering Pandas
has many options in the csv reader that can be used to handle data from the source. These options
might be necessary for messy data or data in encodings that the newer tools does not support yet.
Dask, while seemingly the worst alternative, is able to support almost all of Pandas API
operations and perform them as an out of core solution even on a single machine. This is great for
backwards compatibility with earlier systems built with Pandas.
Many of the principles can be applied to distributed computing too. Streaming as a concept and
parquet files are not unique to these tools. The advantages come in the simplistic implementation
of the tools. Rather than setting up Hadoop backend with Spark runtimes on multiple machines
which may or may not have enough memory in each node to run the query, it can be streamed
inside a single node. The overhead both in the skills required to write and run the programs is
36
Mathias Asplin Big data analytics and processing
almost completely removed. If a data analyst or data engineer knows of Pandas or SQL, they can
use Polars, Duckdb or Daft to scale the data analytics or processing needs.
8.3 Practical limitations
Major limitations in this paper was the storage and computing capacity. Storing hundreds of
gigabytes of data is difficult to do even on modern computers. The total size of the data files is
250GB in its compressed form and the author could not reliably store it on a laptop and run it, due
to the hard drive not having enough space. The tools, even if they use streaming mode, can swap
data to disk, which lead to memory issues. Testing how well the parquet format holds against other
formats such as csv, hdf5, feather, pickle and avro is interesting but there is already consensus that
parquet is more performant, better at compression and holds metadata that is useful for analytics
workloads [54], [55]. Parquet is not perfect however and there are other formats that competes
with it such as Apache ORC and Lance. These file formats are either not supported by the tools or
do not have integrations with them yet. The Lance format, while optimized for Machine Learning
workloads, promises >100x speed improvement over parquet for random access workloads [56].
Computing capacity was also a limitation because of the reason stated above. The vertical
scaling is supposed to be simulated by the increase in workload sizes but this is not entirely
accurate. It would be interesting to test the datasets and much larger ones, going into the 10s or
100s of terabyte range, be tested in a cloud environment. The advancements of cloud technology
enables machines with tens of terabytes of RAM and hundreds of cores. For embarassingly parallel
workloads that data analytics usually are, they should give a considerable speedup and be able to
scale the workloads in a performant manner. If 4 cores and roughly 6GB of usage RAM is enough
to perform most queries on datasets with over 1B rows, using 100 cores and 600GB RAM should be
able to handle over 100B of rows, in theory. The cloud also has the advantage of networked disks
with object store that is able to store an unlimited amount of files according to AWS. Realistically,
storing large datasets as one file carries no real advantage and instead it could be divided into
multiple files, for example organized in folders according to date if it is time series data.
This paper does not include any programmer optimizations that is usually done in big data
environments. This can be separating files into a folder hierarchy and combining them for a query,
batch processing files, storing metadata about the data or using mathematical formulas to avoid
expensive computations. Again, while it is interesting to explore exactly how fast one can query
over a billion rows, it is not as necessary as exploring the baseline and workloads under stress.
8.4 Costs and relation to cloud computing
When discussing the cost there are many things to consider. The first and obvious part is how a
decrease in execution time leads to lower costs and the second obvious part is using the smallest
instance possible. On-premise solutions would not be bound to cost but merely speed and if the
query can be executed. For cloud computing solutions it gets more complicated. Apart from
Pandas and Polars, where a GPU would be relevant, the important aspects to consider is the
amount of CPU cores, primary memory as well as networking if the data is stored in the object
storage S3, which is the most typical scenario.
Since the cost scales relatively linear with the amount of cores and memory available on the
instance, a simple way to reduce the cost even further is to use a larger instance. According to
figures 20,21,22,23 the scaling is non-linear in a general case as the dataset grows, which means
using a larger instance will lower the costs. Taking Duckdb out of core as an example, the scaling
was roughly linear at 5x the size of the data, meaning a median of 5x longer walltime to compute.
As the dataset grew by 5x again, the median walltime jumped to around 9x. As the dataset grew
by 6x, the median walltime is at around 22x. It seems like a cost optimal size is a few times larger
than RAM memory before the walltime begins to increase more rapidly.
37
Mathias Asplin Big data analytics and processing
When looking at available EC2 instances, AWS lists u7inh-32tb.480xlarge which has up to 1920
virtual CPUs and 32.768TB of memory. By doing approximate and rough calculations, assuming
the ratio stays relatively the same, such a machine could handle over one petabytes of data and a
few trillion rows of the same dataset. This is also the size Spark works reliably within [57].
8.5 New perspectives on big data
Much like Fivetran’s study, other analysis on big data systems have concluded the same thing,
the vast majority of use cases are not big data at all and do not require big data platforms [8],
[58]–[60]. Motherduck, a company that provides cloud database services using Duckdb claims the
following based on a paper from the Redshift team at AWS [45].
95% of databases don’t qualify as Big Data and 99.98% of users never run big data queries.
When people have big data, 99% of their queries are against smaller tables.
99.5% of queries over “big data” tables, query “small data” amounts.
From the above results, you need to use big data tools 0.5% of 1% of 5% of the time, which
is something like the “big data 0.00025%”.
This puts big data in companies into perspective for real world use cases. If companies uses
the same solution and heavier cost for each of these use cases they have a lot of money to save on
understanding their data pipelines.
Motherduck’s definition of big data differs strongly as they consider 10TB of data ”small”.
This can mean databases with tables that has billions of rows which could require complex joins
with other tables. They do not consider data processing considerations either, as the data could
be stored in data lakes and only refined data is stored in a data warehouse like Redshift.
Healey K. write his opinions on the issue in a blog post [60].
“Both [Snowflake, Databricks] of these platforms are charging their customers for Fer-
raris when most companies would be happy with a Toyota Camry with some After Mar-
ket parts. Don’t get me wrong, both platforms are doing impressive things; Snowflake’s
Data Cloud is really coming to life with their partnership with NVIDIA and Databricks
is putting runs on the board with the Spark Human API. These are not reasons you
should use their platforms. These are marketing gimmicks to garner the attention of
the decision-makers to loosen the company’s purse strings. This is a fundamental mis-
take in the way businesses operate. Pay to make a problem disappear. We operate as if
these platforms solve problems when fundamentally they do not and often exacerbate
large inefficiencies that already exist within an organization. If you want to make a big
impact be Anti-Hype.”
Other blog posts agree with Healey such as Sadeghi A who also refers to Motherduck and the
Redshift team at AWS and writes as a conclusion [45], [58].
“The rise of single-node processing engines represents a pragmatic shift in data ana-
lytics. As hardware capabilities continue to advance and single-node engines become
more sophisticated, the need for distributed processing will likely continue to decrease
for most organisations.
For the vast majority of companies, single-node processing frameworks offer a more
efficient, cost-effective, and manageable solution to their data analytics needs. As we
move forward, the key is not to automatically reach for distributed solutions, but to
carefully evaluate actual workload requirements and choose the right tool for the job.
The future of data processing may well be less about managing clusters and more about
leveraging the impressive capabilities of modern single-node systems.”
38
Mathias Asplin Big data analytics and processing
Sadeghi goes further in depth regarding the aging of data and workload patterns, stating that
based on Meta and eBay data access patterns, most data is accessed in the first 48 hours by ETL
pipelines and 95% of queries on the data are executed within the first 30 days. They further
emphasize that.
“This pattern holds remarkably consistent across industries and use cases. Even in
organisations with large datasets, most analytical workloads operate on recent, aggre-
gated data that could easily fit within single-node processing capabilities.”
8.6 Business considerations and dynamic data storage
The reason why companies go for solutions such as Snowflake and Databricks depends but it is
clear that costs and complexity can be drastically reduced by leveraging a solution that enables
data storage scaling while also being able to both utilize vertical and horizontal scaling methods for
the data processing and analytics. The downsides of data warehouses is that for lower data loads
there still needs to be an instance running. The downsides of managed platforms like Snowflake
and Databricks is being bound to their pricing model and compute engine.
Apache Iceberg is an example of an analytic table format that works as a data lakehouse [28].
It enables separation of storage and compute which essentially means data lake in the back and
data warehouse in the front. Such formats provides the means of using the simple approach of
vertical scaling for most query needs and only going horizontally when absolutely required.
While there are additional data management means that are useful, this further emphasizes
how the perspective on big data engineering is constantly shifting and why a dynamic approach to
data storage and processing is needed. Motherduck refers to trends going back to ”small data”, this
paper proposes that Polars, Duckdb and Daft enables big data analytics and processing everywhere
[59].
It is not just the fast evolution of cloud services and hardware that has made companies slow
to adapt to vertical scaling. Data warehouses are usually locked in solutions or vendor specific,
making a migration difficult. In addition to this, a ”de facto” standard of Hadoop and Spark has
made companies comfortable with the tools even for smaller scale operations. Horizontal scaling
will definitely be relevant in the future, but a new mindset of how data is stored and how it can
be accessed by different compute engines is needed for vertical scaling big data tools to become
standardized.
8.7 Big data in universities
A final aspect to consider is what is taught in universities. alardalen’s University (MDU) has
multiple courses on the topic of big data, either as a complement for another course component
or as a core aspect of the course itself. Out of the courses available on MDU’s website, only two
mentions the tools used for big data and both of those refer to Hadoop and Spark. One of which
even writes “The de facto Big Data processing frameworks: Hadoop and MapReduce”. While
teaching about distributed tools and methods is important, not exploring the single node methods
and pushing them to their limits gives a false view of how big data processing and analytics works
not just in theory but in practice too. This is not unique to MDU either. Uppsala University
(UU) and The Royal Institute of Technology (KTH) also has courses designed in a similar fashion
[61]–[64].
39
Mathias Asplin Big data analytics and processing
9. Conclusions
9.1 General conclusions
Multiple new tools and concepts have emerged in recent years to improve on parallel and distributed
computing for big data analytics and processing. These tools offers new ways of processing data
by introducing out of core computing concepts which can effectively work with data larger than
memory by combining new file formats that can both store metadata that helps with analytics
workloads and compress the data to take an average 3x less space compared to the csv format.
By using a streaming or hybrid streaming approach tools such as Polars, Duckdb, Daft and to
some extent Dask are able to analyze and process larger than RAM datasets. On a 4 core, 16GB
RAM machine in a busy environment where around 8GB memory is used by other processes,
Duckdb and Daft managed to do analytics and process a parquet file with the size of above 100GB
as well as 150 parquet files totalling about 100GB. These files contained above 1 billion rows.
9.2 Pandas evolution
Pandas, when using PyArrow backend, managed an above 10x speedup of computations such as
I/O of csv files and dataframe joins compared to the NumPy backend. For analytics and data
processing operations the PyArrow backend is faster for most operations involving numeric data
types but not string data types. Pandas still can not handle larger than memory datasets however.
9.3 Polars
Polars, a new dataframe tool with both eager and lazy modes, was able to perform better than
the other tools for basic data engineering tasks such as data ingestion, data wrangling and data
I/O. Polars is also able to handle larger than memory datasets but as the dataset grew larger it
was unable to keep up with Duckdb and Daft.
9.4 Duckdb and Daft
Both Duckdb and Daft were able to handle very large datasets, both from single files and multiple
files. While Daft performed worse than the other tools for the small datasets, it generally performed
the best for very large datasets. Duckdb proved itself in being very performant both for small and
very large datasets. The tools had their limitations on certain query tasks, with Daft being unable
to perform 4 out of 18 and Duckdb being unable to perform 2 out of 18.
9.5 Paired with cloud computing
With advancements in cloud computing, vendors such as Amazon Web Services now offers single
node machines with up to thousands of cores and tens of terabytes of RAM. In theory, the new
tools and file format in combination with services such as EC2 and S3 are able to handle datasizes
up to one petabyte and trillions of rows.
Together, tools such as Polars, Duckdb and Daft are able to perform data analytics and process-
ing that previously was only possible in a distributed environment such as Hadoop or Spark. This
can in theory reduce code complexity, environment complexity, computing costs and dependence
on 3rd party services such as SAP, Snowflake and Databricks while still maintaining the ability to
handle big data effectively. The tools also allow for a lower skill ceiling enabling data analysts, data
scientists and data engineers without knowledge of distributed systems and solutions to handle big
data.
9.6 Future Work
This paper focused on a general overview of whether vertical scaling and out of core processing
is a possibility for big data analytics and processing. As such, it included tools not designed for
40
Mathias Asplin Big data analytics and processing
big data processing or vertical scaling such as Pandas and Dask in order to give a fair comparison
to what was available when distributed computing became a defacto standard. In future work,
exploring the three tools that performed the best, Polars, Duckdb and Daft would be beneficial.
A few examples include.
Comparisons with distributed frameworks such as Spark and Dask on equal resources.
Comparing Daft’s performance in a distributed environment versus a single node on equal
resources.
Comparing numeric operations versus string operations.
Explore these tools ability to process data on larger datasets above 10TB both on smaller
and larger instances.
Explore different operations such as joins on tables above one billion rows.
Explore the SQL interfaces of Polars and Daft.
41
Mathias Asplin Big data analytics and processing
References
[1] The world’s most valuable resource is no longer oil, but data, Accessed: 2025-01-05. [Online].
Available: https : / / www . economist . com / leaders / 2017 / 05 / 06 / the - worlds - most -
valuable-resource-is-no-longer-oil-but-data.
[2] A short history of big data, Accessed: 2024-12-22. [Online]. Available: https://www.bigdataframework.
org/knowledge/a-short-history-of-big-data/.
[3] The most common problems companies are facing with their big data analytics, Accessed:
2024-12-22. [Online]. Available: https://barc.com/challenges-big-data-analytics/.
[4] Top 5 challenges for hadoop mapreduce in the enterprise, Accessed: 2024-12-22. [Online].
Available: https://www.bigdatawire.com/2012/10/29/top_5_challenges_for_hadoop_
mapreduce_in_the_enterprise/.
[5] 5 common performance problems in apache spark: How they impact your job execution time,
Accessed: 2024-12-22. [Online]. Available: https://vivekjadhavr.medium.com/5-common-
performance- problems- in- apache- spark- how- they- impact- your- job- execution-
time-850974e9af24.
[6] Key concepts & architecture, Accessed: 2025-01-05. [Online]. Available: https : / / docs .
snowflake.com/en/user-guide/intro-key-concepts.
[7] Supercomputing vs. distributed computing: A government primer, Accessed: 2025-01-05. [On-
line]. Available: https://datainnovation.org/2014/01/supercomputing-vs-distributed-
computing-a-government-primer/.
[8] How do people use snowflake and redshift? Accessed: 2025-01-05. [Online]. Available: https:
//www.fivetran.com/blog/how-do-people-use-snowflake-and-redshift.
[9] Amazon ec2 instance types, Accessed: 2024-12-22. [Online]. Available: https://aws.amazon.
com/ec2/instance-types/.
[10] What is big data? Accessed: 2024-09-29. [Online]. Available: https://www.oracle.com/big-
data/what-is-big-data/.
[11] Big data: What it is and what you need to know, Accessed: 2024-09-29. [Online]. Available:
https://www.informatica.com/se/resources/articles/what-is-big-data.html.
[12] What is big data? Accessed: 2024-09-29. [Online]. Available: https://cloud.google.com/
learn/what-is-big-data.
[13] Google’s infringenovation secrets, Accessed: 2025-01-05. [Online]. Available: https://www.
forbes.com/sites/scottcleland/2011/10/03/googles- infringenovation- secrets/
?sh=4add793f30a6.
[14] The history of hadoop, Accessed: 2025-01-05. [Online]. Available: https:// medium.com/
@markobonaci/the-history-of-hadoop-68984a11704.
[15] Apache spark and hadoop working together, Accessed: 2025-01-05. [Online]. Available: https:
//www.databricks.com/blog/2014/01/21/spark-and-hadoop.html.
[16] Apache spark history, Accessed: 2025-01-05. [Online]. Available: https://spark.apache.
org/history.html.
[17] Excel specifications and limits, Accessed: 2025-01-05. [Online]. Available: https://support.
microsoft.com/en- us/office/excel- specifications- and- limits- 1672b34d- 7043-
467e-8e27-269d656771c3.
[18] What is the python global interpreter lock (gil)? Accessed: 2025-01-05. [Online]. Available:
https://realpython.com/python-gil/.
[19] Pandas, Accessed: 2025-01-06. [Online]. Available: https : / / www . nvidia . com / en - us /
glossary/pandas-python/.
[20] Numpy, Accessed: 2025-01-05. [Online]. Available: https://numpy.org/.
[21] W. McKinney, “Data structures for statistical computing in python,” Jan. 2010, pp. 56–61.
doi:10.25080/Majora-92bf1922-00a.
42
Mathias Asplin Big data analytics and processing
[22] The 10 leading cloud service providers of 2024, Accessed: 2025-01-06. [Online]. Available:
https://www.cloudoptimo.com/blog/the-10-leading-cloud-service-providers-of-
2024/.
[23] Amazon s3, Accessed: 2025-01-06. [Online]. Available: https://aws.amazon.com/s3/.
[24] M. Zaharia, A. G. 0002, R. Xin, and M. Armbrust, “Lakehouse: A new generation of open
platforms that unify data warehousing and advanced analytics,” in 11th Conference on In-
novative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online
Proceedings, www.cidrdb.org, 2021. [Online]. Available: http : / /cidrdb.org/cidr2021 /
papers/cidr2021_paper17.pdf.
[25] Amazon ec2, Accessed: 2025-01-06. [Online]. Available: https://aws.amazon.com/ec2/.
[26] Aws fargate, Accessed: 2025-01-06. [Online]. Available: https://aws.amazon.com/fargate/.
[27] What’s the difference between olap and oltp? Accessed: 2025-01-05. [Online]. Available: https:
//aws.amazon.com/compare/the-difference-between-olap-and-oltp/.
[28] Apache iceberg, Accessed: 2025-01-06. [Online]. Available: https://iceberg.apache.org/.
[29] Parallel python fast and easy, Accessed: 2025-01-06. [Online]. Available: https://www.dask.
org/.
[30] Dask, Accessed: 2025-01-06. [Online]. Available: https://github.com/dask/dask.
[31] M. Rocklin, “Dask: Parallel computation with blocked algorithms and task scheduling,” Jan.
2015, pp. 126–132. doi:10.25080/Majora-7b98e3ed-013.
[32] Duckdb, Accessed: 2025-01-06. [Online]. Available: https://duckdb.org/.
[33] Duckdb what’s the hype about? Accessed: 2025-01-06. [Online]. Available: https : / /
betterprogramming.pub/duckdb-whats-the-hype-about-5d46aaa73196.
[34] M. Raasveldt and H. uhleisen, “Duckdb: An embeddable analytical database,” in Pro-
ceedings of the 2019 International Conference on Management of Data, ser. SIGMOD ’19,
Amsterdam, Netherlands: Association for Computing Machinery, 2019, pp. 1981–1984, isbn:
9781450356435. doi:10.1145/3299869.3320212. [Online]. Available: https://doi.org/
10.1145/3299869.3320212.
[35] Dataframes for the new era, Accessed: 2025-01-06. [Online]. Available: https://pola.rs/.
[36] R. Vink, S. de Gooijer, A. Beedie, et al.,Pola-rs/polars: Python polars 1.19.0, version py-
1.19.0, Jan. 2025. doi:10.5281/zenodo.14597402. [Online]. Available: https://doi.org/
10.5281/zenodo.14597402.
[37] Unified engine for data engineering, analytics & ml/ai, Accessed: 2025-01-06. [Online]. Avail-
able: https://www.getdaft.io/.
[38] What is parquet? Accessed: 2025-01-06. [Online]. Available: https://www.databricks.com/
glossary/what-is-parquet.
[39] Apache arrow overview, Accessed: 2025-01-06. [Online]. Available: https://arrow.apache.
org/overview/.
[40] Eager evaluation, Accessed: 2025-01-11. [Online]. Available: https : / /www.dremio.com /
wiki/eager-evaluation/.
[41] The history of databases, Accessed: 2025-01-06. [Online]. Available: https://www.thinkautomation.
com/histories/the-history-of-databases.
[42] K. Ayushman, “Faster reading with duckdb and arrow flight on hopsworks: Benchmark
and performance evaluation of offline feature stores,” M.S. thesis, KTH Royal Institute of
Technology, 2023. [Online]. Available: https://urn.kb.se/resolve?urn=urn:nbn:se:
kth:diva-337297.
[43] R. Atwal, P. Boncz, R. Boyd, et al., “Motherduck: Duckdb in the cloud and in the client,”
[Online]. Available: https://www.cidrdb.org/cidr2024/papers/p46-atwal.pdf.
[44] L. Kuiper, M. Raasveldt, and H. M¨uhleisen, “Efficient external sorting in duckdb,” [Online].
Available: https://ceur-ws.org/Vol-3163/BICOD21_paper_9.pdf.
43
Mathias Asplin Big data analytics and processing
[45] A. van Renen, D. Horn, P. Pfeil, et al., “Why tpc is not enough: An analysis of the amazon
redshift fleet,” in VLDB 2024, 2024. [Online]. Available: https://www.amazon.science/
publications / why - tpc - is - not - enough - an - analysis - of - the - amazon - redshift -
fleet.
[46] F. R. Calkoen, A. P. Luijendijk, K. Vos, E. Kras, and F. Baart, “Enabling coastal analytics
at planetary scale,” Environmental Modelling & Software, vol. 183, p. 106 257, 2025, issn:
1364-8152. doi:https://doi.org/10.1016/j.envsoft.2024.106257. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S1364815224003189.
[47] M. Bakken and A. Soylu, “Chrontext: Portable sparql queries over contextualised time series
data in industrial settings,” Expert Systems with Applications, vol. 226, p. 120 149, 2023, issn:
0957-4174. doi:https://doi.org/10.1016/j.eswa.2023.120149. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0957417423006516.
[48] J. Wang, C. Xu, J. Zhang, and R. Zhong, “Big data analytics for intelligent manufacturing
systems: A review,” Journal of Manufacturing Systems, vol. 62, pp. 738–752, 2022, issn:
0278-6125. doi:https://doi.org/10.1016/j.jmsy.2021.03.005. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0278612521000601.
[49] Ibis, Accessed: 2024-12-27. [Online]. Available: https://ibis-project.org/.
[50] Narwhals, Accessed: 2024-12-27. [Online]. Available: https : / / github . com / narwhals -
dev/narwhals.
[51] mcfurland, 10+ m. beatport tracks / spotify audio features, 2023. [Online]. Available: https:
//www.kaggle.com/ datasets / mcfurland / 10- m- beatport- tracks- spotify- audio-
features.
[52] Ritchie vink - polars 1.0 and beyond pydata amsterdam 2024, Accessed: 2025-01-11. [On-
line]. Available: https://www.youtube.com/watch?v=yYAVrVMGaMY.
[53] Release notes, Accessed: 2025-01-11. [Online]. Available: https:// pandas .pydata.org /
docs/whatsnew/index.html.
[54] Avro vs. parquet, Accessed: 2025-01-11. [Online]. Available: https://www.snowflake.com/
trending/avro-vs-parquet/.
[55] Choosing the right big data file format: Avro vs. parquet vs. orc, Accessed: 2025-01-11. [On-
line]. Available: https://blog.det.life/choosing-the-right-big-data-file-format-
avro-vs-parquet-vs-orc-c868ffbe5a4e.
[56] Nimble and lance: The parquet killers, Accessed: 2025-01-11. [Online]. Available: https :
//materializedview.io/p/nimble-and-lance-parquet-killers.
[57] Apache spark faq, Accessed: 2025-01-11. [Online]. Available: https://spark.apache.org/
faq.html.
[58] Redshift files: The hunt for big data, Accessed: 2025-01-12. [Online]. Available: https://
motherduck.com/blog/redshift-files-hunt-for-big-data/.
[59] Big data is dead, Accessed: 2025-01-12. [Online]. Available: https : / / motherduck . com /
blog/big-data-is-dead/.
[60] Cached takes: 80% of companies do not need snowflake or databricks, Accessed: 2025-01-12.
[Online]. Available: https://kjhealey.medium.com/cached- takes-80- of-companies-
do-not-need-snowflake-or-databricks-5ebda64c0853.
[61] Course syllabus - machine learning with big data, Accessed: 2025-01-12. [Online]. Available:
https://www.mdu.se/en/malardalen- university/education/course- syllabus?id=
29200.
[62] Course syllabus - smart digital platforms: Cloud computing, security and big data, Accessed:
2025-01-12. [Online]. Available: https : / / www . mdu . se / en / malardalen - university /
education/course-syllabus?id=30017.
[63] Data engineering i, Accessed: 2025-01-12. [Online]. Available: https ://www. uu .se/en/
study/syllabus?query=47724.
44
Mathias Asplin Big data analytics and processing
[64] Id2221 data-intensive computing 7.5 credits, Accessed: 2025-01-12. [Online]. Available: https:
//www.kth.se/student/kurser/kurs/ID2221?l=en.
A Appendex Python Imports
import time
import duckdb
import daft as ft
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
import dask.dataframe as dd
from datetime import datetime
Listing 1: Python imports
45
Mathias Asplin Big data analytics and processing
B Appendix Pandas Code
engine = [python’, c , pyarrow’ ][0]
dtype backend = [’numpy nullable’, ’pyarrow’][0]
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
df track = pd.read csv(
path to sp track, engine=engine, dtype backend=dtype backend
).drop(’updated on’, axis=1)
df artist track = pd.read csv(
path to sp artist track , engine=engine, dtype backend=dtype backend
).drop(’updated on’, axis=1)
df artist = pd.read csv(
path to sp artist , engine=engine, dtype backend=dtype backend
).drop(’updated on’, axis=1)
df release = pd.read csv(
path to sp release , engine=engine, dtype backend=dtype backend
). drop([ release img , updated on’], axis=1)
df audio features = pd.read csv(
path to audio features , engine=engine, dtype backend=dtype backend
).drop([duration ms’, updated on’], axis=1)
Listing 2: Pandas Data Ingestion Read Code
# Merge/Join the columns to create one large table
df = df track.merge( df artist track , on=’track id, how=’left’)
df = df.merge(df artist , on=’ artist id , how=’left’)
df = df.merge(df release , on=’release id , how=’left’)
df = df.merge(df audio features, on=’isrc , how=’left’)
# Drop last unnecessary columns
df = df.drop(
[ isrc , track number’, release id ,
disc number’, preview url , artist id , upc’ ], axis=1
)
Listing 3: Pandas Data Ingestion Merge/Join Code
# Turn miliseconds into seconds
df[ duration’] = df.duration ms 10∗∗−3
df = df.drop(’duration ms’, axis=1)
Listing 4: Pandas Data Wrangling Duration Code
# Turn string into boolean
df[ explicit ] = df. explicit == ’t’
Listing 5: Pandas Data Wrangling Boolean Code
# Fill dates and turn to datetime
df[ release date ] = df. release date .apply(
lambda x: x + 0101 00:00:00’ if x.count(’’) == 0 else
x + 01 00:00:01’ if x.count(’’) == 1 else
x + 00:00:02
)
df[ release date ] = pd.to datetime(
46
Mathias Asplin Big data analytics and processing
df. release date , format=’mixed’, errors=’coerce’
)
Listing 6: Pandas Data Wrangling Datetime Code
# Find all artist names for tracks
# and combine them then join it with the original dataframe
df = (
df .merge(df.groupby(’track id )[ artist name ]
.apply(list ). reset index (name=’artist name all’), on=’track id’)
)
df[ artist name all ] = df. artist name all . str . join( , )
Listing 7: Pandas Data Wrangling Artists Code
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/pandas.csv’
path pyarrow = ’path/to/input/pandas pa.parquet’
Listing 8: Pandas Data I/O Paths Code
# Write to CSV
df . to csv(path csv, sep=’; )
Listing 9: Pandas Data I/O Write CSV Code
# Write to parquet (pyarrow)
df.to parquet(path pyarrow, engine=’pyarrow’)
Listing 10: Pandas Data I/O Write Parquet Code
# Read from csv
df = pd.read csv(path csv, sep=’; , engine=’c’)
Listing 11: Pandas Data I/O Read CSV Code
# Read from parquet (pyarrow)
df = pd.read parquet(path pyarrow, engine=’pyarrow’)
Listing 12: Pandas Data I/O Read Parquet Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
df[(˜df. artist name all . str .contains(( | ). join ( artists ), case=True, na=False)) &
(df. release date >= datetime(2010, 1, 1)) &
(df. release date <datetime(2020, 1, 1)) &
(˜df . explicit ) &
(df.duration >= 360) &
((df. danceability >= 0.8) |(df .energy >= 0.8)) &
(df.tempo >= df.tempo.mean()) &
(df. liveness <= 0.2)
]. sort values ( popularity , ascending=False).head(10)
Listing 13: Pandas Data Analytics/Exploration Task 1 Code
# Find the artists who has made the most tracks
(
df.groupby(’artist name’). size (). reset index (name=’len’)
47
Mathias Asplin Big data analytics and processing
. sort values ( len , ascending=False)
)
Listing 14: Pandas Data Analytics/Exploration Task 2 Code
# Find the artist that has made the most popular songs on average
(
df.groupby(’artist name’)[ popularity ]. mean().reset index(name=’most popular’)
. sort values ( most popular’, ascending=False)
)
Listing 15: Pandas Data Analytics/Exploration Task 3 Code
# Find the artist who on average makes the most danceable tracks
(
df.groupby(’artist name’)[ danceability ]. mean().reset index(name=’best dance’)
. sort values ( best dance’, ascending=False)
)
Listing 16: Pandas Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
(
df.groupby(’artist name’)[ duration’ ]. mean().reset index(name=’long song’)
. sort values ( long song , ascending=False)
)
Listing 17: Pandas Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
(
df.groupby(’artist name’)[ valence ]. mean().reset index(name=’happy song’)
. sort values ( happy song’, ascending=False)
)
Listing 18: Pandas Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
(
df .groupby(’artist name’ )[ label name’ ]. unique().apply(len)
. reset index (name=’collaborations’).sort values ( collaborations , ascending=False)
)
Listing 19: Pandas Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
(
df[ df. explicit ]. groupby(’artist name’). size ()
. reset index (name=’profanity’).sort values( profanity , ascending=False)
)
Listing 20: Pandas Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
(
df [ year ] = df. release date . dt.year
df.groupby(’year’)[ popularity ]. mean().reset index(name=’most popular year’)
. sort values ( most popular year’, ascending=False)
)
Listing 21: Pandas Data Analytics/Exploration Task 9 Code
48
Mathias Asplin Big data analytics and processing
# Find the artist with the most popular tracks on average per year
df [ year ] = df. release date . dt.year
(
df.groupby([’year , artist name ])[ popularity ]
.mean()
. reset index(name=’most popular artist year’)
. sort values ( most popular artist year , ascending=False)
.drop duplicates(year , keep=’first )
. sort values ( most popular artist year , ascending=False)
)
Listing 22: Pandas Data Analytics/Exploration Task 10 Code
# Find the artist with the most track credits
df temp = (
df.merge(df.drop duplicates( track id )[ artist name all ]
.str .count(’ , )
. reset index (name = artist credits ),
left on =df.index, right on=’index’
)[[ track title , artist credits ]]
)
df temp[ artist credits ] = df temp. artist credits + 1
df temp.sort values ( artist credits , ascending=False).head(10)
Listing 23: Pandas Data Analytics/Exploration Task 11 Code
# Find the artist most active throughout the years
(
df.groupby(’artist name’)[ release date ]. apply(lambda x: x.max() x.min())
. reset index (name = ’longest date’). sort values ( longest date , ascending=False)
)
Listing 24: Pandas Data Analytics/Exploration Task 12 Code
# Find the album type distribution
df temp = df.groupby(’album type’).size(). reset index (name = ’percent’)
df temp[percent ] = (df temp[’percent’] / df temp[’percent’ ]. sum()) 100
df temp
Listing 25: Pandas Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
df.groupby(’album type’)[’popularity’ ]. mean().reset index(name = ’popularity’)
Listing 26: Pandas Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
(
df[
[ danceability , energy’ , liveness , speechiness ,
tempo’, valence , duration]
]. mean().reset index(name = ’mean’)
)
Listing 27: Pandas Data Analytics/Exploration Task 15 Code
# Find the median of different measurements
(
df[
[ danceability , energy’ , liveness , speechiness ,
tempo’, valence , duration]
]. median().reset index(name = ’median’)
)
49
Mathias Asplin Big data analytics and processing
Listing 28: Pandas Data Analytics/Exploration Task 16 Code
# Find the standard deviation of different measurements
(
df[
[ danceability , energy’ , liveness , speechiness ,
tempo’, valence , duration]
]. std (). reset index (name = ’std’)
)
Listing 29: Pandas Data Analytics/Exploration Task 17 Code
# Find out what makes a track popular
(
df.groupby(’popularity’)[
[ danceability , energy’ , liveness , speechiness ,
tempo’, valence , duration]
]. mean()
. reset index (names = ’popularity’).sort values (popularity , ascending=False)
)
Listing 30: Pandas Data Analytics/Exploration Task 18 Code
50
Mathias Asplin Big data analytics and processing
C Appendix Polars In Memory Code
engine = [python’, c , pyarrow’ ][0]
dtype backend = [’numpy nullable’, ’pyarrow’][0]
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/ sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
df track = pl.read csv(path to sp track ). drop(’updated on’)
df artist track = pl.read csv( path to sp artist track ).drop(’updated on’)
df artist = pl.read csv( path to sp artist ).drop(’updated on’)
df release = pl.read csv( path to sp release ). drop([ release img , updated on’])
df audio features = pl.read csv(path to audio features ).drop(
[ duration ms’, updated on’]
)
Listing 31: Polars In Memory Data Ingestion Read Code
# Merge/Join the columns to create one large table
df = df track. join ( df artist track , on=’track id, how=’left’)
df = df.join( df artist , on=’ artist id , how=’left’)
df = df. join( df release , on=’release id , how=’left’)
df = df.join( df audio features , on=’isrc , how=’left’)
# Drop last unnecessary columns
df = df.drop(
[ isrc , track number’, release id , disc number’,
preview url , artist id , upc’]
)
Listing 32: Polars In Memory Data Ingestion Merge/Join Code
# Turn miliseconds into seconds
df = df.with columns(
duration = pl.col( duration ms’) 10∗∗−3
).drop(’duration ms’)
Listing 33: Polars In Memory Data Wrangling Duration Code
# Turn string into boolean
df = df.with columns(
explicit = pl.col( explicit ). replace strict ({ t : True, f : False})
)
Listing 34: Polars In Memory Data Wrangling Boolean Code
# Fill dates and turn to datetime
df = df.with columns(
release date = pl.when(
pl . col( release date ). str .count matches(’’) == 0
).then(
pl . col( release date ) + 0101 00:00:00’
).otherwise(
pl .when(
pl . col( release date ). str .count matches(’’) == 1
).then(
pl . col( release date ) + 01 00:00:01’
).otherwise(
pl . col( release date ) + 00:00:02
)
). str .to datetime(%Y%m%d %H:%M:%S’)
)
51
Mathias Asplin Big data analytics and processing
Listing 35: Polars In Memory Data Wrangling Datetime Code
# Find all artist names for tracks
# and combine them then join it with the original dataframe
df = df.group by(’track id ). agg(
artist name all = pl.col (artist name)
).with columns(
artist name all = pl.col( artist name all ). list . join(separator=’, )
). join(
df, on=’track id’
)
Listing 36: Polars In Memory Data Wrangling Artists Code
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/PolarsInMemory.csv’
path parquet = ’path/to/input/PolarsInMemory.parquet’
path pyarrow = ’path/to/input/PolarsInMemory pa.parquet’
Listing 37: Polars In Memory Data I/O Paths Code
# Write to csv
df. write csv(path csv, separator=’; )
Listing 38: Polars In Memory Data I/O Write CSV Code
# Write to parquet (rust)
df.write parquet(path parquet, compression=’snappy’)
Listing 39: Polars In Memory Data I/O Write Parquet (Rust) Code
# Write to parquet (pyarrow)
df.write parquet(path pyarrow, compression=’snappy’, use pyarrow=True)
Listing 40: Polars In Memory Data I/O Write Parquet (PyArrow) Code
# Read from csv
df = pl.read csv(path csv, separator=’; )
Listing 41: Polars In Memory Data I/O Read CSV Code
# Read from parquet (rust)
df = pl.read parquet(path parquet)
Listing 42: Polars In Memory Data I/O Read Parquet (Rust) Code
# Read from parquet (pyarrow)
df = pl.read parquet(path pyarrow, use pyarrow=True)
Listing 43: Polars In Memory Data I/O Read Parquet (PyArrow) Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
df. filter (
pl. col( artist name all ). str .contains any( artists )) &
52
Mathias Asplin Big data analytics and processing
(pl . col( release date ) >= datetime(2010, 1, 1)) &
(pl . col( release date ) <datetime(2020, 1, 1)) &
pl. col( explicit )) &
(pl . col( duration) >= 360) &
((pl . col( danceability ) >= 0.8) |( pl . col ( energy’) >= 0.8)) &
(pl . col( tempo’) >= pl.col(’tempo’).mean()) &
(pl . col( liveness ) <= 0.2)
). sort(popularity , descending=True).head(10)
Listing 44: Polars In Memory Data Analytics/Exploration Task 1 Code
# Find the artists who has made the most tracks
df .group by(’artist name’ ). len (). sort ( len , descending=True)
Listing 45: Polars In Memory Data Analytics/Exploration Task 2 Code
# Find the artist who on average makes the most danceable tracks
df .group by(’artist name’ ). agg(
best dance = pl.col( danceability ). mean()
). sort( best dance’, descending=True)
Listing 46: Polars In Memory Data Analytics/Exploration Task 3 Code
# Find the artist that makes the longest tracks on average
df .group by(’artist name’ ). agg(
long song = pl.col (duration ). mean()
). sort( long song , descending=True)
Listing 47: Polars In Memory Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
df .group by(’artist name’ ). agg(
long song = pl.col (duration ). mean()
). sort( long song , descending=True)
Listing 48: Polars In Memory Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
df .group by(’artist name’ ). agg(
happy song = pl.col(’valence ). mean()
). sort( happy song’, descending=True)
Listing 49: Polars In Memory Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
df .group by(’artist name’ ). agg(
collaborations = pl.col(label name’).unique().len ()
). sort( collaborations , descending=True)
Listing 50: Polars In Memory Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
(
df. filter ( pl . col ( explicit )). group by(’artist name’)
.len (). sort ( len , descending=True)
)
Listing 51: Polars In Memory Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
df.with columns(
year = pl.col ( release date ).dt.year()
53
Mathias Asplin Big data analytics and processing
). group by(’year’ ). agg(
most popular year = pl.col(popularity ). mean()
). sort(
most popular year’, descending=True
)
Listing 52: Polars In Memory Data Analytics/Exploration Task 9 Code
# Find the artist with the most popular tracks on average per year
df.with columns(
year = pl.col ( release date ).dt.year()
).group by([’year , artist name ]). agg(
most popular artist year = pl.col( popularity ).mean()
). sort(
most popular artist year , descending=True
).unique(
year , keep=’first
). sort ( most popular artist year , descending=True)
Listing 53: Polars In Memory Data Analytics/Exploration Task 10 Code
# Find the artist with the most track credits
df.unique(’track id ). select (
pl . col ( track title ),
artist credits = pl.col( artist name all ). str .count matches(’,) + 1
). sort( artist credits , descending=True).head(10)
Listing 54: Polars In Memory Data Analytics/Exploration Task 11 Code
# Find the artist most active throughout the years
df .group by(’artist name’ ). agg(
pl . col( release date ). max() pl.col( release date ). min()
). sort( release date , descending=True)
Listing 55: Polars In Memory Data Analytics/Exploration Task 12 Code
# Find the album type distribution
df.group by(’album type’).len().with columns(
(pl . col( len ) / pl .sum(’len )). alias ( percent’)
)
Listing 56: Polars In Memory Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
df.group by(’album type’).agg(pl.col( popularity ). mean())
Listing 57: Polars In Memory Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
df. select (
pl .mean(’danceability’, energy’, liveness ,
speechiness , tempo’, valence , duration)
)
Listing 58: Polars In Memory Data Analytics/Exploration Task 15 Code
# Find the median of different measurements
df. select (
pl .median(’danceability’ , energy’, liveness ,
speechiness , tempo’, valence , duration)
)
Listing 59: Polars In Memory Data Analytics/Exploration Task 16 Code
54
Mathias Asplin Big data analytics and processing
# Find the standard deviation of different measurements
df. select (
pl .std( danceability ),
pl . std( energy’ ),
pl . std( liveness ),
pl . std( speechiness ),
pl . std( tempo’),
pl . std( valence ),
pl . std( duration’)
)
Listing 60: Polars In Memory Data Analytics/Exploration Task 17 Code
# Find out what makes a track popular
df.group by(’popularity’ ). agg(
pl . col( danceability ). mean(),
pl . col ( energy ). mean(),
pl . col ( liveness ). mean(),
pl . col ( speechiness ). mean(),
pl . col ( tempo’).mean(),
pl . col ( valence ). mean(),
pl . col ( duration ). mean()
). sort(popularity , descending=True)
Listing 61: Polars In Memory Data Analytics/Exploration Task 18 Code
55
Mathias Asplin Big data analytics and processing
D Appendix Polars Out Of Core Code
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/ sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
df track = pl.scan csv(path to sp track ).drop(’updated on’)
df artist track = pl.scan csv( path to sp artist track ). drop(’updated on’)
df artist = pl.scan csv( path to sp artist ).drop(’updated on’)
df release = pl.scan csv( path to sp release ). drop([ release img , updated on’])
df audio features = pl.scan csv(path to audio features ). drop(
[ duration ms’, updated on’]
)
# Merge/Join the columns to create one large table
df = df track. join ( df artist track , on=’track id, how=’left’)
df = df.join( df artist , on=’ artist id , how=’left’)
df = df. join( df release , on=’release id , how=’left’)
df = df.join( df audio features , on=’isrc , how=’left’)
# Drop last unnecessary columns
df = df.drop(
[ isrc , track number’, release id , disc number’,
preview url , artist id , upc’]
). collect ()
Listing 62: Polars Out Of Core Data Ingestion All Code
# Turn miliseconds into seconds
df = df.with columns(
duration = pl.col( duration ms’) 10∗∗−3
).drop(’duration ms’)
# Turn string into boolean
df = df.with columns(
explicit = pl.col( explicit ). replace strict ({ t : True, f : False})
)
# Fill dates and turn to datetime
df = df.with columns(
release date = pl.when(
pl . col( release date ). str .count matches(’’) == 0
).then(
pl . col( release date ) + 0101 00:00:00’
).otherwise(
pl .when(
pl . col( release date ). str .count matches(’’) == 1
).then(
pl . col( release date ) + 01 00:00:01’
).otherwise(
pl . col( release date ) + 00:00:02
)
). str .to datetime(%Y%m%d %H:%M:%S’)
)
# Find all artist names for tracks
and combine them then join it with the original dataframe
df = df.group by(’track id ). agg(
artist name all = pl.col (artist name)
).with columns(
artist name all = pl.col( artist name all ). list . join(separator=’, )
). join(
df, on=’track id’
). collect ()
Listing 63: Polars Out Of Core Data Wrangling All Code
56
Mathias Asplin Big data analytics and processing
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/PolarsOutOfCore.csv’
path pyarrow = ’path/to/input/PolarsOutOfCore.parquet’
Listing 64: Polars Out Of Core Data I/O Paths Code
# Write to csv
df. sink csv(path csv, separator=’;)
Listing 65: Polars Out Of Core Data I/O Write CSV Code
# Write to parquet (rust)
df.sink parquet(path parquet)
Listing 66: Polars Out Of Core Data I/O Write Parquet Code
# Lazy read from csv
df = pl.scan csv(path csv, separator=’; ). collect ()
Listing 67: Polars Out Of Core Data I/O Read CSV Code
# Lazy read from parquet
df = pl.scan parquet(path parquet).collect ()
Listing 68: Polars Out Of Core Data I/O Read Parquet Code
# Lazy read from parquet
path to parquet = ’path/to/parquet/.parquet’
df = pl.scan parquet(path to parquet)
Listing 69: Polars Out Of Core Data Analytics/Exploration Lazy Read Parquet Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
df. filter (
pl. col( artist name all ). str .contains any( artists )) &
(pl . col( release date ) >= datetime(2010, 1, 1)) &
(pl . col( release date ) <datetime(2020, 1, 1)) &
pl. col( explicit )) &
(pl . col( duration) >= 360) &
((pl . col( danceability ) >= 0.8) |( pl . col ( energy’) >= 0.8)) &
(pl . col( tempo’) >= pl.col(’tempo’).mean()) &
(pl . col( liveness ) <= 0.2)
). sort(popularity , descending=True).head(10).collect()
Listing 70: Polars Out Of Core Data Analytics/Exploration Task 1 Code
# Find the artists who has made the most tracks
df .group by(’artist name’ ). len (). sort ( len , descending=True).collect()
Listing 71: Polars Out Of Core Data Analytics/Exploration Task 2 Code
# Find the artist that has made the most popular songs on average
df .group by(’artist name’ ). agg(
most popular = pl.col(popularity ). mean()
). sort( most popular’, descending=True).collect()
Listing 72: Polars Out Of Core Data Analytics/Exploration Task 3 Code
57
Mathias Asplin Big data analytics and processing
# Find the artist who on average makes the most danceable tracks
df .group by(’artist name’ ). agg(
best dance = pl.col( danceability ). mean()
). sort( best dance’, descending=True).collect()
Listing 73: Polars Out Of Core Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
df .group by(’artist name’ ). agg(
long song = pl.col (duration ). mean()
). sort( long song , descending=True).collect()
Listing 74: Polars Out Of Core Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
df .group by(’artist name’ ). agg(
happy song = pl.col(’valence ). mean()
). sort( happy song’, descending=True).collect()
Listing 75: Polars Out Of Core Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
df .group by(’artist name’ ). agg(
collaborations = pl.col(label name’).unique().len ()
). sort( collaborations , descending=True).collect()
Listing 76: Polars Out Of Core Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
(
df. filter ( pl . col ( explicit )). group by(’artist name’)
.len (). sort ( len , descending=True).collect()
)
Listing 77: Polars Out Of Core Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
df.with columns(
year = pl.col ( release date ).dt.year()
). group by(’year’ ). agg(
most popular year = pl.col(popularity ). mean()
). sort(
most popular year’, descending=True
). collect ()
Listing 78: Polars Out Of Core Data Analytics/Exploration Task 9 Code
# Find the artist with the most popular tracks on average per year
df.with columns(
year = pl.col ( release date ).dt.year()
).group by([’year , artist name ]). agg(
most popular artist year = pl.col( popularity ).mean()
). sort(
most popular artist year , descending=True
).unique(
year , keep=’first
). sort ( most popular artist year , descending=True).collect()
Listing 79: Polars Out Of Core Data Analytics/Exploration Task 10 Code
# Find the artist with the most track credits
df.unique(’track id ). select (
58
Mathias Asplin Big data analytics and processing
pl . col ( track title ),
artist credits = pl.col( artist name all ). str .count matches(’,) + 1
). sort( artist credits , descending=True).head(10).collect()
Listing 80: Polars Out Of Core Data Analytics/Exploration Task 11 Code
# Find the artist most active throughout the years
df .group by(’artist name’ ). agg(
pl . col( release date ). max() pl.col( release date ). min()
). sort( release date , descending=True).collect()
Listing 81: Polars Out Of Core Data Analytics/Exploration Task 12 Code
# Find the album type distribution
(
df.group by(’album type’).len().with columns(
(pl . col( len ) / pl .sum(’len )). alias ( percent’)
). collect ()
)
Listing 82: Polars Out Of Core Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
df.group by(’album type’).agg(pl.col( popularity ). mean()).collect ()
Listing 83: Polars Out Of Core Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
df. select (
pl .mean(’danceability’, energy’, liveness ,
speechiness , tempo’, valence , duration)
). collect ()
Listing 84: Polars Out Of Core Data Analytics/Exploration Task 15 Code
# Find the median of different measurements
df. select (
pl .median(’danceability’ , energy’, liveness ,
speechiness , tempo’, valence , duration)
). collect ()
Listing 85: Polars Out Of Core Data Analytics/Exploration Task 16 Code
# Find the standard deviation of different measurements
df. select (
pl .std( danceability ),
pl . std( energy’ ),
pl . std( liveness ),
pl . std( speechiness ),
pl . std( tempo’),
pl . std( valence ),
pl . std( duration’)
). collect ()
Listing 86: Polars Out Of Core Data Analytics/Exploration Task 17 Code
# Find out what makes a track popular
df.group by(’popularity’ ). agg(
pl . col( danceability ). mean(),
pl . col ( energy ). mean(),
pl . col ( liveness ). mean(),
pl . col ( speechiness ). mean(),
pl . col ( tempo’).mean(),
59
Mathias Asplin Big data analytics and processing
pl . col ( valence ). mean(),
pl . col ( duration ). mean()
). sort(popularity , descending=True).collect()
Listing 87: Polars Out Of Core Data Analytics/Exploration Task 18 Code
60
Mathias Asplin Big data analytics and processing
E Appendix Duckdb Code
# Connect to the database
con = duckdb.connect()
Listing 88: Duckdb Data Ingestion Read Code
engine = [python’, c , pyarrow’ ][0]
dtype backend = [’numpy nullable’, ’pyarrow’][0]
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/ sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
con.sql (f”””
CREATE TABLE track AS
SELECT EXCLUDE (updated on) FROM read csv(’{path to sp track}’);
CREATE TABLE artist track AS
SELECT EXCLUDE (updated on) FROM read csv(’{path to sp artist track}’);
CREATE TABLE artist AS
SELECT EXCLUDE (updated on) FROM read csv(’{path to sp artist}’);
CREATE TABLE release AS
SELECT EXCLUDE (release img, updated on) FROM
read csv(’{path to sp release }’);
CREATE TABLE audio features AS
SELECT EXCLUDE (duration ms, updated on) FROM
read csv(’{path to audio features }’);
”””)
Listing 89: Duckdb Data Ingestion Read Code
# Merge/Join the columns to create one large table
con.sql (f”””
CREATE TABLE df AS
(
SELECT FROM track
LEFT JOIN artist track ON track.track id = artist track.track id
LEFT JOIN artist ON artist track.artist id = artist . artist id
LEFT JOIN release ON track.release id = release.release id
LEFT JOIN audio features ON track.isrc = audio features.isrc
);
ALTER TABLE df DROP COLUMN track id 1;
ALTER TABLE df DROP COLUMN artist id 1;
ALTER TABLE df DROP COLUMN release id 1;
ALTER TABLE df DROP COLUMN isrc 1;
ALTER TABLE df DROP COLUMN isrc;
ALTER TABLE df DROP COLUMN track number;
ALTER TABLE df DROP COLUMN release id;
ALTER TABLE df DROP COLUMN disc number;
ALTER TABLE df DROP COLUMN preview url;
ALTER TABLE df DROP COLUMN artist id;
ALTER TABLE df DROP COLUMN upc;
”””)
Listing 90: Duckdb Data Ingestion Merge/Join Code
# Turn miliseconds into seconds
con.sql (f”””
ALTER TABLE df RENAME duration ms TO duration;
ALTER TABLE df ALTER duration TYPE DOUBLE;
UPDATE df SET duration = duration 0.001;
”””)
Listing 91: Duckdb Data Wrangling Duration Code
61
Mathias Asplin Big data analytics and processing
# Turn string into boolean
# Already interpreted as boolean
Listing 92: Duckdb Data Wrangling Boolean Code
# Fill dates and turn to datetime
con.sql (f”””
CREATE OR REPLACE TABLE df AS
(SELECT , (LENGTH(release date)
LENGTH(replace(release date, ’, ’’)))/LENGTH(’’) AS
Count FROM df);
CREATE OR REPLACE TABLE df AS
(SELECT , CASE WHEN Count = 0 THEN concat(release date, 0101 00:00:00’)
WHEN Count = 1 THEN concat(release date, 01 00:00:01’) ELSE
concat(release date , 00:00:02’) END AS release date new FROM df);
ALTER TABLE df ALTER release date new TYPE DATETIME;
ALTER TABLE df DROP COLUMN release date;
ALTER TABLE df DROP COLUMN Count;
ALTER TABLE df RENAME release date new TO release date;
”””)
Listing 93: Duckdb Data Wrangling Datetime Code
# Find all artist names for tracks
# and combine them then join it with the original dataframe
con.sql (f”””
CREATE OR REPLACE TEMP VIEW temp AS
SELECT track id, list string agg( list (artist name)) AS
artist name all FROM df
GROUP BY track id;
CREATE OR REPLACE TABLE df AS
SELECT FROM df
LEFT JOIN
temp ON df.track id = temp.track id;
ALTER TABLE df DROP COLUMN track id 1;
DROP VIEW temp
”””)
Listing 94: Duckdb Data Wrangling Artists Code
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/Duckdb.csv’
path pyarrow = ’path/to/input/Duckdb pa.parquet’
Listing 95: Duckdb Data I/O Paths Code
# Write to CSV
con.sql (f”””
COPY df TO {path csv} (HEADER, DELIMITER ’;’)
”””)
Listing 96: Duckdb Data I/O Write CSV Code
# Write to parquet (pyarrow)
con.sql (f”””
COPY df TO {path pyarrow} (COMPRESSION ’snappy’)
”””)
Listing 97: Duckdb Data I/O Write Parquet Code
# Read from csv
con.sql (f”””
CREATE OR REPLACE TABLE df AS
62
Mathias Asplin Big data analytics and processing
SELECT FROM read csv(’{path csv}’, DELIM = ’;’, HEADER = true)
”””)
Listing 98: Duckdb Data I/O Read CSV Code
# Read from parquet (pyarrow)
con.sql (f”””
CREATE OR REPLACE TABLE df AS
SELECT FROM read parquet(’{path pyarrow}’)
”””)
Listing 99: Duckdb Data I/O Read Parquet Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
print(
con.sql (f”””
SELECT
FROM
df
WHERE
NOT regexp matches(artist name all, {(’|’). join ( artists )}’)
AND release date >= ’201011’
AND release date <’202011’
AND explicit
AND duration >= 360
AND (danceability >= 0.8 OR energy >= 0.8)
AND tempo >= (SELECT mean(tempo) FROM df)
AND liveness <= 0.2
ORDER BY
popularity DESC
LIMIT 10
”””).show(max width=125)
)
Listing 100: Duckdb Data Analytics/Exploration Task 1 Code
# Find the artists who has made the most tracks
print(
con.sql (f”””
SELECT
artist name, COUNT() AS Count
FROM
df
GROUP BY
artist name
ORDER BY
Count DESC
”””)
)
Listing 101: Duckdb Data Analytics/Exploration Task 2 Code
# Find the artist that has made the most popular songs on average
print(
con.sql (f”””
SELECT
artist name, mean(popularity) AS Mean
FROM
df
GROUP BY
63
Mathias Asplin Big data analytics and processing
artist name
ORDER BY
Mean DESC
”””)
)
Listing 102: Duckdb Data Analytics/Exploration Task 3 Code
# Find the artist who on average makes the most danceable tracks
print(
con.sql (f”””
SELECT
artist name, mean(danceability) AS Mean
FROM
df
GROUP BY
artist name
ORDER BY
Mean DESC
”””)
)
Listing 103: Duckdb Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
print(
con.sql (f”””
SELECT
artist name, mean(duration) AS Mean
FROM
df
GROUP BY
artist name
ORDER BY
Mean DESC
”””)
)
Listing 104: Duckdb Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
print(
con.sql (f”””
SELECT
artist name, mean(valence) AS Mean
FROM
df
GROUP BY
artist name
ORDER BY
Mean DESC
”””)
)
Listing 105: Duckdb Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
print(
con.sql (f”””
SELECT
artist name, COUNT(DISTINCT label name) AS Collaborations
FROM
df
GROUP BY
artist name
ORDER BY
64
Mathias Asplin Big data analytics and processing
Collaborations DESC
”””)
)
Listing 106: Duckdb Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
print(
con.sql (f”””
SELECT
artist name, COUNT() AS Profanity
FROM
df
WHERE
explicit
GROUP BY
artist name
ORDER BY
Profanity DESC
”””)
)
Listing 107: Duckdb Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
print(
con.sql (f”””
SELECT
date part (’year ’, release date ) AS Year, mean(popularity) AS Mean
FROM
df
GROUP BY
date part (’year ’, release date )
ORDER BY
Mean DESC
”””)
)
Listing 108: Duckdb Data Analytics/Exploration Task 9 Code
# Find the artist with the most popular tracks on average per year
print(
con.sql (f”””
SELECT
Year, artist name, Popularity
FROM
(SELECT
, ROW NUMBER() OVER (PARTITION BY Year
ORDER BY Popularity DESC) AS RN
FROM
(SELECT
date part (’year ’, release date ) AS
Year, artist name, mean(popularity) AS Popularity
FROM
df
GROUP BY
date part (’year ’, release date ), artist name
ORDER BY
Popularity DESC))
WHERE
RN = 1
ORDER BY
Popularity DESC
”””)
)
Listing 109: Duckdb Data Analytics/Exploration Task 10 Code
65
Mathias Asplin Big data analytics and processing
# Find the artist with the most track credits
print(
con.sql (f”””
SELECT
track title , (LENGTH(artist name all)
LENGTH(replace(artist name all, ’,’, ’’)))/ LENGTH(’,’) + 1 AS Count
FROM
(SELECT
, ROW NUMBER() OVER (PARTITION BY track id ORDER BY
track id DESC) AS RN
FROM
df)
WHERE
RN = 1
ORDER BY
Count DESC
LIMIT 10
”””).show(max width=125)
)
Listing 110: Duckdb Data Analytics/Exploration Task 11 Code
# Find the artist most active throughout the years
print(
con.sql (f”””
SELECT
artist name, (max(release date) min(release date)) AS Longest
FROM
df
GROUP BY
artist name
ORDER BY
Longest DESC
”””)
)
Listing 111: Duckdb Data Analytics/Exploration Task 12 Code
# Find the album type distribution
print(
con.sql (f”””
SELECT
album type, Count/(SELECT COUNT() FROM df) AS Percentage
FROM
(SELECT
album type, COUNT() AS Count
FROM
df
GROUP BY
album type)
”””)
)
Listing 112: Duckdb Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
print(
con.sql (f”””
SELECT
album type, mean(popularity) AS Popularity
FROM
df
GROUP BY
album type
”””)
)
66
Mathias Asplin Big data analytics and processing
Listing 113: Duckdb Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
print(
con.sql (f”””
SELECT
round(mean(danceability), 6) AS Danceability,
round(mean(energy), 6) AS Energy,
round(mean(liveness), 6) AS Liveness,
round(mean(speechiness), 6) AS Speechiness,
round(mean(tempo), 6) AS Tempo,
round(mean(valence), 6) AS Valence,
round(mean(duration), 6) AS Duration
FROM
df
”””)
)
Listing 114: Duckdb Data Analytics/Exploration Task 15 Code
# Find the median of different measurements
print(
con.sql (f”””
SELECT
round(median(danceability), 6) AS Danceability,
round(median(energy), 6) AS Energy,
round(median(liveness), 6) AS Liveness,
round(median(speechiness), 6) AS Speechiness,
round(median(tempo), 6) AS Tempo,
round(median(valence), 6) AS Valence,
round(median(duration), 6) AS Duration
FROM
df
”””)
)
Listing 115: Duckdb Data Analytics/Exploration Task 16 Code
# Find the standard deviation of different measurements
print(
con.sql (f”””
SELECT
round(stddev(danceability), 6) AS Danceability,
round(stddev(energy), 6) AS Energy,
round(stddev(liveness), 6) AS Liveness,
round(stddev(speechiness), 6) AS Speechiness,
round(stddev(tempo), 6) AS Tempo,
round(stddev(valence), 6) AS Valence,
round(stddev(duration), 6) AS Duration
FROM
df
”””)
)
Listing 116: Duckdb Data Analytics/Exploration Task 17 Code
# Find out what makes a track popular
print(
con.sql (f”””
SELECT
popularity,
round(mean(danceability), 6) AS Danceability,
round(mean(energy), 6) AS Energy,
round(mean(liveness), 6) AS Liveness,
round(mean(speechiness), 6) AS Speechiness,
67
Mathias Asplin Big data analytics and processing
round(mean(tempo), 6) AS Tempo,
round(mean(valence), 6) AS Valence,
round(mean(duration), 6) AS Duration
FROM
df
GROUP BY
popularity
ORDER BY
popularity DESC
”””)
)
Listing 117: Duckdb Data Analytics/Exploration Task 18 Code
68
Mathias Asplin Big data analytics and processing
F Appendix Duckdb View Code
# Connect to the database
con = duckdb.connect()
Listing 118: Duckdb Data Ingestion Read Code
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/ sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
# View does not support dropping columns
# So this is not included in the tests
Listing 119: Duckdb View Data Ingestion Read Code
# Merge/Join the columns to create one large table
# View does not support dropping columns
# So this is not included in the tests
Listing 120: Duckdb View Data Ingestion Merge/Join Code
# Turn miliseconds into seconds
# View does not support dropping columns
# So this is not included in the tests
Listing 121: Duckdb View Data Wrangling Duration Code
# Turn string into boolean
# View does not support dropping columns
# So this is not included in the tests
Listing 122: Duckdb View Data Wrangling Boolean Code
# Fill dates and turn to datetime
# View does not support dropping columns
# So this is not included in the tests
Listing 123: Duckdb View Data Wrangling Datetime Code
# Find all artist names for tracks
# and combine them then join it with the original dataframe
# View does not support dropping columns
# So this is not included in the tests
Listing 124: Duckdb View Data Wrangling Artists Code
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/DuckdbView.csv’
path pyarrow = ’path/to/input/DuckdbView pa.parquet’
Listing 125: Duckdb View Data I/O Paths Code
69
Mathias Asplin Big data analytics and processing
# Write to CSV
# View does not support dropping columns
# So this is not included in the tests
Listing 126: Duckdb View Data I/O Write CSV Code
# Write to parquet (pyarrow)
# View does not support dropping columns
# So this is not included in the tests
Listing 127: Duckdb View Data I/O Write Parquet Code
# Read from csv
# View does not support dropping columns
# So this is not included in the tests
Listing 128: Duckdb View Data I/O Read CSV Code
# Read from parquet (pyarrow)
# View does not support dropping columns
# So this is not included in the tests
Listing 129: Duckdb View Data I/O Read Parquet Code
# Lazy read from parquet
path to parquet = ’path/to/parquet/.parquet’
con.sql (f”””
CREATE OR REPLACE VIEW df AS
SELECT FROM read parquet({path to parquet})
”””)
Listing 130: Duckdb View Data Analytics/Exploration Lazy Read Parquet Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
print(
con.sql (f”””
SELECT
FROM
df
WHERE
NOT regexp matches(artist name all, {(’|’). join ( artists )}’)
AND release date >= ’201011’
AND release date <’202011’
AND explicit
AND duration >= 360
AND (danceability >= 0.8 OR energy >= 0.8)
AND tempo >= (SELECT mean(tempo) FROM df)
AND liveness <= 0.2
ORDER BY
popularity DESC
LIMIT 10
”””).show(max width=125)
)
Listing 131: Duckdb View Data Analytics/Exploration Task 1 Code
70
Mathias Asplin Big data analytics and processing
# Find the artists who has made the most tracks
print(
con.sql (f”””
SELECT
artist name, COUNT() AS Count
FROM
df
GROUP BY
artist name
ORDER BY
Count DESC
”””)
)
Listing 132: Duckdb View Data Analytics/Exploration Task 2 Code
# Find the artist that has made the most popular songs on average
print(
con.sql (f”””
SELECT
artist name, mean(popularity) AS Mean
FROM
df
GROUP BY
artist name
ORDER BY
Mean DESC
”””)
)
Listing 133: Duckdb View Data Analytics/Exploration Task 3 Code
# Find the artist who on average makes the most danceable tracks
print(
con.sql (f”””
SELECT
artist name, mean(danceability) AS Mean
FROM
df
GROUP BY
artist name
ORDER BY
Mean DESC
”””)
)
Listing 134: Duckdb View Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
print(
con.sql (f”””
SELECT
artist name, mean(duration) AS Mean
FROM
df
GROUP BY
artist name
ORDER BY
Mean DESC
”””)
)
Listing 135: Duckdb View Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
print(
71
Mathias Asplin Big data analytics and processing
con.sql (f”””
SELECT
artist name, mean(valence) AS Mean
FROM
df
GROUP BY
artist name
ORDER BY
Mean DESC
”””)
)
Listing 136: Duckdb View Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
print(
con.sql (f”””
SELECT
artist name, COUNT(DISTINCT label name) AS Collaborations
FROM
df
GROUP BY
artist name
ORDER BY
Collaborations DESC
”””)
)
Listing 137: Duckdb View Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
print(
con.sql (f”””
SELECT
artist name, COUNT() AS Profanity
FROM
df
WHERE
explicit
GROUP BY
artist name
ORDER BY
Profanity DESC
”””)
)
Listing 138: Duckdb View Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
print(
con.sql (f”””
SELECT
date part (’year ’, release date ) AS Year, mean(popularity) AS Mean
FROM
df
GROUP BY
date part (’year ’, release date )
ORDER BY
Mean DESC
”””)
)
Listing 139: Duckdb View Data Analytics/Exploration Task 9 Code
# Find the artist with the most popular tracks on average per year
df [ year ] = df. release date . dt.year
72
Mathias Asplin Big data analytics and processing
print(
con.sql (f”””
SELECT
Year, artist name, Popularity
FROM
(SELECT
, ROW NUMBER() OVER (PARTITION BY Year
ORDER BY Popularity DESC) AS RN
FROM
(SELECT
date part (’year ’, release date ) AS
Year, artist name, mean(popularity) AS Popularity
FROM
df
GROUP BY
date part (’year ’, release date ), artist name
ORDER BY
Popularity DESC))
WHERE
RN = 1
ORDER BY
Popularity DESC
”””)
)
Listing 140: Duckdb View Data Analytics/Exploration Task 10 Code
# Find the artist with the most track credits
print(
con.sql (f”””
SELECT
track title , (LENGTH(artist name all)
LENGTH(replace(artist name all, ’,’, ’’)))/ LENGTH(’,’) + 1 AS Count
FROM
(SELECT
, ROW NUMBER() OVER (PARTITION BY track id
ORDER BY track id DESC) AS RN
FROM
df)
WHERE
RN = 1
ORDER BY
Count DESC
LIMIT 10
”””).show(max width=125)
)
Listing 141: Duckdb View Data Analytics/Exploration Task 11 Code
# Find the artist most active throughout the years
print(
con.sql (f”””
SELECT
artist name, (max(release date) min(release date)) AS Longest
FROM
df
GROUP BY
artist name
ORDER BY
Longest DESC
”””)
)
Listing 142: Duckdb View Data Analytics/Exploration Task 12 Code
# Find the album type distribution
print(
73
Mathias Asplin Big data analytics and processing
con.sql (f”””
SELECT
album type, Count/(SELECT COUNT() FROM df) AS Percentage
FROM
(SELECT
album type, COUNT() AS Count
FROM
df
GROUP BY
album type)
”””)
)
Listing 143: Duckdb View Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
print(
con.sql (f”””
SELECT
album type, mean(popularity) AS Popularity
FROM
df
GROUP BY
album type
”””)
)
Listing 144: Duckdb View Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
print(
con.sql (f”””
SELECT
round(mean(danceability), 6) AS Danceability,
round(mean(energy), 6) AS Energy,
round(mean(liveness), 6) AS Liveness,
round(mean(speechiness), 6) AS Speechiness,
round(mean(tempo), 6) AS Tempo,
round(mean(valence), 6) AS Valence,
round(mean(duration), 6) AS Duration
FROM
df
”””)
)
Listing 145: Duckdb View Data Analytics/Exploration Task 15 Code
# Find the median of different measurements
print(
con.sql (f”””
SELECT
round(median(danceability), 6) AS Danceability,
round(median(energy), 6) AS Energy,
round(median(liveness), 6) AS Liveness,
round(median(speechiness), 6) AS Speechiness,
round(median(tempo), 6) AS Tempo,
round(median(valence), 6) AS Valence,
round(median(duration), 6) AS Duration
FROM
df
”””)
)
Listing 146: Duckdb View Data Analytics/Exploration Task 16 Code
# Find the standard deviation of different measurements
74
Mathias Asplin Big data analytics and processing
print(
con.sql (f”””
SELECT
round(stddev(danceability), 6) AS Danceability,
round(stddev(energy), 6) AS Energy,
round(stddev(liveness), 6) AS Liveness,
round(stddev(speechiness), 6) AS Speechiness,
round(stddev(tempo), 6) AS Tempo,
round(stddev(valence), 6) AS Valence,
round(stddev(duration), 6) AS Duration
FROM
df
”””)
)
Listing 147: Duckdb View Data Analytics/Exploration Task 17 Code
# Find out what makes a track popular
print(
con.sql (f”””
SELECT
popularity,
round(mean(danceability), 6) AS Danceability,
round(mean(energy), 6) AS Energy,
round(mean(liveness), 6) AS Liveness,
round(mean(speechiness), 6) AS Speechiness,
round(mean(tempo), 6) AS Tempo,
round(mean(valence), 6) AS Valence,
round(mean(duration), 6) AS Duration
FROM
df
GROUP BY
popularity
ORDER BY
popularity DESC
”””)
)
Listing 148: Duckdb View Data Analytics/Exploration Task 18 Code
75
Mathias Asplin Big data analytics and processing
G Appendix Daft In Memory Code
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/ sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
df track = ft.read csv(
path to sp track
). exclude(updated on’). collect ()
df artist track = ft.read csv
( path to sp artist track
). exclude(updated on’). collect ()
df artist = ft.read csv(
path to sp artist
). exclude(updated on’). collect ()
df release = ft.read csv(
path to sp release
). exclude( release img , updated on’). collect ()
df audio features = ft.read csv(
path to audio features
). exclude(duration ms’, updated on’). collect ()
Listing 149: Daft In Memory Data Ingestion Read Code
# Merge/Join the columns to create one large table
df = df track. join ( df artist track , on=’track id, how=’left’). collect ()
df = df.join( df artist , on=’ artist id , how=’left’ ). collect ()
df = df. join( df release , on=’release id , how=’left’). collect ()
df = df.join( df audio features , on=’isrc , how=’left’). collect ()
# Drop last unnecessary columns
(
df = df.exclude( isrc , track number’, release id ,
disc number’, preview url , artist id , upc’). collect ()
)
Listing 150: Daft In Memory Data Ingestion Merge/Join Code
# Turn miliseconds into seconds
df = df.with columns(
{
duration: ft . col( duration ms’) 10∗∗−3
}
).exclude(duration ms’). collect ()
Listing 151: Daft In Memory Data Wrangling Duration Code
# Turn string into boolean
df = df.with columns(
{
explicit : ft . col( explicit ) == ’t’
}
). collect ()
Listing 152: Daft In Memory Data Wrangling Boolean Code
# Fill dates and turn to datetime
df = df.with columns(
{
release date : ( ft . col( release date ). str .count matches(’’) == 0)
. if else ( ft . col( release date ) + 0101 00:00:00’,
( ft . col( release date ). str .count matches(’’) == 1)
76
Mathias Asplin Big data analytics and processing
. if else ( ft . col( release date ) + 01 00:00:01’,
ft . col( release date ) + 00:00:02 ))
}
)
df = df.with columns(
{
release date : ft . col( release date ). str .to datetime(%Y%m%d %H:%M:%S’)
}
). collect ()
Listing 153: Daft In Memory Data Wrangling Datetime Code
# Find all artist names for tracks
# and combine them then join it with the original dataframe
df = df.groupby(’track id ). agg list (
ft . col (artist name ). alias ( artist name all )
).with columns(
{
artist name all : ft . col( artist name all ). list . join( delimiter=’,)
}
). join(
df, on=’track id’
). collect ()
Listing 154: Daft In Memory Data Wrangling Artists Code
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/DaftInMemory.csv’
path pyarrow = ’path/to/input/DaftInMemory pa.parquet’
Listing 155: Daft In Memory Data I/O Paths Code
# Write to CSV
df. write csv(path csv)
Listing 156: Daft In Memory Data I/O Write CSV Code
# Write to parquet (pyarrow)
df.write parquet(path pyarrow, compression=’snappy’)
Listing 157: Daft In Memory Data I/O Write Parquet Code
# Read from csv
df = ft.read csv(path csv, delimiter=’, ). collect ()
Listing 158: Daft In Memory Data I/O Read CSV Code
# Read from parquet (pyarrow)
df = ft.read parquet(path pyarrow).collect()
Listing 159: Daft In Memory Data I/O Read Parquet Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
# Since Daft is lazy only it is only tested for the lazy version .
Listing 160: Daft In Memory Data Analytics/Exploration Task 1 Code
77
Mathias Asplin Big data analytics and processing
# Find the artists who has made the most tracks
# Since Daft is lazy only it is only tested for the lazy version .
Listing 161: Daft In Memory Data Analytics/Exploration Task 2 Code
# Find the artist that has made the most popular songs on average
# Since Daft is lazy only it is only tested for the lazy version .
Listing 162: Daft In Memory Data Analytics/Exploration Task 3 Code
# Find the artist who on average makes the most danceable tracks
# Since Daft is lazy only it is only tested for the lazy version .
Listing 163: Daft In Memory Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
# Since Daft is lazy only it is only tested for the lazy version .
Listing 164: Daft In Memory Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
# Since Daft is lazy only it is only tested for the lazy version .
Listing 165: Daft In Memory Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
# Since Daft is lazy only it is only tested for the lazy version .
Listing 166: Daft In Memory Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
# Since Daft is lazy only it is only tested for the lazy version .
Listing 167: Daft In Memory Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
# Since Daft is lazy only it is only tested for the lazy version .
Listing 168: Daft In Memory Data Analytics/Exploration Task 9 Code
# Find the artist with the most popular tracks on average per year
df [ year ] = df. release date . dt.year
# Since Daft is lazy only it is only tested for the lazy version .
Listing 169: Daft In Memory Data Analytics/Exploration Task 10 Code
# Find the artist with the most track credits
# Since Daft is lazy only it is only tested for the lazy version .
Listing 170: Daft In Memory Data Analytics/Exploration Task 11 Code
78
Mathias Asplin Big data analytics and processing
# Find the artist most active throughout the years
# Since Daft is lazy only it is only tested for the lazy version .
Listing 171: Daft In Memory Data Analytics/Exploration Task 12 Code
# Find the album type distribution
# Since Daft is lazy only it is only tested for the lazy version .
Listing 172: Daft In Memory Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
# Since Daft is lazy only it is only tested for the lazy version .
Listing 173: Daft In Memory Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
# Since Daft is lazy only it is only tested for the lazy version .
Listing 174: Daft In Memory Data Analytics/Exploration Task 15 Code
# Find the median of different measurements
# Since Daft is lazy only it is only tested for the lazy version .
Listing 175: Daft In Memory Data Analytics/Exploration Task 16 Code
# Find the standard deviation of different measurements
# Since Daft is lazy only it is only tested for the lazy version .
Listing 176: Daft In Memory Data Analytics/Exploration Task 17 Code
# Find out what makes a track popular
# Since Daft is lazy only it is only tested for the lazy version .
Listing 177: Daft In Memory Data Analytics/Exploration Task 18 Code
79
Mathias Asplin Big data analytics and processing
H Appendix Daft Out Of Core Code
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/ sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
df track = ft.read csv(
path to sp track
).exclude(updated on’)
df artist track = ft.read csv(
path to sp artist track
).exclude(updated on’)
df artist = ft.read csv(
path to sp artist
).exclude(updated on’)
df release = ft.read csv(
path to sp release
). exclude( release img , updated on’)
df audio features = ft.read csv(
path to audio features
).exclude(duration ms’, updated on’)
# Merge/Join the columns to create one large table
df = df track. join ( df artist track , on=’track id, how=’left’)
df = df.join( df artist , on=’ artist id , how=’left’)
df = df. join( df release , on=’release id , how=’left’)
df = df.join( df audio features , on=’isrc , how=’left’)
# Drop last unnecessary columns
(
df = df.exclude( isrc , track number’, release id ,
disc number’, preview url , artist id , upc’). collect ()
)
Listing 178: Daft Out Of Core Data Ingestion All Code
# Turn miliseconds into seconds
df = df.with columns(
{
duration: ft . col( duration ms’) 10∗∗−3
}
).exclude(duration ms’)
# Turn string into boolean
df = df.with columns(
{
explicit : ft . col( explicit ) == ’t’
}
)
# Fill dates and turn to datetime
df = df.with columns(
{
release date : ( ft . col( release date ). str .count matches(’’) == 0)
. if else ( ft . col( release date ) + 0101 00:00:00’,
( ft . col( release date ). str .count matches(’’) == 1)
. if else ( ft . col( release date ) + 01 00:00:01’,
ft . col( release date ) + 00:00:02 ))
}
)
df = df.with columns(
{
release date : ft . col( release date ). str .to datetime(%Y%m%d %H:%M:%S’)
}
)
80
Mathias Asplin Big data analytics and processing
# Find all artist names for tracks
# and combine them then join it with the original dataframe
df = df.groupby(’track id ). agg list (
ft . col (artist name ). alias ( artist name all )
).with columns(
{
artist name all : ft . col( artist name all ). list . join( delimiter=’,)
}
). join(
df, on=’track id’
). collect ()
Listing 179: Daft Out Of Core Data Wrangling Duration Code
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/DaftOutOfCore’
path pyarrow = ’path/to/input/DaftOutOfCore pa’
Listing 180: Daft Out Of Core Data I/O Paths Code
# Write to CSV
df. write csv(path csv)
Listing 181: Daft Out Of Core Data I/O Write CSV Code
# Write to parquet (pyarrow)
df.write parquet(path pyarrow, compression=’snappy’)
Listing 182: Daft Out Of Core Data I/O Write Parquet Code
# Read from csv
df = ft.read csv(path csv, delimiter=’, ). collect ()
Listing 183: Daft Out Of Core Data I/O Read CSV Code
# Read from parquet (pyarrow)
df = ft.read parquet(f{path pyarrow}/.parquet’).collect ()
Listing 184: Daft Out Of Core Data I/O Read Parquet Code
# Lazy read from parquet
path to parquet = ’path/to/parquet/.parquet’
df = ft.read parquet(path to parquet)
Listing 185: Daft Out Of Core Data Analytics/Exploration Lazy Read Parquet Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
df. filter (
ft . col( artist name all ). str .contains(( | ). join ( artists ))) &
( ft . col( release date ) >= datetime(2010, 1, 1)) &
( ft . col( release date ) <datetime(2020, 1, 1)) &
ft . col( explicit )) &
( ft . col ( duration’) >= 360) &
(( ft . col ( danceability ) >= 0.8) |( ft . col (energy’) >= 0.8)) &
( ft . col ( tempo’) >= df.select(’tempo’).agg(
ft . col( tempo’).mean()
). collect (). to pylist ()[0][ tempo’]) &
81
Mathias Asplin Big data analytics and processing
( ft . col ( liveness ) <= 0.2)
). sort(popularity , desc=True).limit(10). collect ()
Listing 186: Daft Out Of Core Data Analytics/Exploration Task 1 Code
# Find the artists who has made the most tracks
df.groupby(’artist name’).agg(
ft . col( artist name ). count(). alias ( count’)
). sort( count’, desc=True).collect()
Listing 187: Daft Out Of Core Data Analytics/Exploration Task 2 Code
# Find the artist that has made the most popular songs on average
df.groupby(’artist name’).agg(
ft . col( popularity ). mean().alias( most popular’)
). sort( most popular’, desc=True).collect()
Listing 188: Daft Out Of Core Data Analytics/Exploration Task 3 Code
# Find the artist who on average makes the most danceable tracks
df.groupby(’artist name’).agg(
ft . col( danceability ). mean().alias( best dance’)
). sort( best dance’, desc=True).collect()
Listing 189: Daft Out Of Core Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
df.groupby(’artist name’).agg(
ft . col (duration ). mean().alias( long song)
). sort( long song , desc=True).collect()
Listing 190: Daft Out Of Core Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
df.groupby(’artist name’).agg(
ft . col( valence ). mean().alias( happy song’)
). sort( happy song’, desc=True).collect()
Listing 191: Daft Out Of Core Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
df.groupby(’artist name’).agg(
ft . col(label name’).approx count distinct (). alias ( collaborations )
). sort( collaborations , desc=True).collect()
Listing 192: Daft Out Of Core Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
df. filter ( ft . col( explicit )). groupby(’artist name’).agg(
ft . col( artist name ). count(). alias ( length )
). sort(length , desc=True).collect()
Listing 193: Daft Out Of Core Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
df.with columns(
{
year : ft . col( release date ). dt.year()
}
).groupby(’year’). agg(
ft . col( popularity ). mean().alias( most popular year’)
). sort(
82
Mathias Asplin Big data analytics and processing
most popular year’, desc=True
). collect ()
Listing 194: Daft Out Of Core Data Analytics/Exploration Task 9 Code
# Find the artist with the most popular tracks on average per year
df [ year ] = df. release date . dt.year
df.with columns(
{
year : ft . col( release date ). dt.year()
}
).groupby(’year’, artist name ). agg(
ft . col ( popularity ). mean().alias( most popular artist year )
). sort(
[ year , most popular artist year ], [ True, True]
). collect ()
Listing 195: Daft Out Of Core Data Analytics/Exploration Task 10 Code
# Find the artist with the most track credits
# Unique column subset is not yet implemented.
# https://github.com/EventualInc/Daft/issues/3181
df. select (
ft . col ( track title ),
( ft . col ( artist name all ). str .count matches(’,) + 1). alias ( artist credits )
). sort( artist credits , desc=True).collect()
Listing 196: Daft Out Of Core Data Analytics/Exploration Task 11 Code
# Find the artist most active throughout the years
df.groupby(’artist name’).agg(
ft . col( release date ). max().alias( max’),
ft . col( release date ). min(). alias ( min’)
).with columns(
{
diff : ft . col(max’) ft.col(min’)
}
). select (artist name , diff ). sort( diff , desc=True).collect()
Listing 197: Daft Out Of Core Data Analytics/Exploration Task 12 Code
# Find the album type distribution
df.groupby(’album type’).agg(
ft . col ( album type’).count(). alias ( len )
). select (
ft . col( album type’),
ft . col ( len ) / df . select ( album type’)
.count(). collect (). to pylist ()[0][ count’]
). collect ()
Listing 198: Daft Out Of Core Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
df.groupby(’album type’).agg(ft.col(popularity ). mean()).collect ()
Listing 199: Daft Out Of Core Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
(
df.mean(’danceability’, energy’, liveness ,
speechiness , tempo’, valence , duration ). collect ()
)
Listing 200: Daft Out Of Core Data Analytics/Exploration Task 15 Code
83
Mathias Asplin Big data analytics and processing
# Find the median of different measurements
# Not implemented
Listing 201: Daft Out Of Core Data Analytics/Exploration Task 16 Code
# Find the standard deviation of different measurements
(
df.stddev(’danceability , energy’, liveness ,
speechiness , tempo’, valence , duration ). collect ()
)
Listing 202: Daft Out Of Core Data Analytics/Exploration Task 17 Code
# Find out what makes a track popular
df.groupby(’popularity’). agg(
ft . col( danceability ). mean(),
ft . col ( energy’ ). mean(),
ft . col( liveness ). mean(),
ft . col( speechiness ). mean(),
ft . col( tempo’).mean(),
ft . col( valence ). mean(),
ft . col (duration ). mean()
). sort(popularity , desc=True).collect()
Listing 203: Daft Out Of Core Data Analytics/Exploration Task 18 Code
84
Mathias Asplin Big data analytics and processing
I Appendix Dask Code
path to sp track = ’path/to/sp track.csv’
path to sp artist track = ’path/to/ sp artist track .csv
path to sp artist = ’path/to/sp artist . csv
path to sp release = ’path/to/sp release.csv
path to audio features = ’path/to/audio features.csv
# Read in all of the data and drop unnecessary columns
df track = dd.read csv(
path to sp track, engine=’pyarrow’, dtype backend=’pyarrow’
).drop(’updated on’, axis=1)
df artist track = dd.read csv(
path to sp artist track , engine=’pyarrow’, dtype backend=’pyarrow’
).drop(’updated on’, axis=1)
df artist = dd.read csv(
path to sp artist , engine=’pyarrow’, dtype backend=’pyarrow’
).drop(’updated on’, axis=1)
df release = dd.read csv(
path to sp release , engine=’pyarrow’, dtype backend=’pyarrow’
). drop([ release img , updated on’], axis=1)
df audio features = dd.read csv(
path to audio features , engine=’pyarrow’, dtype backend=’pyarrow’
).drop([duration ms’, updated on’], axis=1)
# Merge/Join the columns to create one large table
df = df track.merge( df artist track , on=’track id, how=’left’)
df = df.merge( df artist , on=’artist id , how=’left’)
df = df.merge(df release , on=’release id , how=’left’)
df = df.merge(df audio features, on=’isrc , how=’left’)
# Drop last unnecessary columns
(
df = df.drop([ isrc , track number’, release id ,
disc number’, preview url , artist id , upc’ ], axis=1)
)
# Collect as a Pandas dataframe
df = df. repartition ( npartitions=1).compute()
Listing 204: Dask Data Ingestion All Code
df = dd.from pandas(df, npartitions=8)
# Turn miliseconds into seconds
df[ duration’] = df.duration ms 10∗∗−3
df = df.drop(’duration ms’, axis=1)
# Turn string into boolean
df[ explicit ] = df. explicit == ’t’
# Fill dates and turn to datetime
df[ release date ] = df. release date .apply(
lambda x: x + 0101 00:00:00’ if x.count(’’) == 0 else
x + 01 00:00:01’ if x.count(’’) == 1 else
x + 00:00:02 , meta=(’release date’, str ))
df[ release date ] = dd.to datetime(df.release date , format=’mixed’, errors=’coerce’)
# Find all artist names for tracks
# and combine them then join it with the original dataframe
df g = df.groupby(’track id )[ artist name ]. agg( list ). reset index ()
df g = df g.rename(columns={’artist name’: ’artist name all’})
df = df.merge(df g, on=’track id)
df[ artist name all ] = df. artist name all . str . join( , )
df = df.compute()
Listing 205: Dask Data Wrangling Duration Code
85
Mathias Asplin Big data analytics and processing
base path = ’path/to/files/.parquet’
path csv = ’path/to/output/Dask’
path pyarrow = ’path/to/input/Dask pa’
Listing 206: Dask Data I/O Paths Code
# Write to CSV
df . to csv(path csv, sep=’; )
Listing 207: Dask Data I/O Write CSV Code
# Write to parquet (pyarrow)
df.to parquet(path pyarrow, engine=’pyarrow’)
Listing 208: Dask Data I/O Write Parquet Code
# Read from csv
df = dd.read csv(f{path csv}/.part’, sep=’; , assume missing=True).compute()
Listing 209: Dask Data I/O Read CSV Code
# Read from parquet (pyarrow)
df = dd.read parquet(f’{path pyarrow}/.parquet’, engine=’pyarrow’).compute()
Listing 210: Dask Data I/O Read Parquet Code
# Lazy read from parquet
path to parquet = ’path/to/parquet/.parquet’
df = dd.read parquet(f’{path pyarrow}/.parquet’, engine=’pyarrow’)
Listing 211: Dask Data Analytics/Exploration Lazy Read Parquet Code
artists = [Travis Scott , Beyonce’, Juice WRLD’, ’Tyler, The Creator’]
# Find top 10 popular tracks except the artists above with tracks
# between 2010 and 2020 that does not contain profanity,
# is more than 3 minutes long, is good at dancing to or has good energy,
# with a higher tempo than mean, that is not performed live.
df[(˜df. artist name all . str .contains(( | ). join ( artists ), case=True, na=False)) &
(df. release date >= datetime(2010, 1, 1)) &
(df. release date <datetime(2020, 1, 1)) &
(˜df . explicit ) &
(df.duration >= 360) &
((df. danceability >= 0.8) |(df .energy >= 0.8)) &
(df.tempo >= df.tempo.mean()) &
(df. liveness <= 0.2)
]. sort values ( popularity , ascending=False).head(10)
Listing 212: Dask Data Analytics/Exploration Task 1 Code
# Find the artists who has made the most tracks
df.groupby(
artist name ). size (). reset index (). rename(columns={0: ’len’}
). sort values ( len , ascending=False).compute()
Listing 213: Dask Data Analytics/Exploration Task 2 Code
# Find the artist that has made the most popular songs on average
df.groupby(’artist name’)[ popularity ]. mean().reset index (). sort values (
popularity , ascending=False
).compute()
86
Mathias Asplin Big data analytics and processing
Listing 214: Dask Data Analytics/Exploration Task 3 Code
# Find the artist who on average makes the most danceable tracks
df.groupby(’artist name’)[ danceability ]. mean().reset index (). sort values (
danceability , ascending=False
).compute()
Listing 215: Dask Data Analytics/Exploration Task 4 Code
# Find the artist that makes the longest tracks on average
df.groupby(’artist name’)[ duration’ ]. mean().reset index (). sort values (
duration, ascending=False
).compute()
Listing 216: Dask Data Analytics/Exploration Task 5 Code
# Find the artist who makes the happiest tracks on average
df.groupby(’artist name’)[ valence ]. mean().reset index (). sort values (
valence , ascending=False
).compute()
Listing 217: Dask Data Analytics/Exploration Task 6 Code
# Find the artist that collaborates with most record labels on average
df .groupby(’artist name’ )[ label name’ ]. unique().apply(
len , meta=(’label name’, ’int64)
). reset index (). sort values (label name’, ascending=False).compute()
Listing 218: Dask Data Analytics/Exploration Task 7 Code
# Find the artist that has the most tracks with profanity
df[ df. explicit ]. groupby(’artist name’). size (). reset index (). rename(
columns={0: ’profanity’}
). sort values ( profanity , ascending=False).compute()
Listing 219: Dask Data Analytics/Exploration Task 8 Code
# Find the year with most popular tracks on average
df [ year ] = df. release date . dt.year
df.groupby(’year’)[ popularity ]. mean().reset index (). sort values (
popularity , ascending=False
).compute()
Listing 220: Dask Data Analytics/Exploration Task 9 Code
# Find the artist with the most popular tracks on average per year
df [ year ] = df. release date . dt.year
(
df.groupby([’year , artist name ])[ popularity ]
.mean()
. reset index ()
. sort values ( popularity , ascending=False)
.drop duplicates(year , keep=’first )
. sort values ( popularity , ascending=False)
).compute()
Listing 221: Dask Data Analytics/Exploration Task 10 Code
# Find the artist with the most track credits
(
87
Mathias Asplin Big data analytics and processing
df m = df.drop duplicates( track id )[ artist name all ]
.str .count(’ , ). reset index ()
)
df m = df m.rename(columns={’artist name all’: artist credits })
df temp = df.reset index (). merge(df m, on=’index’)[
[ track title , artist credits ]
]
df temp[ artist credits ] = df temp. artist credits + 1
df temp.sort values ( artist credits , ascending=False).head(10)
Listing 222: Dask Data Analytics/Exploration Task 11 Code
# Find the artist most active throughout the years
df.groupby(’artist name’)[ release date ]. apply(lambda x: x.max() x.min(),
meta=(’release date’, f8 )
). reset index (). sort values ( release date , ascending=False).compute()
Listing 223: Dask Data Analytics/Exploration Task 12 Code
# Find the album type distribution
df temp = df.groupby(’album type’).size(). reset index (). rename(
columns={0: ’percentage’}
)
df temp[percentage’] = (df temp[’percentage’] / df temp[’percentage’ ]. sum()) 100
df temp.compute()
Listing 224: Dask Data Analytics/Exploration Task 13 Code
# Find the album type that is the most popular on average
df.groupby(’album type’)[’popularity’ ]. mean().reset index ().compute()
Listing 225: Dask Data Analytics/Exploration Task 14 Code
# Find the mean of different measurements
(
df[
[ danceability , energy’ , liveness ,
speechiness , tempo’, valence , duration]
]. mean().reset index().rename(columns={0: ’mean’}).compute()
)
Listing 226: Dask Data Analytics/Exploration Task 15 Code
# Find the median of different measurements
(
df[
[ danceability , energy’ , liveness ,
speechiness , tempo’, valence , duration]
]. median approximate().reset index().rename(columns={0: ’median’}).compute()
)
Listing 227: Dask Data Analytics/Exploration Task 16 Code
# Find the standard deviation of different measurements
(
df[
[ danceability , energy’ , liveness ,
speechiness , tempo’, valence , duration]
]. std (). reset index (). rename(columns={0: ’std’}).compute()
)
Listing 228: Dask Data Analytics/Exploration Task 17 Code
88
Mathias Asplin Big data analytics and processing
# Find out what makes a track popular
(
df.groupby(’popularity’)[
[ danceability , energy’ , liveness ,
speechiness , tempo’, valence , duration]
]. mean()
. reset index (). rename(columns={0: ’popularity’}
). sort values ( popularity , ascending=False)
).compute()
Listing 229: Dask Data Analytics/Exploration Task 18 Code
89
Mathias Asplin Big data analytics and processing
J Appendix Graphs Code
import time
import duckdb
import daft as ft
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
import dask.dataframe as dd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
def fill columns ( df tables ):
for index, table in enumerate(df tables):
col name = table.columns[0]. split ( )
for iin range(1, 19):
if f {col name[0]} {i} {col name[2]}not in list (table . columns):
table [ f {col name[0]} {i} {col name[2]}] = np.nan
col order = [f {col name[0]} {i} {col name[2]}for iin range(1, 19)]
df tables [index] = table[ col order ]
return df tables
def fill libraries ( df tables , libraries ):
for index, table in enumerate(df tables):
for library in libraries :
if library not in table .index:
table . loc [ library ] = pd.Series(dtype=’float64’)
df tables [index] = table.sort index ()
return df tables
def transform dataframes( df list , libraries , values ):
df tables = [pd.pivot table(df, \
values=values, index=’Library’, columns=’Operation’)
for df in df list ]
if len( df tables ) >1:
df tables = fill columns ( df tables )
df tables = fill libraries ( df tables , libraries )
return df tables
schema = {
Library: str ,
Operation’: str ,
Category’: str ,
WallTime’: float ,
CpuTime’: float
}
names = list(schema.keys())
df tables = []
# Data ingestion
df dataingestion = pd.read csv( \
DataIngestion.csv, sep=’; , names=names, dtype=schema, header=None).round(2)
df dataingestion [ Ratio’] = df dataingestion.CpuTime / df dataingestion.WallTime
df tables += transform dataframes( \
[ df dataingestion ], df dataingestion .Library.unique(), WallTime’)
# Data wrangling
df datawrangling = pd.read csv( \
DataWrangling.csv’, sep=’;’, names=names, dtype=schema, header=None).round(2)
df datawrangling[Ratio’] = df datawrangling.CpuTime / df datawrangling.WallTime
df tables += transform dataframes( \
[df datawrangling], df datawrangling.Library.unique(), WallTime’)
# Data io
90
Mathias Asplin Big data analytics and processing
df dataio = pd.read csv( \
DataIO.csv’, sep=’; , names=names, dtype=schema, header=None).round(2)
df dataio [ Ratio’] = df dataio.CpuTime / df dataio.WallTime
df tables += transform dataframes( \
[ df dataio ], df dataio .Library.unique(), WallTime’)
# Data exploration
df dataexploration = pd.read csv( \
DataExploration.csv’, sep=’; , names=names, dtype=schema, header=None).round(2)
df dataexploration [ Ratio’] = df dataexploration.CpuTime / df dataexploration.WallTime
df bigdata = pd.read csv(’BigData.csv’, \
sep=’; , names=names, dtype=schema, header=None).round(2)
df bigdata[ Ratio] = df bigdata.CpuTime / df bigdata.WallTime
df medium = df bigdata[df bigdata.Operation.str.contains(’medium’, case=False, regex=True)].copy()
df large = df bigdata[(df bigdata.Operation.str.contains( large , case=False, regex=True)) &
(˜df bigdata.Operation.str.contains( very’ , case=False, regex=True))].copy()
df verylarge = df bigdata[df bigdata.Operation.str. contains( verylarge , case=False, regex=True)].copy()
df many = df bigdata[df bigdata.Operation.str.contains(many’, case=False, regex=True)].copy()
df list = [df dataexploration, df medium, df large, df verylarge , df many]
df tables += transform dataframes(df list, df dataexploration.Library.unique(), WallTime’)
for iin range(8):
df tables [ i ] = df tables [ i ]. round(3)
df tables [ i ]. columns = df tables[i ]. columns.str. replace( , )
df tables [ i ]. index = df tables[ i ]. index. str .replace( , )
for iin range(3, 8):
df tables [ i ]. columns = [f’Task {i}for iin range(1, 19)]
df sub table = df tables [0]. loc [[ Pandas python’, ’Pandas c’, Pandas numpy’]]
df sub table = df sub table. fillna ( df sub table .min())
df dataio = df tables [0]. copy()
df dataio . loc [[ Pandas python’, ’Pandas c’, Pandas numpy’]] = df sub table
df read merge sum = df dataio.sum(axis=1).drop(
[ Pandas c’, Pandas python’]
). reset index (name = ’Read Merge’)
df data wrangling sum = df tables[1].sum(axis=1).reset index(name = ’Data Wrangling’)
df read write csv sum = df tables [2][
[ Read csv’, Write csv ]
]. sum(axis=1, min count=1).reset index(name = ’Read Write Csv’)
df read write parquet pa sum = df tables [2]. drop(
[ Polars outofcore ]
)[
[ Read pyarrow parquet’, ’Write pyarrow parquet’]
]. sum(axis=1, min count=1).reset index(name = ’R/W Parquet (PA)’)
df read write parquet r sum = df tables [2]. loc [[ Polars inmem’, ’Polars outofcore ]][
[ Read rust parquet’, Write rust parquet’]
]. sum(axis=1).reset index(name = ’R/W Parquet (R)’)
df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),
[
df read merge sum,
df data wrangling sum,
df read write csv sum,
df read write parquet pa sum,
df read write parquet r sum
]). sort values (by = ’Library’)
df tables .append(df merged.set index(’Library’))
df merged = df merged.set index(’Library’)
df largest = df merged.apply(pd.Series.nsmallest, n=1, axis=0).reindex(df merged.index, axis=0)
df merged = df merged.reset index(’Library’)
df largest df largest . isnull ()] = 1
df tables .append(df largest)
df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),
91
Mathias Asplin Big data analytics and processing
[
df tables [3]. dropna(how=’all’, axis=0).dropna(axis=1) \
.sum(axis=1).reset index(name=’Small’),
df tables [4]. dropna(how=’all’, axis=0).dropna(axis=1) \
.sum(axis=1).reset index(name=’Medium’),
df tables [5]. dropna(how=’all’, axis=0).dropna(axis=1) \
.sum(axis=1).reset index(name=’Large’),
df tables [6]. dropna(how=’all’, axis=0).dropna(axis=1) \
.sum(axis=1).reset index(name=’Verylarge’),
df tables [7]. dropna(how=’all’, axis=0).dropna(axis=1) \
.sum(axis=1).reset index(name=’Many’)
])
df merged = df merged.set index(’Library’).reindex( df tables [3]. index.unique(), axis=0) \
. reset index (names = ’Library’)
df merged = df merged.sort values(by = ’Library’)
df tables .append(df merged.set index(’Library’))
df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’index’ ], how=’outer’, sort=True),
[
df tables [3]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \
. reset index (name = ’Small’),
df tables [4]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \
. reset index (name = ’Medium’),
df tables [5]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \
. reset index (name = ’Large’),
df tables [6]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \
. reset index (name = ’Verylarge’),
df tables [7]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \
. reset index (name = ’Many’)
])
df merged = df merged.rename(columns={’index’: ’Library’})\
. set index( Library ). reindex( df tables [3]. index, axis=0)
df merged = df merged.sort values(by = ’Library’)
df tables .append(df merged)
schema = {
Library: str ,
Operation’: str ,
Category’: str ,
WallTime’: float ,
CpuTime’: float
}
names = list(schema.keys())
df ratio tables = []
# Data ingestion
df dataingestion = pd.read csv(’DataIngestion.csv’, \
sep=’; , names=names, dtype=schema, header=None).round(2)
df dataingestion [ Ratio’] = df dataingestion.CpuTime / df dataingestion.WallTime
df ratio tables += transform dataframes([df dataingestion], df dataingestion.Library.unique(), Ratio’)
# Data wrangling
df datawrangling = pd.read csv(’DataWrangling.csv’, \
sep=’; , names=names, dtype=schema, header=None).round(2)
df datawrangling[Ratio’] = df datawrangling.CpuTime / df datawrangling.WallTime
df ratio tables += transform dataframes([df datawrangling], df datawrangling.Library.unique(), ’Ratio’)
# Data io
df dataio = pd.read csv(’DataIO.csv’, sep=’; , \
names=names, dtype=schema, header=None).round(2)
df dataio [ Ratio’] = df dataio.CpuTime / df dataio.WallTime
df ratio tables += transform dataframes([df dataio], df dataio.Library.unique(), Ratio’)
# Data exploration
df dataexploration = pd.read csv(’DataExploration.csv’, sep=’; , names=names, dtype=schema, header=None).round(2)
df dataexploration [ Ratio’] = df dataexploration.CpuTime / df dataexploration.WallTime
92
Mathias Asplin Big data analytics and processing
df bigdata = pd.read csv(’BigData.csv’, sep=’; , names=names, dtype=schema, header=None).round(2)
df bigdata[ Ratio] = df bigdata.CpuTime / df bigdata.WallTime
df medium = df bigdata[df bigdata.Operation.str.contains(’medium’, case=False, regex=True)].copy()
df large = df bigdata[(df bigdata.Operation.str.contains( large , case=False, regex=True)) &
(˜df bigdata.Operation.str.contains( very’ , case=False, regex=True))].copy()
df verylarge = df bigdata[df bigdata.Operation.str. contains( verylarge , case=False, regex=True)].copy()
df many = df bigdata[df bigdata.Operation.str.contains(many’, case=False, regex=True)].copy()
df list = [df dataexploration, df medium, df large, df verylarge , df many]
df ratio tables += transform dataframes(df list, df dataexploration.Library.unique(), Ratio’)
for iin range(8):
df ratio tables [ i ] = df ratio tables [ i ]. round(3)
df ratio tables [ i ]. columns = df ratio tables [ i ]. columns.str. replace( , )
df ratio tables [ i ]. index = df ratio tables [ i ]. index.str . replace( , )
for iin range(3, 8):
df ratio tables [ i ]. columns = [f’Task {i}for iin range(1, 19)]
df sub table = df ratio tables [0]. loc [[ Pandas python’, ’Pandas c’, Pandas numpy’]]
df sub table = df sub table. fillna ( df sub table .min())
df dataio = df ratio tables [0]. copy()
df dataio . loc [[ Pandas python’, ’Pandas c’, Pandas numpy’]] = df sub table
df read merge sum = df dataio.mean(axis=1).drop(
[ Pandas c’, Pandas python’]
). reset index (name = ’Read Merge’)
df data wrangling sum = df ratio tables [1]. mean(axis=1).reset index(name = ’Data Wrangling’)
df read write csv sum = df ratio tables [2][
[ Read csv’, Write csv ]
]. mean(axis=1).reset index(name = ’Read Write Csv’)
df read write parquet pa sum = df ratio tables [2]. drop(
[ Polars outofcore ]
)[
[ Read pyarrow parquet’, ’Write pyarrow parquet’]
]. mean(axis=1).reset index(name = ’R/W Parquet (PA)’)
df read write parquet r sum = df ratio tables [2]. loc [
[ Polars inmem’, ’Polars outofcore ]
][
[ Read rust parquet’, Write rust parquet’]
]. mean(axis=1).reset index(name = ’R/W Parquet (R)’)
df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),
[
df read merge sum,
df data wrangling sum,
df read write csv sum,
df read write parquet pa sum,
df read write parquet r sum
]). sort values (by = ’Library’)
df ratio tables .append(df merged.set index(’Library’))
df merged = df merged.set index(’Library’)
df largest = df merged.apply(pd.Series.nlargest, n=1, axis=0).reindex(df merged.index, axis=0)
df merged = df merged.reset index(’Library’)
df largest df largest . isnull ()] = 1
df ratio tables .append(df largest)
df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),
[
df ratio tables [3]. dropna(how=’all’, axis=0).dropna(axis=1) \
.mean(axis=1).reset index(name=’Small’),
df ratio tables [4]. dropna(how=’all’, axis=0).dropna(axis=1) \
.mean(axis=1).reset index(name=’Medium’),
df ratio tables [5]. dropna(how=’all’, axis=0).dropna(axis=1) \
.mean(axis=1).reset index(name=’Large’),
df ratio tables [6]. dropna(how=’all’, axis=0).dropna(axis=1) \
.mean(axis=1).reset index(name=’Verylarge’),
93
Mathias Asplin Big data analytics and processing
df ratio tables [7]. dropna(how=’all’, axis=0).dropna(axis=1) \
.mean(axis=1).reset index(name=’Many’)
])
df merged = df merged.set index(’Library’).reindex( \
df ratio tables [3]. index.unique(), axis=0).reset index(names = ’Library’)
df merged = df merged.sort values(by = ’Library’)
df ratio tables .append(df merged.set index(’Library’))
df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’index’ ], how=’outer’, sort=True),
[
df ratio tables [3]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \
.value counts (). reset index (name = ’Small’),
df ratio tables [4]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \
.value counts (). reset index (name = ’Medium’),
df ratio tables [5]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \
.value counts (). reset index (name = ’Large’),
df ratio tables [6]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \
.value counts (). reset index (name = ’Verylarge’),
df ratio tables [7]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \
.value counts (). reset index (name = ’Many’)
])
df merged = df merged.rename(columns={’index’: ’Library’})\
. set index( Library ). reindex( df ratio tables [3]. index, axis=0)
df merged = df merged.sort values(by = ’Library’)
df ratio tables .append(df merged)
df tables += df ratio tables [8:]
df tables [0] = df tables [0][
[ Read single , Merge single , Read merge’]
]
df tables [1] = df tables [1][
[ Artists single , Boolean single , Datetime single , Duration single , Data wrangling all ]
]
df tables += [df tables[4]/ df tables [3], df tables [5]/ df tables [4], \
df tables [6]/ df tables [5], df tables [7]/ df tables [5]]
df tables += [pd.DataFrame({’Library’: [’Pandas’, ’Polars’, Polars (R)’, Duckdb’, ’Daft’, Dask’],
Parquet’: [515, 515, 623, 725, 515, 742],
Csv’: [1741, 1728, np.nan, 1674, 2032, 1734]}). set index( Library )]
table max = [50, 50, 50, 50, 100, 500, 1000, 1000, 100, 1, 5000, 15, 5, 1, 5, 15, 30, 100, 100, 100, 2032]
for table , size max in zip( df tables , table max):
fig = plt. figure ( figsize =[i6for iin (3,2)])
ax = sns.heatmap(table.T, annot=True, lw=0, cbar=False, fmt=’g’, \
annot kws={’fontsize’: 24}, vmax=size max)
ax.tick params(axis=’both’, labelsize =18)
ax.set (xlabel=None)
ax.set (ylabel=None)
plt . xticks(rotation=30)
plt . yticks(rotation=0)
plt . tight layout ()
plt .show()
Listing 230: Graphs Code
94