VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF Free Download

Name: VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF
Author: Linda Lewis

1 / 100

0 views•100 pages

VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF Free Download

VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF free Download. Think more deeply and widely.

M¨alardalen University

School of Innovation, Design and Engineering

V¨aster˚as, Sweden

Thesis for the Degree of Bachelors of Science in Computer Science -

15.0 credits

VERTICAL SCALING FOR BIG DATA

ANALYTICS AND PROCESSING - A

CASE STUDY

Mathias Asplin

man18002@student.mdu.se

Examiner: Abu Naser Masud

M¨alardalen University, V¨aster˚as, Sweden

Supervisor(s): Masoud Daneshtalab

M¨alardalen University, V¨aster˚as, Sweden

06/02/2025

Mathias Asplin Big data analytics and processing

Abstract

Big data is often used as a buzzword but despite this common usage of the term it is still

poorly understood. This paper explores new methods of handling big data based on technological

advancements in big data tools, cloud computing capabilities and ﬁle storage. This paper focuses on

new tools such as Polars, Duckdb and Daft to determine how mature these tools are and whether

they hold their promises. By using a busy environment consisting of a 4 core machine with 16GB

of RAM, Duckdb and Daft were able to analyze and process datasets ranging from a few million

rows to over a billion rows. While Polars fell behind as the data size grew, it proved formidable

for data engineering tasks consisting of data ingestion, wrangling and I/O. With the ability of

these tools to perform out of core computations on datasets many times larger than RAM and

cloud vendors oﬀering machines with thousands of cores and tens of terabytes of RAM on a single

machine, analyzing and processing big data has never been more simplistic. Reduction in code

complexity, environment complexity, compute costs and lowering the skill ceiling for data analytics

and engineering problems are a few advantages when compared to distributed big data tools such as

Apache Hadoop and Spark.

Mathias Asplin Big data analytics and processing

Table of Contents

1. Introduction 1

1.1 Abriefhistoryofdata.................................. 1

1.2 Thetoolsneeded ..................................... 1

1.3 Bigdatainpractice ................................... 2

1.4 ProblemFormulation................................... 3

1.5 Scopeandlimitations .................................. 3

2. Background 5

2.1 Whatisbigdata? .................................... 5

2.2 Doesbigdatamatter? .................................. 5

2.3 Tools for big data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.4 Distributed data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.5 Howtousedata ..................................... 7

2.6 ToolsforETLprocesses ................................. 8

2.6.1 MicrosoftExcel.................................. 8

2.6.2 Relationaldatabases............................... 8

2.6.3 Python ...................................... 8

2.6.4 Commonly used Python libraries . . . . . . . . . . . . . . . . . . . . . . . . 8

2.7 Dataframes ........................................ 9

2.8 Cloudcomputingconcepts................................ 9

2.8.1 Objectstorage .................................. 9

2.8.2 Compute ..................................... 9

2.8.3 Serverless ..................................... 10

2.9 Datastorage ....................................... 10

2.9.1 OLTPdatabases ................................. 10

2.9.2 OLAPdatabases ................................. 10

2.9.3 Datalakehouse.................................. 10

2.10 New tools for data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.10.1 Dask........................................ 10

2.10.2 DuckDB...................................... 11

2.10.3 Polars ....................................... 11

2.10.4 Daft ........................................ 11

2.10.5 ApacheParquet ................................. 11

2.10.6 ApacheArrow .................................. 11

2.11 Eager and Lazy evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3. Related Work 13

3.1 Databases......................................... 13

3.2 Research using vertical scaling tools . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3 Bigdatamindset..................................... 13

4. Method 14

5. Ethical and Societal Considerations 15

6. The experiments 16

6.1 Overview ......................................... 16

6.2 Dataingestion ...................................... 16

6.3 Datawrangling...................................... 17

6.4 DataI/O ......................................... 17

6.5 Data exploration and analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

6.6 Detailsabouttests .................................... 18

Mathias Asplin Big data analytics and processing

7. Results 20

7.1 DataIngestion ...................................... 20

7.1.1 Pandas ...................................... 20

7.1.2 Polars ....................................... 20

7.1.3 Duckdb ...................................... 20

7.1.4 Daft ........................................ 20

7.1.5 Dask........................................ 20

7.2 DataWrangling...................................... 20

7.2.1 Pandas ...................................... 20

7.2.2 Polars ....................................... 20

7.2.3 Duckdb ...................................... 21

7.2.4 Daft ........................................ 21

7.2.5 Dask........................................ 21

7.3 DataIO.......................................... 21

7.3.1 Pandas ...................................... 21

7.3.2 Polars ....................................... 21

7.3.3 Duckdb ...................................... 22

7.3.4 Daft ........................................ 22

7.3.5 Dask........................................ 22

7.3.6 Files........................................ 22

7.4 Data Ingestion, Wrangling and IO summary . . . . . . . . . . . . . . . . . . . . . . 23

7.4.1 Generaloverview................................. 23

7.4.2 CPUutilization.................................. 23

7.5 Data Exploration/Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

7.5.1 Smalldataset................................... 24

7.5.2 Mediumdataset ................................. 25

7.5.3 Largedataset................................... 26

7.5.4 Verylargedataset ................................ 26

7.5.5 Manyﬁlesdataset ................................ 26

7.6 DataAnalyticssummary................................. 27

7.6.1 Generaloverview................................. 27

7.6.2 CPUutilization.................................. 27

7.6.3 Walltime diﬀerence between datasets . . . . . . . . . . . . . . . . . . . . . . 28

8. Discussion 36

8.1 Benchmarking....................................... 36

8.1.1 Overview ..................................... 36

8.1.2 Continous improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

8.2 Practicalconsiderations ................................. 36

8.3 Practicallimitations ................................... 37

8.4 Costs and relation to cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . 37

8.5 New perspectives on big data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

8.6 Business considerations and dynamic data storage . . . . . . . . . . . . . . . . . . 39

8.7 Bigdatainuniversities.................................. 39

9. Conclusions 40

9.1 Generalconclusions.................................... 40

9.2 Pandasevolution ..................................... 40

9.3 Polars ........................................... 40

9.4 DuckdbandDaft..................................... 40

9.5 Paired with cloud computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

9.6 FutureWork ....................................... 40

References 45

Appendix A Appendex Python Imports 45

iii

Mathias Asplin Big data analytics and processing

Appendix B Appendix Pandas Code 46

Appendix C Appendix Polars In Memory Code 51

Appendix D Appendix Polars Out Of Core Code 56

Appendix E Appendix Duckdb Code 61

Appendix F Appendix Duckdb View Code 69

Appendix G Appendix Daft In Memory Code 76

Appendix H Appendix Daft Out Of Core Code 80

Appendix I Appendix Dask Code 85

Appendix J Appendix Graphs Code 90

Mathias Asplin Big data analytics and processing

List of Figures

1 Printout of a Pandas dataframe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Simpliﬁed architecture of the Parquet ﬁle both in memory and in storage . . . . . 11

3 Data Ingestion walltime average over 5 runs . . . . . . . . . . . . . . . . . . . . . . 21

4 Data Wrangling walltime average over 5 runs . . . . . . . . . . . . . . . . . . . . . 22

5 Filesizesinmegabytes.................................. 23

6 Data IO walltime average over 5 runs . . . . . . . . . . . . . . . . . . . . . . . . . 24

7 Data summary for ingestion, wrangling and I/O . . . . . . . . . . . . . . . . . . . 25

8 Data summary best performance for ingestion, wrangling and I/O . . . . . . . . . 26

9 Data summary of CPU utilization for ingestion, wrangling and I/O as a factor of

CPU time divided by walltime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

10 Data summary of best CPU utilization performance for ingestion, wrangling and I/O 28

11 Data Analytics small dataset walltime average over 5 runs . . . . . . . . . . . . . . 29

12 Data Analytics medium dataset walltime for one run . . . . . . . . . . . . . . . . . 29

13 Data Analytics large dataset walltime for one run . . . . . . . . . . . . . . . . . . . 30

14 Data Analytics very large dataset walltime for one run . . . . . . . . . . . . . . . . 30

15 Data Analytics many ﬁles dataset walltime for one run . . . . . . . . . . . . . . . . 31

16 Data Analytics summary for all runs, compared between datasets . . . . . . . . . . 31

17 Data Analytics best performance summary for all runs, compared between datasets 32

18 Data Analytics summary of CPU utilization as a factor of CPU time divided by

walltime for each dataset size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

19 Data Analytics summary of best CPU utilization performance between the tools for

eachdatasetsize ..................................... 33

20 Data Analytics time increase as a factor when going from a small dataset to a

mediumdataset...................................... 33

21 Data Analytics time increase as a factor when going from a medium dataset to a

largedataset ....................................... 34

22 Data Analytics time increase as a factor when going from a large dataset to a very

largedataset ....................................... 34

23 Data Analytics time increase as a factor when going from a large dataset to a very

largedatasetinmanyﬁles................................ 35

Mathias Asplin Big data analytics and processing

1. Introduction

“The world’s most valuable resource is no longer oil, but data.”

This is the title of an Economist article from 2017 [1]. Data secretly runs the world. It is

everywhere and has never been as ubiquitous as now. Knowing how to use and handle data is

becoming increasingly diﬃcult as the volumes grow. The term big data is often thrown around

in adjacency with subjects such as data analytics, data science, artiﬁcial intelligence and machine

learning, but what does it mean and how does it diﬀer from traditional data? This is a question

that is constantly being asked by companies all around the world and they still do not agree with

each other on a precise deﬁnition.

1.1 A brief history of data

When computers were ﬁrst invented, the philosophy around data boiled down to keeping as little

data as possible and not wasting any bits in the system. With the advances in computer hardware,

both software developers and companies noticed how they were not bound to such restrictions

anymore and by collecting data they could also analyze it.

The concept of analyzing data has been around for thousands of years. It was proven valuable

in gaining a competitive advantage or to improve decision making. Romans even used data to

predict the likelihood of events [2]. While the data analysis was historically done with pen and

paper, there was a change in the technology medium as computers became more sophisticated.

This technology developed out of sync however, while computers increased in performance, the

ability to store data increased even more. This is the fundamental change that led to the concepts

of big data and can be summarized to data that is too big to process in a single machine either

due to time to process or it simply not ﬁtting into the machine’s main memory.

Data in the modern era comes from a variety of sources ranging from web based user content,

sensors, mobile devices, internet of things and shared through various mediums between companies.

The ability to process and analyze this data has become a serious concern for companies as only

6% state that they experience no problems with big data technology, according to BARC [3].

MongoDB, a NoSQL database provider, states that the overall amount of data that is stored

doubles every two years. In addition to this, some of the biggest challenges includes data quality,

data security, data processing and hiring skills big data engineers.

They write,

The applications for Big Data are practically limitless if organizations can ﬁnd enough

people with the skills to implement them. Not many people are actually trained in Big

Data, and businesses face a major shortage of experienced and certiﬁed data scientists

and data analysts.

1.2 The tools needed

While very large technology companies such as Google, Meta, Apple, Microsoft and Amazon have

the resources build their own tools to solve these problems, it is not as simple for the vast majority

of companies to do so.

Free and open source tools were used a lot but they came with their own problems. Hadoop,

which works on a distributed systems with their MapReduce model, lacked support for a general

programming model. Even as early as 2012, 6 years after initial release, companies experienced

problems in regards to performance and resource management, among other things [4].

In order to solve the problems with Hadoop, Spark was created and shared the same open

source environment as Hadoop. Spark was built to use the distributed ﬁle storage of Hadoop,

Mathias Asplin Big data analytics and processing

HDFS, but processed the data diﬀerently. While Spark did not share the same core issues as

Hadoop, it had problems of its own which came down to how distributed systems work in general.

Skewing, where one node got more data than others, spilling data from primary memory to disk

and shuﬄing the data between nodes [5]. These operations are unavoidable in a distributed system

but can be mitigated by skilled big data engineers.

Possible ways to avoid these issues includes the following,

•Buy services from cloud providers.

•Buy services from third parties.

•Buy super computers.

While these options work, each of them carry their own problem. Buying services from cloud

providers or third parties like Snowﬂake and Databricks can be expensive. They are prone to

the same risks of distributed computing as Spark as they often use distributed systems in the

background [6]. The companies also charge per query and data recovered which can be money

spilled due to errors in queries or unexpected bugs. Even if the user sometimes have the ability to

choose the amount of compute instead of a serverless conﬁguration, developers and analysts alike

might not understand or care to tweak it for their individual problems and thus opts for more

compute than they need. Super computers is not an option companies chooses due to the heavy

initial cost as well as the cost of upkeep [7]. Super computers are also not scalable once the data

eventually reaches the limits.

1.3 Big data in practice

So, how would a list of requirements for big data analytics and processing look like?

•Cost eﬀective.

•Scalable.

•Performant.

•Fault tolerant.

•Flexible.

•Easy to setup.

•Easy to use.

While the solutions above solve some of these points, there is no solution that encompass all of

them.

According to a study made by Fivetran, an ETL service provider, the average Snowﬂake cus-

tomer spends $300,000 per year, of which 90% are queries and of those queries, over 50% are data

processing queries. The study further concludes that the median query scans about 100MB of data

and 0.1% of all queries scans above 300GB of data [8].

Distributed solutions seemed to be the defacto standard when dealing with big data, mainly

due to Hadoop’s popularity and the lack of other open source alternatives. With advancements

in cloud computing and new open source tools emerging, is it time to reevaluate the premise that

drives big data analytics and processing, namely the distributed model?

The big beneﬁts of working with a vertical instead of horizontal scaling model are plenty.

Data skewing and data shuﬄing are no longer issues. With solid tools and frameworks the code

Mathias Asplin Big data analytics and processing

complexity decreases and the bar for working with big data is decreased drastically. The cost is

also decreased due to not needing multiple machines. Modern cloud providers oﬀers machines with

tens of terrabytes of primary memory and hundreds of cores [9].

Additional beneﬁts includes the ability to use laptops or cheap hardware to conduct complex

analytics should the data be big enough to ﬁt in storage but not in the primary memory.

Distributed analytics and processing will always have its place, for example for I/O heavy

computations or to add multiple GPUs. What vertical scaling oﬀers applies to the majority but

not all cases.

This paper will explore the tools and concepts which lies the foundation of a new way of thinking

regarding the ability to analyze and process large amounts of data. While the experiments will be

conducted on tabular data, the methods can equally be applied to any other type of data processing

such as image processing. It will also touch on the advancements of cloud technologies that enables

single machine data processing to scale.

1.4 Problem Formulation

This paper is a case study which primarily focuses on the tools and frameworks for big data

analytics and processing, mainly looking at tools for tabular datasets. It will evaluate tools in

terms of speed, memory stability and ability to perform computations on data outside of the

primary memory.

The question this paper aims to answer is which tools exist, how can they be used for big

data analytics and processing as well as what the average user should expect from them. The

benchmarks are done primarily as a way to illustrate their strengths, limitations and evolutions

rather than to compare them in terms of speed. Multiple tools can be used in conjunction to solve

complex problems.

Some of the core principles of vertical scaling includes.

•Cost eﬀective data analytics and processing.

•Possibility of vertical scalability.

•Performant tools and frameworks.

•Fault tolerant data analytics and processing.

•Reduction in complexity regarding setup and usage.

Cost eﬀective data analytics and processing is important to keep costs low in any environment,

whether it is a company environment, research environment or for personal usage. Vertical scal-

ability is important to adjust for both small and large data loads. It is important to ensure the

tools and frameworks actually utilizes all of the computers resources such as all of the cores and

I/O capabilities fully. For complex analytics it is necessary to ensure if a fault happens it happens

gracefully or the ability to reduce the sources or faults in general. Lastly, it is important to ensure

a reduction in technical complexity compared to a distributed setup, both in terms of the compute

environment and the usage of the compute environment.

1.5 Scope and limitations

1. This case study will not go into the extreme cases nor go into areas where analytics is not

used, for example distributed request handling of a large website split across several servers.

Mathias Asplin Big data analytics and processing

2. Even though some of the tools used in the benchmarking can use GPUs for processing, this

will not be considered due to the GPUs themselves requires in memory processing.

3. Only the primary means of writing queries will be tested, so if a tool has an SQL interface

that is not the primary means of using the tool, it will not be considered.

4. This case study will not use custom functions.

5. This case study will also not consider all of the options, conﬁgurations or settings. It will

have an ”out of the box” approach.

Mathias Asplin Big data analytics and processing

2. Background

2.1 What is big data?

This is a question that is constantly being asked by companies all around the world and they still

do not agree with each other on a precise deﬁnition.

Michael Chen at Oracle, a software company known among other things for their database

writes that

“Big data refers to extremely large and complex data sets that cannot be easily managed

or analyzed with traditional data processing tools, particularly spreadsheets.” [10]

They further continues by clarifying the characteristics of big data, namely the ﬁve V’s, volume,

velocity, variety, veracity and value.

Informatica, a software company that focuses on data engineering tools, writes

“Big data refers to the 21st-century phenomenon of exponential growth of business data,

and the challenges that come with it, including holistic collection, storage, management,

and analysis of all the data that a business owns or uses.” [11]

They further continues by clarifying the characteristics of big data, the four V’s, volume,

velocity, variety and veracity.

Google also has a diﬀerent take on big data, they write as follows

“Big data refers to extremely large and diverse collections of structured, unstruc-

tured, and semi-structured data that continues to grow exponentially over time. These

datasets are so huge and complex in volume, velocity, and variety, that traditional data

management systems cannot store, process, and analyze them.” [12]

While these three diﬀerent takes on what big data is does have similarities, it still leaves

questions unanswered. Is my data big data? Do I need to use big data tools? How many V’s do I

need to consider? Three like Google, four like Informatica or ﬁve like Oracle?

2.2 Does big data matter?

Some companies might even question the need for analysing and handling big data at all. Why

bother with costly infrastructure and staﬀ to manage an analytics platform? According to Google,

companies in general perform better when they’re able to analyze their data. [12]

•“58% of companies that make data-based decisions are more likely to beat revenue targets

than those that don’t.”

•“Organizations with advanced insights-driven business capabilities are 2.8x more likely to

report double-digit year-over-year growth.”

•“Data-driven organizations generate, on average, more than 30% growth per year.”

“We don’t have better algorithms. We have more data.”

Mathias Asplin Big data analytics and processing

This is a quote from Peter Norvig, Google’s Research Director [13]. His quote refers to how more

data is more eﬀective in training Machine Learning models rather than ﬁne tuning their models,

however this can also be applied to regular statistics and other forms of statistical modeling. For

companies, being data driven is becoming a necessity and with big data it is becoming a challenge.

For the small companies it can be about costs. Staﬀ, data analytics platforms and services to

handle the data. For large companies it can be about data complexity, integration with legacy

systems and processing costs. These are merely the business aspects of problems with big data.

While big data poses a challenge, research also shows that companies who can handle their big

data also beneﬁt from it. In the end, the decision to become data driven is entirely individual,

but in a world where data has been hailed as “the gold/oil of the 21th century”, the question of

whether the companies not willing to adapt will suﬀer because of it is becoming a yes.

2.3 Tools for big data processing

Tools for big data can be considered a subject of its own where the ecosystem consists of thousands

of diﬀerent variants, each with their strengths and weaknesses as well as overlaps. It is therefore

unreasonable that anyone knows all of them. This paper will focus speciﬁcally on the processing

tools and only bring up a few popular tools and services today without explaining how they work

in detail.

Apache Hadoop launched in 2006. It quickly became a defacto standard for big data processing

due to its ability to distribute the data with its Hadoop Distributed File System (HDFS) and

workload via its MapReduce paradigm on multiple machines, thereby leveraging a combined eﬀort

to solve the processing problems [14].

Apache Spark release in 2013 as an improvement of the MapReduce paradigm of Hadoop, which

required a linear ﬂow which was not suitable for all processing demands. Spark relies on the HDFS

solution of Hadoop and is also designed primarily for distributed tasks [15], [16].

There are plenty of companies that oﬀers data analytics as a service, most notably Snowﬂake,

Databricks, Amazon AWS, Microsoft Azure and Google Cloud. What all of these services and

companies have in common is that their solutions are based on a distributed architecture to leverage

the scalability of machine clusters [8]. While it is possible to run these tools and services on

single machines, the overhead voids the beneﬁts. When using the services for data analytics and

processing, the question of cost comes into the picture.

Mathias Asplin Big data analytics and processing

2.4 Distributed data processing

Hadoop has been popular in the big data industry for two primary reasons, one, its HDFS which en-

ables storage of data across multiple machines and two, its MapReduce programming model which

enables processing across these machines. While the technology has constantly been improved

upon, the philosophy still remains the same, to split the data and processing across multiple

worker nodes which enables them to work together. This has many advantages.

•Horizontal scalability. If the processing is slow, simply add more worker nodes.

•Vertical scalability. Nodes can use multi-core processing.

•Batched processing. It does not matter if the data ﬁts into the main memory or not.

•Persistent storage across multiple devices as a logical unit. Should one worker node fail, there

is redundancy on other machines.

While there are advantages, there are also disadvantages.

•Environment complexity. Setting up and conﬁguring the main node and worker nodes is a

challenge by itself.

•Programming complexity. If data needs to be transferred between nodes often there is a

reduced performance gain. Bugs can be extremely diﬃcult to solve.

•Cost. This comes in many shapes and forms. The cost of setting up a cluster. The cost of

maintaining a cluster. Network costs. Storage costs, including HDFS redundancy storage

cost.

•Optimization diﬃculties. Even with competent staﬀ it can be diﬃcult to optimize the uti-

lization of the cluster. If it is not in use it drains money, but if it’s shut down there’s a cost

of getting it back up.

This technology has been improved upon during the past decade but the disadvantages are

still there. These distributed solutions requires the scale to be truly above the capacity of a single

machine in order to be eﬃcient and necessary. The overhead of of these systems when used for a

single node only helps to drain the resources.

2.5 How to use data

While distributed data processing explains how to handle big data, it does not explain the use

cases for big data, or data in general. While the many components needed for systems to handle

big data is out of scope for this paper, knowing what Extract, Load and Transform, commonly

referred to as ETL, actually means is crucial for understanding why data engineering is not a trivial

task as it may seem at ﬁrst glance.

Extract is the process of taking data from one or multiple sources, sometimes from other local

databases, other times from APIs of other systems. Load is the process of loading it into a program

or database. Finally, Transform is the process of changing its shape to suit the needs. This is what

this paper will be focusing on as it often is a bottleneck in many data processing systems. Some

examples of work can be found below.

•Cleaning data which includes removing unnecessary columns, renaming columns, removing

or ﬁxing bad values, changing data types, renaming words and a lot more.

•Sorting data is also necessary, often alphabetically or date wise.

•Grouping data, perhaps the same product or product group is a core feature.

•Data aggregations such as summations, means or standard deviations are also often necessary.

Mathias Asplin Big data analytics and processing

•Combining data through joins or concatenations.

•Custom functions.

After the ETL process is done it can be used for other purposes, including dashboards for

daily management, queried by data scientists or analysts to use in data modeling or statistics, sent

or sold to customers, used to train or update Artiﬁcial Intelligence models and more. The ETL

process can happen multiple times before data becomes consumption ready and it can even be ran

several times a day if necessary.

2.6 Tools for ETL processes

While Hadoop and Spark can be used for ETL it is not a necessity. Unlike preparing large datasets

for a Machine Learning model to train, ETL can be both small and large in scope, usually a mix

of the two. This means most ETL processes today does not need a large cluster to be processed,

it can require a small cluster or a single node. While there are “low/no code” tools and services,

these will not be considered due to their inﬂexibility.

2.6.1 Microsoft Excel

Excel is a widely known tool for many purposes. A typical use is data handling, with its tabular

format it is easy both to handle and visualise data. Many ETL processes uses Excel as a processor,

even automated ETL processes due to its ability to include scripts. However, this environment

is not ideal for handling large amounts of data. It has a limit of around 1 million rows and 16

thousand columns and is not ﬂexible enough for all use cases [17].

2.6.2 Relational databases

Relational databases is a widely used database structure similar to the tabular format of Excel,

although it is Excel that has taken inspiration from relational databases than the other way around.

Its tabular format allows for multiple beneﬁts such as minimal storage and not storing the same

data multiple times. Structured Query Language, SQL, was designed to perform operations on

these databases, from joining multiple tables to ingesting or extracting data.

It is a powerful system capable of planning a query in advance in order to process the query

as fast as possible, including using multiple cores. If the user wants to remove all entries before a

date, the query engine will simply not read that data, for example. This is another common way to

perform operations on an ETL pipeline however it is not ﬂexible as it needs to run on a database

and it cannot perform the same type of operations as a general purpose programming language.

2.6.3 Python

Python has become a popular programming language for data analytics and processing due to

its rich environment of libraries and ease of use because of its clean syntax and code being inter-

preted rather than compiled. Whether the use case is building web applications, making charts

or building Machine Learning models, it has libraries that supports each case in depth with very

few lines of code. While it is popular it has downsides such as its Global Interpreter Lock (GIL)

mechanism, locking the program to a single thread of execution. In addition to this, the language

is considered slow in its execution due to its inability to let the compiler to perform optimizations

before execution [18], [19].

2.6.4 Commonly used Python libraries

Due to Python’s limitations in performance, libraries are often created in other, more performant

languages. One such example is NumPy which is written in Python and C. Its core focus is

numerical mathematical functions, capable of building multidimensional arrays and using a variety

of diﬀerent mathematical functions, focusing on bridging the performant C programming language

Mathias Asplin Big data analytics and processing

to the simple syntax of Python as well as to have a cohesive library of widely used functions and

operations [20].

A second library that was built on top of the NumPy library is Pandas. Pandas stands for

Panel Data, an econometrics term for time series data. Pandas is able to structure data into

datasets called dataframes, resembling tables. It diﬀers in its underlying architecture however, as

it stores data column wise in memory, making it an eﬀective choice for processing data column

wise, which most data operations requires. It is designed to eﬀectively manipulate data whether

it’s ﬁltering, combining or using custom functions. It is integrated into other libraries such as

Matplotlib, making it easy to turn the dataset into charts for visualization. Pandas is also built in

high performant languages such as C and Cython, a performant version of Python [19], [21].

2.7 Dataframes

Dataframes are 2D tables organized as rows and columns as ﬁgure 1. They process data column-

wise and stores the data as columns rather than rows. This makes column-wise operations faster.

Dataframes usually supports common operations for data management and analytics such as se-

lecting subsets by ﬁltering, sorting, aggregations, group bys, joining, replacing and pivoting.

Figure 1: Printout of a Pandas dataframe

2.8 Cloud computing concepts

The topic of cloud computing is vast. Cloud computing is widely used for data storage, analytics

and processing. The three biggest cloud service providers consist of Amazon AWS, Microsoft

Azure and Google Cloud. In general, these providers share the same core services such as renting

computers and storage. This paper will use AWS terminology, but do note both Azure and Google

Cloud have similar services [22].

2.8.1 Object storage

The ﬁrst fundamental concept is the object storage, called S3 by AWS, sort for Simple Storage

Service. It can be described as an abstract way of storing ﬁles. AWS sorts ﬁles into buckets and

each bucket can contain an unlimited amount of ﬁles of any type, up to a size of 5TB per ﬁle. The

best way to imagine it is an inﬁnitely sized secondary storage device like SSD or HDD. Files can

be divided into folders much like a regular secondary storage device. Depending on usage, data

stored in an object storage can be referred to as a data lake [23], [24].

2.8.2 Compute

The second fundamental concept is called EC2 by AWS, short for Elastic Compute Cloud. It

is a computer renting service of diﬀerent sizes. The smallest contains a single core and 0.5GB

Mathias Asplin Big data analytics and processing

of memory while the largest has thousands of cores and tens of terabytes of memory. The main

idea behind this service is being able to rent a computer for any time needed, whether its for a

few minutes or for years. It supports anything from bare metal instances to any ﬂavor of Linux,

Windows or MacOS through ISO images which can be described as a snapshot of an operating

system. Any computer rented only contains the software provided by the ISO image and the act

of adding more software or loading data into the computer is referred to as provisioning [25].

2.8.3 Serverless

Serverless refers to the ability to run code without renting a speciﬁc EC2 instance. It automatically

provisions the computer and runs the code when necessary, this can be as a part of an ETL or as

an API service [26].

2.9 Data storage

It is important to clear up common misconceptions and diﬀerences between the main types of data

storage options and how they are used.

2.9.1 OLTP databases

OLTP databases are the most common type of database, used for transactional processing of

all kinds. The distinctive diﬀerence is both how it is used and how it works internally. OLTP

databases are primarily used for adding rows, updating rows or changing data in cells. Because of

their use case, the data is stored row-oriented in memory and on the storage medium. Microsoft

SQL server, MySQL, PostgreSQL and MongoDB are typically row-oriented [27].

2.9.2 OLAP databases

OLAP databases, often referred to as datawarehouses, are designed for analytical workloads. These

diﬀers from OLTP workloads as they typically only involve reads and they also work on columns

rather than rows. An example of this is having a column with total sales and querying it for

the sum of all sales. These store data column-wise both in memory and on the storage medium.

Examples of databases with OLAP capabilities are AWS Redshift, Azure Synapse Analytics and

SAP HANA [27].

2.9.3 Data lakehouse

While databases, both OLTP and OLAP, requires a computer to be running at all times, data

lakehouses leverages object storages to store data and only uses a computer to run queries. This

enables a separation of storage and compute which drastically reduces costs as storage is cheaper

and the queries are not run at all times. They can be queried with SQL much like a regular

database, but they work diﬀerently internally. Examples of managed data lakehouses are Snowﬂake,

Databricks and AWS S3 Tables. An example of data lakehouse technology is Apache Iceberg [28].

2.10 New tools for data processing

While Pandas (released 2008) and NumPy (released 2006 as NumPy, 1995 as Numeric) has been

around for decades and are widely adopted, to varying degrees, new libraries and tools have emerged

only in the past few years.

2.10.1 Dask

Released in 2015, Dask aimed to provide a parallel and distributed environment for Python pro-

grams. Its dataframes are build on top of Pandas to parallelize and distribute computations. It is

primarily designed as a distributed environment similar to Apache Spark. Dask works lazily and

only evaluates the data when needed and then returns a Python, NumPy or Pandas object. Dask

Arrays works with NumPy objects, Dask Bags with Python lists and Dask dataframe with Pan-

das. It supports real time eager function evaluation as well as batched lazy function evaluations. It

Mathias Asplin Big data analytics and processing

has support for Machine Learning through scikit-learn wrappers, enabling distributed and parallel

PyTorch, Tensorﬂow and Keras applications [29]–[31].

2.10.2 DuckDB

Released in 2019, DuckDB is a column oriented relational database management system focusing

on online analytical processing. It is designed for data science and data engineering workloads

and promises ACID compliance similar to a regular relational database. It prides itself in being

incredibly ﬂexible and portable, able to run almost anywhere, including Windows, Linux and

Mac, both x86 and ARM CPU architectures, web browsers, mobile phones and inside programs

themselves [32]–[34].

2.10.3 Polars

Started in 2020, Polars promises fast data processing through vectorization, parallelization as well

as lazy evaluation and out of core processing. It is written in Rust and promises no external

dependencies as well as full control of API, memory and execution of the code. Much like pandas

it is a dataframe oriented library rather than a database or optimization library [35], [36].

2.10.4 Daft

Released in 2022, Daft refers to itself the ”Uniﬁed Engine for Data Engineering, Analytics &

ML/AI”. Like Polars, it is written in Rust. It is designed for data loads of megabytes to petabytes.

It is both parallel and distributed. Daft only works lazily and its use cases range from data en-

gineering and analytics to machine learning. Unlike Polars, it is multimodal and enables image

processing for machine learning applications. Unlike Dask, which can run on multiple types of clus-

ters, including Hadoop, Daft only works on Ray, which is a distributed data processing framework

designed for large scale training of machine learning models [37].

2.10.5 Apache Parquet

Released in 2013, it is a column-oriented data ﬁle format that is designed for fast read and writes.

It promises high performance compression and encoding schemes to handle complex data in bulk.

It also stores metadata for better analytical processing [38].

Figure 2: Simpliﬁed architecture of the Parquet ﬁle both in memory and in storage

2.10.6 Apache Arrow

Open source project started in 2016, the goal being to standardize data structures in dataframe

implementations, improve interoperability with other data processing systems as well as promote

Mathias Asplin Big data analytics and processing

collaboration and code reusage across libraries and programming languages. It also works to ad-

dress issues of limited data types, excessive memory consumption, poor performance for operations

with strings and to accommodate larger than memory datasets [39].

2.11 Eager and Lazy evaluation

A short description is that eager evaluation executes the instructions as soon as they can and lazy

evaluation executes when needed [40]. Both concepts have advantages and disadvantages. For

eager evaluation, it can be faster on smaller datasets because it does not need the added overhead

that lazy operations do and it releases resources as soon as an expression is evaluated. For lazy

evaluation, the expression tree can be optimized to perform fewer computations before execution

and resources like memory and I/O operations can be optimized or shared for the entire expression

tree.

Mathias Asplin Big data analytics and processing

3. Related Work

3.1 Databases

While there has been plenty of research put into database systems, data warehouses and query

optimizations, these advancements have yet to consider new technology such as Apache Arrow and

Apache Parquet as well as the database engines and dataframe tools which uses them [41]. While

there are some papers that focuses on Duckdb, these are made by businesses that seek to market

their product based on the Duckdb engine rather than exploring data analytics and processing

tools [42]–[45].

3.2 Research using vertical scaling tools

For big data processing and analytics distributed solutions still dominate, however there are papers

where the authors have opted to use Polars or Duckdb instead of Hadoop and Spark. Both [46]

and [47] were able to leverage Polars and Duckdb not only to enable big data processing but also

to achieve better performance than distributed solutions.

3.3 Big data mindset

While some research papers includes single node big data tools, a good example of the current

mindset regarding big data can be found in [48] from 2022 and it also perfectly illustrates the

beneﬁts of vertical scaling.

“With the development of data sensing and Internet of Things technologies, the scale

of data continues to increase to reach TB, PB, and even higher levels, which can no

longer be completed by a single computer. This puts forward higher requirements for

the computing power and timeliness of the computer to process big data. Big data

processing involves the conﬁguration of the storage system, the division of computing

tasks, the distribution of computing load, the data migration between computers, and

the data security when the computer or network fails. The situation is much more

complicated.”

Mathias Asplin Big data analytics and processing

4. Method

This paper will mostly explore the emerging tools mentioned in background in a test environment

using an empirical experiment model. The test environment will be ran in a Jupyter notebook

environment on a busy system which has resources dedicated to other uses. It is meant to represent

a typical analytics exploration environment professionals uses instead of a server that is dedicated

to big data processing. This sets the bar low for most use cases. The primary concern is to test

whether these tools can do what they claim rather than how fast they can do it. While speed is

important and will be tested, it is extremely diﬃcult to benchmark in a fair way since the tools

specialize on diﬀerent things. It is easy to understand how Dask, a framework for distributed

processing, should not be compared in speed to libraries optimized for single node processing.

Nevertheless it is interesting to measure how this diﬀerence in performance can be leveraged for

big data that can be run on a laptop or similar small scale instances.

This paper will also explore the underlying ﬁle formats such as comma separated values (csv)

versus the parquet format. While both are used, it is important to measure their strengths and

weaknesses and take note of when a ﬁle format should be favored over the other. It will also explore

the associated systems such as cloud computing but this will be done as an overview rather than

actually testing them. It is also important to note that while this paper explores vertical scaling,

it will not actually use cloud computing to vertically scale the system for testing, instead all of the

tests are done on a local Windows 10 environment with 4 cores and 16GB of memory. Because of

the busy environment setup, these cores and memory cannot be utilized fully as unknown amount

of CPU and 10GB of memory is used by other processes.

It is further important to emphasize that these tools have a lot of cutomizable options and

exploring them all would be diﬃcult within the scope of this paper. For example Polars has a SQL

interface, a streaming mode and can be ran on the GPU. Even though GPU acceleration would

increase the performance and there might be a diﬀerence in performance between the Python

interface and SQL interface, as well as a performance increase for handling larger than memory

data. These settings are not obvious enough and only the advanced users would be aware of them.

Polars also recommends compiling Polars from the source with a Rust compiler for maximum

performance and using bigidx for rows exceeding 232 ≈4.2B rows. This paper therefore takes an

”out of the box” approach and only makes small changes such as the compression format of the

parquet ﬁles to ”snappy”, as the default diﬀers between the diﬀerent tools.

These tools also has diﬀerent implementations which are subtle but noticeable in comparisons.

An example is when sorting values, Polars defaults to NULL values being placed ﬁrst, while Pandas

places them last. This means in order to get the ten largest values, another operation is required for

Polars in order to drop the NULL values. Tools to address these issues exist, both Ibis and Narwhals

aims to standardize these operations and make them consistent between dataframe libraries and

SQL engines [49], [50]. At ﬁrst glance it might seem strange not to use Ibis to standardize the

code but there exists multiple problems, namely that the Ibis API does not support Pandas, Dask

or Daft and its API for Polars and Duckdb is limited.

The data collected from the experiments will be analyzed using Python/Pandas and the visu-

alization libraries Seaborn/Matplotlib. It will be transformed and processed with Pandas.

Mathias Asplin Big data analytics and processing

5. Ethical and Societal Considerations

While the data is sourced from Kaggle under the Open Data Commons Open Database License,

a license intended to allow users to freely share, modify, and use this Database while maintaining

this same freedom for others, there are other aspects to consider that goes beyond the sourcing

and usage of data [51].

With the emergence of AI systems based on data, there has been and will always be good and

bad actors. A bottleneck to perfect these AI systems lie in the ability to handle the vast amounts of

data. This has historically been done via expensive or large scale systems which requires resources

and expertise to use and maintain. This paper explores tools and ideas which drastically reduces

the complexity of these big data systems to be handled by non-experts without training beyond

the usage of libraries similar to Pandas or SQL databases.

Even with the lowered ceiling for handling data, it is far fetched to propose the idea that this

will lead to more bad actors. Handling big data requires data in the ﬁrst place and this is something

expensive to gather and store. The tools is not everything either, knowing how to handle data is

also a key aspect, from mathematical foundations to knowledge of how computers work. The bar

is lowered, but this is for those who are already familiar with tools such as Pandas.

On the positive aspects, the methods in this paper could drastically bring down costs and

computational wastefulness. Some of the authors of the tools brought up in this paper deﬁnes big

data as data that goes beyond their tools, which easily ranges up to the three digit terabytes or

petabytes. This is rare for companies, even large companies, to actually handle in an analytics

environment. This together with lowering the ceiling for handling big data for professionals enables

analytics that was previously impossible or diﬃcult to do. Because of the cost, a single or small

team of data scientists either could not or required big data skills and systems in order to perform

analytics on datasets that would not ﬁt into memory.

Mathias Asplin Big data analytics and processing

6. The experiments

6.1 Overview

One of the important questions of this paper is to answer whether the tools can perform the big

data analytics and processing at all, rather than how fast they are. This is why this paper does

not compare the results to distributed solutions such as Hadoop or Spark. Any vertical system

should in theory perform better than any distributed system should they be equal in resources.

This paper has therefore chosen the dataframe tools described in the backgrounds section. It

will explore whether the most widely used dataframe tool Pandas has advanced enough to be able

to perform big data tasks or if it still limited in its range. It will also introduce newer tools such

as Dask, Polars, Duckdb and Daft. These tools provide a multitude of settings, conﬁgurations and

options. While this paper does not explore each of them in detail, it will explore the diﬀerence

between eager and lazy execution as this is key to the ideas of out of core computing and streaming

queries.

The dataset used comes from the user Mcfurland on Kaggle, an open source competition plat-

form for data science and machine learning [51]. It is 7.36GB of multiple csv ﬁles containing

millions of songs, artists, labels and audio features from Beatport, an online music store, as well as

Spotify, a music streaming service. Only the Spotify portion of the dataset was used however as

the dataset was disjoint between the diﬀerent sources. The Spotify dataset contains 5.77M tracks

and 4.68M audio features.

The algorithms used explores four aspects, with the last part being the largest one. The

algorithms are chosen in such a way to represent realistic usage of the tools, rather than making

the operations overly diﬃcult or complex. Any complex workﬂow can be divided into subsections

regardless, so testing if these tools can handle multiple consecutive group by and joins does not

make much sense.

•Data ingestion

•Data wrangling

•Data I/O

•Data exploration/analytics

The versions of each library is listed as.

•dask 2024.11.0

•duckdb 1.1.3

•getdaft 0.3.11

•numpy 2.1.3

•pandas 2.2.3

•polars 1.12.0

•pyarrow 18.0.0

6.2 Data ingestion

Data ingestion is divided into two parts. The ﬁrst part reads the raw csv ﬁles and the second

part joins the tables together into one big table. For Pandas, diﬀerent options are tested, both

the newer PyArrow engine and datatype backend, as well as the older types of the Python and C

engines and NumPy backend.

Mathias Asplin Big data analytics and processing

6.3 Data wrangling

Data wrangling is divided into four parts, each cleaning the dataset. The ﬁrst part changes the

duration column from miliseconds into seconds. The second part turns the explicit column from

string into boolean. The third part ﬁlls missing datetime values and turns them into datetime

objects. The fourth part ﬁnds all artist names for tracks and combines them into the original

dataset, this is to make it easier to associate a track with all of its artists.

6.4 Data I/O

Data I/O is divided into four parts, except for Polars which has a rust implementation for their

parquet writer, which is tested separately. The ﬁrst part writes to csv ﬁle. The second part reads

from csv ﬁle. The third part writes to parquet, the fourth part reads from parquet. For Pandas,

only the fastest engine will be considered for this part.

6.5 Data exploration and analytics

Data exploration/analytics is divided into 18 parts. For Pandas, both the NumPy backend and

the PyArrow backends will be tested. This part will be tested on the original dataset, which

uncompressed is about 1.7GB and 7.7M rows. For the larger datasets, the original dataset till be

duplicated row wise. It will be tested on a medium dataset, which is 5x the original size, then on a

large dataset which is 30x the original size. Finally it will be tested on a very large dataset which

is 150x the original size. There is also a set of multiple ﬁles equivalent in size of the very large

dataset. The largest dataset is above 250GB uncompressed and has more than 1 billion rows of

data with 24 columns.

1. Find top 10 popular tracks except the artists above with tracks between 2010 and 2020 that

does not contain profanity, is more than 3 minutes long, is good at dancing to or has good

energy, with a higher tempo than mean, that is not performed live.

Operations: This query is primarily about ﬁltering. It ﬁlters a list of artists with a NOT

OR statement, a range of dates, NOT explicit, greater than or equal to integers and

ﬂoats, aggregations of means and less than or equals to. It also sorts the values in

descending order, providing the top 10 results.

2. Find the artists who has made the most tracks.

Operations: This query includes a group by, get size and sort values.

3. Find the artist that has made the most popular songs on average.

Operations: This query includes a group by, mean aggregate and sort values.

4. Find the artist who on average makes the most danceable tracks.

Operations: This query includes a group by, mean aggregate and sort values.

5. Find the artist that makes the longest tracks on average.

Operations: This query includes a group by, mean aggregate and sort values.

6. Find the artist who makes the happiest tracks on average.

Operations: This query includes a group by, mean aggregate and sort values.

7. Find the artist that collaborates with most record labels on average.

Operations: This query includes a group by, unique, size and sort values.

8. Find the artist that has the most tracks with profanity.

Operations: This query includes a group by, size and sort values.

Mathias Asplin Big data analytics and processing

9. Find the year with most popular tracks on average.

Operations: This query includes a group by, mean aggregate and sort values.

10. Find the artist with the most popular tracks on average per year.

Operations: This query includes a group by on two columns, mean aggregate, sort values

and unique.

11. Find the artist with the most track credits.

Operations: This query varied but in basic form it included unique, count char in string,

adding and sort values.

12. Find the artist most active throughout the years.

Operations: This query included group by, max aggregate, min aggregate, subtraction, sort

values.

13. Find the album type distribution.

Operations: This query varied but in basic form it included group by, size, sum and division.

14. Find the album type that is the most popular on average.

Operations: This query includes group by and mean aggregate.

15. Find the mean of diﬀerent measurements.

Operations: This query includes the mean of multiple columns.

16. Find the median of diﬀerent measurements.

Operations: This query includes the median of multiple columns.

17. Find the standard deviation of diﬀerent measurements.

Operations: This query includes the standard deviation of multiple columns.

18. Find out what makes a track popular.

Operations: This query includes a group by, mean aggregate on multiple columns and sort

values.

6.6 Details about tests

The tests for all parts will be run ﬁve times for the regular dataset sizes. The time to run the

medium, large, very large and many ﬁles datasets is considerably higher, so they will only be run

once. The tests will measure if the operations works or if there’s a limit in terms of memory,

essentially if the operation crashes or it is clear the tool is at its limits due to taking too long time,

they will not run for the larger datasets. Each operation will be timed with the walltime in seconds

as this is the most representative way of evaluating the performance for the whole system. For the

eager evaluation each part will be timed individually and for the lazy evaluation two of the four

parts will be timed as a whole, namely the data ingestion and data wrangling parts.

The libraries will be tested for both their in memory capabilities and out of memory capabilities.

While Dask and Daft are both only lazy, dask reduces the result to a Pandas dataframe while Daft

has its own format. Therefore Daft is also tested by trying to force it to be eager, essentially

evaluating the parts one at a time rather than all together. Pandas is eager only and will be

treated such. While Duckdb is lazy, when creating a table it converts the data into its own format.

This is how a database is expected to work, however in this case the only usage that is interesting

is the data analytics and processing. Therefore tables will be considered as the normal Duckdb

option and the Duckdb Views as the out of core version. These are in principle not the same as

Mathias Asplin Big data analytics and processing

eager versus lazy evaluation, but it is the closest way to practically evaluate them versus other

tools. Duckdb Views has limitations however, mainly related to data processing as it does not

support dropping columns. Lastly, the Daft library does not support writing single ﬁles, instead

it uses unique UUIDs for each ﬁle and the only option is to name the folder these ﬁles are written

to. Daft itself determines how many ﬁles it shall write.

Generally, each library has its strengths and weaknesses. For example neither Dask, Daft nor

Apache Spark supports calculating a median because they are made for distributed environments.

If there is an operation that a library does not support, it will simply be ignored or worked around

without using Python code.

Mathias Asplin Big data analytics and processing

7. Results

7.1 Data Ingestion

7.1.1 Pandas

Pandas stood out in options for both reading csv ﬁles and merging them together, having three

engines for reading/writing csv ﬁles, namely the Python implementation, the C implementation

and the PyArrow implementation. It also supports two backends, NumPy and PyArrow. PyArrow

seemed to be ran in parallel by default, which can be seen by its low ingestion time compared

to the Python and C alternatives according to ﬁgure 3. While the Python engine is the most

conﬁgurable, it also seems to be the slowest by a large margin, about 50x slower than the PyArrow

implementation and 3x slower than the C implementation. The C engine was about 15x slower than

the PyArrow engine which can not be fully explained by it simply being parallel as the environment

only has 4 cores. It seems like it also handles data better in other ways which explains why it was

over 7x faster when joining the tables. Overall the Pandas PyArrow implementations were on par

with the other libraries, both for reading the ﬁles and for merging them.

7.1.2 Polars

Polars seemed to fair better when combining the two operations of reading and merging the ﬁles,

almost a 2x improvement. This can be explained by its internal query optimization which it uses

for lazy evaluations but not for eager ones.

7.1.3 Duckdb

Duckdb evaluates everything lazily but it was not as fast as Polars, being even slower than Polars

eager evaluation. This can partially be explained by Duckdb’s internal mechanisms that converts

the data into its own formats, which possibly requires extra steps.

7.1.4 Daft

Daft did not seem to outperform neither Polars or Duckdb and trying to force it to evaluate

the reading and the merging separately did not matter for the performance, possibly due to not

optimizing the query plan as well as Polars.

7.1.5 Dask

Dask used the PyArrow implementation which was overall about 3x slower than Pandas. This

is both surprising and not at the same time. Dask was made for distributed computing so the

overhead might require more on a single machine, but it also is able to carry out the executions in

parallel. A possible explanation is that it simply did not carry out any executions in parallel while

having a lot of overhead.

7.2 Data Wrangling

7.2.1 Pandas

Pandas seemed to struggle a lot with making a column that listed all artists for the tracks. The

PyArrow version seemed to be slower than the NumPy version according to 4, which could be

because of a bad implementation of the formats or something internal with Pandas. For the other

operations it kept up to pace with the other libraries.

7.2.2 Polars

Polars did not seem to beneﬁt a lot from the query planning for the data wrangling. This is to be

expected as it is no guarantee of a speedup. Overall it was slightly faster than the eager evaluation

and faster than the other libraries.

Mathias Asplin Big data analytics and processing

Figure 3: Data Ingestion walltime average over 5 runs

7.2.3 Duckdb

Duckdb already seemed to interpret the string column of explicit as a boolean so that operation

could not be tested. It performed better than the other libraries, except for Polars.

7.2.4 Daft

Daft seemed to do better when doing all of the operations at once. Even though it was slower than

Polars, it is clear the query plan massively improved the execution time as it executed all of the

operations faster than the ﬁrst one.

7.2.5 Dask

Dask once again required a longer time to execute the query than Pandas, which is surprising

because all of the operations are embarrassingly parallel.

7.3 Data IO

7.3.1 Pandas

Pandas C engine csv reader and NumPy data type backend was slower than the other libraries

according to 6. However the PyArrow read and write implementations were on par with the other

libraries.

7.3.2 Polars

Polars has a rust implementation of the parquet reader and writer, but this had mixed results,

being slower to read but faster to write for the eager mode and the opposite for the lazy mode.

This could be explained by the lazy mode requiring to read and then write the result back to ﬁle

however, since the source was the ﬁle itself. There is a roughly 3x speed improvement when using

the rust implementation.

Mathias Asplin Big data analytics and processing

Figure 4: Data Wrangling walltime average over 5 runs

7.3.3 Duckdb

Duckdb did not stand out in many ways, being average all around both when it came to reading

and writing the two ﬁle formats.

7.3.4 Daft

Daft stood out when it was reading the parquet ﬁles, being more than 10x faster than the other

libraries. This is probably not true and Daft is simply reading the metadata from the parquet ﬁles

or only reading a part of the data.

7.3.5 Dask

Dask was again slower than Pandas, but this time not by a lot. The parquet reader and writer was

on par with the other libraries, but like with Pandas, this is probably attributed to the PyArrow

format.

7.3.6 Files

From ﬁgure 5the ﬁle sizes of each of the libraries seemed to be the same, with some slight variance.

The csv ﬁles were 1.7GB ±50MB except for Daft. For the parquet ﬁles they varied more. The

Duckdb and Dask ﬁles were 725MB, the Polars (Rust) was 623MB, the Pandas, Polars (PyArrow)

and Daft ﬁles were 515MB. All of the ﬁles had the same compression format, snappy. Overall

it seemed like the parquet ﬁles were about 3x smaller than the csv ﬁles when using the snappy

compression.

Mathias Asplin Big data analytics and processing

Figure 5: File sizes in megabytes

7.4 Data Ingestion, Wrangling and IO summary

7.4.1 General overview

The overview provided in ﬁgure 7shows how some libraries stand out more. Together with the

display of the best performance in ﬁgure 8it is clear that Polars outperform the other libraries

in a general case for the common tasks of input/output, cleaning data and sorting it. The only

exception is Daft being faster with reading and writing parquet ﬁles but this is not necessarily the

case as previously stated.

Even though Dask and Daft are distributed libraries, Daft seems to outperform Dask for these

benchmarks in a single node environment. Both Duckdb and Daft seem to be generally slower

than Polars but only by a few factors, rather than the 100x diﬀerence between Polars and Dask

for the Data Wrangling tasks for example.

7.4.2 CPU utilization

When looking at an overview of the CPU utilization in ﬁgure 9and the best performance in ﬁgure

10 it gives another angle of how these libraries uses resources. High CPU utilization can both be

a good or a bad thing, but in this context, the mere ability to fully use the CPUs available is a

good sign. While both Polars and Duckdb has good CPU utilization, Polars is able to optimize the

queries to make them run faster or alternatively, the Duckdb implementation of this benchmark is

not optimal.

Mathias Asplin Big data analytics and processing

Figure 6: Data IO walltime average over 5 runs

7.5 Data Exploration/Analytics

7.5.1 Small dataset

The small dataset in ﬁgure 11 is the smallest tested and the only dataset to entirely ﬁt into memory.

It contained roughly 7.7M rows and common for all datasets, 24 columns. For the eager dataframes

it was ﬁrst loaded into memory. For the lazy dataframes it was sourced from a parquet ﬁle. Daft in

memory and out of core solutions would work the same, which is why only one version is present.

At ﬁrst glance it is clear that Dask is considerably slower than the other libraries. Theoretically

this should not persist due to Dask being able to parallelize the computations. Being strictly worse

than Pandas, no matter the implementation, seems odd. There are possibilities of misconﬁgurations

but they should be this evident.

Duckdb was faster than Polars this time around, for almost all tasks. This could be evident by

the fact that Duckdb is not made for data engineering tasks, to clean data, order it around and

merge ﬁles, but is instead built as an OLAP which these analytics tasks represent much better.

Polars on the other hand is made for both, which is why it is not as well optimized as Duckdb but

is also able to outperform Duckdb with the data engineering tasks.

While the in memory Duckdb implementation was faster than the out of core one, Polars was

the opposite with the out of core implementation being faster than the in memory version for

most tasks. Notably only the mean, median and standard deviation aggregations were slower.

This could be explained by Polars being able to leverage the metadata from parquet ﬁles more

eﬀectively whereas Duckdb might have similar indexing mechanisms internally.

Daft could not perform task 16 due to median not being implemented and Daft seemed to

struggle with task 7 too, which was considerably slower than the Pandas implementation. The

task itself was not unusual either, it had simple operations such as unique, size and sort values.

It is probably the unique operation Daft struggles with. As it is a distributed framework, it has

Mathias Asplin Big data analytics and processing

Figure 7: Data summary for ingestion, wrangling and I/O

to optimize for a distributed solution which has poor performance on a single node. For the other

operations Daft seemed to keep up with Polars.

7.5.2 Medium dataset

The medium dataset in ﬁgure 12 had roughly 38.5M rows and in its compressed form required

3.6GB, which would be roughly 11GB in a csv format. This dataset does not ﬁt into memory as

it would require more than 20GB of RAM. Because of this, Pandas can not handle the data or

perform any analytics on it.

Polars seems to struggle with the in memory implementation. While Polars can swap to disk,

it seems to be an expensive operation. There are some tasks it keeps up with Duckdb and Daft

on but in general it is slower. The out of core implementation seems to be slightly better than

Daft on this scale, with a few tasks that takes a long time to complete such as task 1 and task 11.

Overall it seems like the out of core implementation performs better than the in memory one at

this scale, but it highly depends on the task.

Duckdb manages to be performant on this scale but the in memory implementation struggles

with task 11 and the out of core implementation shows sign of struggling but manages to complete

the query as second fastest.

Daft can not complete task 7 which is no surprise considering it struggled on the small dataset.

Other than that it seemed to keep up pace with Polars and was about 2-3x slower than Duckdb

overall.

Dask is starting to perform better, even outperforming Polars for some tasks, while struggling

for others. Task 7 requires over 24 minutes to complete, compared to Duckdb which only requires

roughly 5 seconds.

Mathias Asplin Big data analytics and processing

Figure 8: Data summary best performance for ingestion, wrangling and I/O

7.5.3 Large dataset

The large dataset in ﬁgure 13 had roughly 192.5M rows and required 18.1GB in its compressed

format. This is roughly 54GB in a csv format.

Both Polars in memory implementation and Dask can not run the queries at this scale. Polars

out of core implementation is able to run the queries but much like the Duckdb in memory im-

plementation, they struggle with them. Both Polars and Duckdb are unable to run task 11 and

Polars out of core is unable to run task 1. Daft is able to run the queries but it struggles on task

1. It is the only library that can run task 11 at this scale too.

7.5.4 Very large dataset

The very large dataset in ﬁgure 14 had roughly 1.15B rows and required 108.8GB to store in its

compressed form. This is roughly 326GB in a csv format.

Only the Duckdb out of core implementation and Daft are able to run queries. Duckdb is

unable to run tasks 11 and 16 while Daft is unable to run tasks 1, 7, 11 and 16. Both frameworks

seems to be on par with eachother and the only big diﬀerence is on task 8 where Daft is very slow

compared to Duckdb.

7.5.5 Many ﬁles dataset

The very large dataset split up into 150 ﬁles in ﬁgure 15 had the same amount of rows and required

the same amount of space as the very large ﬁle.

Both Duckdb and Daft were able to run all of the queries in relative terms about the same

time. The only major diﬀerence is task 8 for Daft which is now about 15x faster.

Mathias Asplin Big data analytics and processing

Figure 9: Data summary of CPU utilization for ingestion, wrangling and I/O as a factor of CPU

time divided by walltime

7.6 Data Analytics summary

7.6.1 General overview

When looking at the overall summary between the diﬀerent tools in ﬁgure 16 it is much clearer

that Duckdb is overall faster until the dataset becomes very large, at which point Daft catches up.

Daft had a query when sourcing from a single very large ﬁle that it struggled with, which makes

it seem like multiple ﬁles is a lot faster, but it is only that single query that stands out.

Even though it seems like the medium dataset is faster than the small dataset for some tools,

this is not the case as Daft did not manage to complete one of the tasks which is present in the

small dataset. The results are calculated between the tools rather than between the ﬁle sizes.

When looking at the individual performance summary in ﬁgure 17 it is very clear that Duckdb

is fastest in the majority of queries and it is only when the data size becomes very large that Daft

becomes faster. Relatively speaking, they are very similar in speed however.

7.6.2 CPU utilization

In the ﬁgure 18 for the CPU utilization there is a clear trend of both Polars and Duckdb spending

less time utilizing the CPU as the datasets grow larger. Strangely enough, both Dask and Daft

seem to increase their CPU utilization, but as the dataset becomes very large Daft once again

spends less with utilizing the CPU.

When observing ﬁgure 19 it is clear that Duckdb is able to utilize the CPU better than other

tools. While this did not lead to better performance than Polars for the data ingestion, wrangling

and I/O tests, it seemed to contribute to the better performance for analytics. It is only when the

datasets become large and very large that Daft seem to utilize the CPUs better, but like Polars

Mathias Asplin Big data analytics and processing

Figure 10: Data summary of best CPU utilization performance for ingestion, wrangling and I/O

and Duckdb, this did not lead to a performance increase over Duckdb for the large dataset, only

for the very large dataset.

7.6.3 Walltime diﬀerence between datasets

When observing the ﬁgures 20,21,22,23 there is a clear trend of a non-linear growth as the

dataset becomes larger. There are some exceptions that can be explained by the tools or ﬁles

holding metadata. If the increase is high it is a good sign that the tool is struggling to handle the

dataset and queries. Not every tool follows the same pattern. Dask seems to be faster at analyzing

the medium dataset than the small dataset. The small dataset was loaded into memory as it ﬁts

in memory so Dask might perform better if lazily loaded from parquet ﬁles instead. Daft seems to

have the slowest growth, indicating an ability to handle larger datasets. However, Daft is also the

library that is unable to complete the most queries.

Mathias Asplin Big data analytics and processing

Figure 11: Data Analytics small dataset walltime average over 5 runs

Figure 12: Data Analytics medium dataset walltime for one run

Mathias Asplin Big data analytics and processing

Figure 13: Data Analytics large dataset walltime for one run

Figure 14: Data Analytics very large dataset walltime for one run

Mathias Asplin Big data analytics and processing

Figure 15: Data Analytics many ﬁles dataset walltime for one run

Figure 16: Data Analytics summary for all runs, compared between datasets

Mathias Asplin Big data analytics and processing

Figure 17: Data Analytics best performance summary for all runs, compared between datasets

Figure 18: Data Analytics summary of CPU utilization as a factor of CPU time divided by walltime

for each dataset size

Mathias Asplin Big data analytics and processing

Figure 19: Data Analytics summary of best CPU utilization performance between the tools for

each dataset size

Figure 20: Data Analytics time increase as a factor when going from a small dataset to a medium

dataset. A growth of 5x means it is linear.

Mathias Asplin Big data analytics and processing

Figure 21: Data Analytics time increase as a factor when going from a medium dataset to a large

dataset. A growth of 5x means it is linear.

Figure 22: Data Analytics time increase as a factor when going from a large dataset to a very large

dataset. A growth of 6x means it is linear.

Mathias Asplin Big data analytics and processing

Figure 23: Data Analytics time increase as a factor when going from a large dataset to a very large

dataset in many ﬁles. A growth of 6x means it is linear.

Mathias Asplin Big data analytics and processing

8. Discussion

8.1 Benchmarking

8.1.1 Overview

This paper only scratched the surface of the possibilities of these tools in an environment that is

realistic for data exploration rather than automated pipelines. As such, the results are limited in

scope and resources. While these tools can scale, that also depends on the environment. The tools

can handle data that is larger than memory and while some queries begin to struggle regardless of

tool, the dataset has to be many times the size of RAM before degradation is noticeable.

A quick summary of the results would boil down to Polars being seemingly better at data

engineering tasks while Duckdb and Daft are able to scale the analytics better This simply scratches

the surface of the how these tools can be used. Parquet is a major component not only in storing

the data in a compressed format so it requires less space but it also provides metadata that allows

these tools to more eﬀectively query the data.

The tools also have many diﬀerent settings, options and conﬁgurations to use but without

expertise it is diﬃcult to fully utilize them. Polars would probably be more performant if the

streaming mode is used and while it is one conﬁguration, it is not entirely obvious if the lazy

mode does not already use streaming mode. Duckdb converts the data into its own internal format

by default when creating a table, but when creating a view it uses the format of the ﬁles. Even

though this diﬀerence is not obvious, it was required to even measure the out of core performance

of Duckdb because unlike Polars, it is not ambiguous.

8.1.2 Continous improvements

It is important to emphasize that these tools are constantly being improved. They are still very

new and Polars is public about their roadmap to improve the tool while Duckdb, Daft and Dask

either do not have one or it is not public [52]. These improvements can be big, such as GPU

support, a better streaming engine or a better parquet reader. They can also be small, such as

supporting more API operations or more options for the API. As these tools mature they will

become more integrated into their environments and become more stable.

8.2 Practical considerations

Speed and ability to perform out of core computing are two of the most important aspects when

working with big data but they are not the only aspects. A part of the results has shown that

Pandas, while created in 2008, also improves constantly with the integration of the PyArrow

backend and PyArrow parquet readers in 2022 which shows massive speed improvements with

I/O operations compared to NumPy [53]. In fact, reading csv ﬁles proved to be fastest with the

PyArrow backend in Pandas, barely being faster than the Polars version. If the use case is ingesting

ﬁles, joining them and writing to ﬁles it seems like a good option, especially considering Pandas

has many options in the csv reader that can be used to handle data from the source. These options

might be necessary for messy data or data in encodings that the newer tools does not support yet.

Dask, while seemingly the worst alternative, is able to support almost all of Pandas API

operations and perform them as an out of core solution even on a single machine. This is great for

backwards compatibility with earlier systems built with Pandas.

Many of the principles can be applied to distributed computing too. Streaming as a concept and

parquet ﬁles are not unique to these tools. The advantages come in the simplistic implementation

of the tools. Rather than setting up Hadoop backend with Spark runtimes on multiple machines

which may or may not have enough memory in each node to run the query, it can be streamed

inside a single node. The overhead both in the skills required to write and run the programs is

Mathias Asplin Big data analytics and processing

almost completely removed. If a data analyst or data engineer knows of Pandas or SQL, they can

use Polars, Duckdb or Daft to scale the data analytics or processing needs.

8.3 Practical limitations

Major limitations in this paper was the storage and computing capacity. Storing hundreds of

gigabytes of data is diﬃcult to do even on modern computers. The total size of the data ﬁles is

250GB in its compressed form and the author could not reliably store it on a laptop and run it, due

to the hard drive not having enough space. The tools, even if they use streaming mode, can swap

data to disk, which lead to memory issues. Testing how well the parquet format holds against other

formats such as csv, hdf5, feather, pickle and avro is interesting but there is already consensus that

parquet is more performant, better at compression and holds metadata that is useful for analytics

workloads [54], [55]. Parquet is not perfect however and there are other formats that competes

with it such as Apache ORC and Lance. These ﬁle formats are either not supported by the tools or

do not have integrations with them yet. The Lance format, while optimized for Machine Learning

workloads, promises >100x speed improvement over parquet for random access workloads [56].

Computing capacity was also a limitation because of the reason stated above. The vertical

scaling is supposed to be simulated by the increase in workload sizes but this is not entirely

accurate. It would be interesting to test the datasets and much larger ones, going into the 10s or

100s of terabyte range, be tested in a cloud environment. The advancements of cloud technology

enables machines with tens of terabytes of RAM and hundreds of cores. For embarassingly parallel

workloads that data analytics usually are, they should give a considerable speedup and be able to

scale the workloads in a performant manner. If 4 cores and roughly 6GB of usage RAM is enough

to perform most queries on datasets with over 1B rows, using 100 cores and 600GB RAM should be

able to handle over 100B of rows, in theory. The cloud also has the advantage of networked disks

with object store that is able to store an unlimited amount of ﬁles according to AWS. Realistically,

storing large datasets as one ﬁle carries no real advantage and instead it could be divided into

multiple ﬁles, for example organized in folders according to date if it is time series data.

This paper does not include any programmer optimizations that is usually done in big data

environments. This can be separating ﬁles into a folder hierarchy and combining them for a query,

batch processing ﬁles, storing metadata about the data or using mathematical formulas to avoid

expensive computations. Again, while it is interesting to explore exactly how fast one can query

over a billion rows, it is not as necessary as exploring the baseline and workloads under stress.

8.4 Costs and relation to cloud computing

When discussing the cost there are many things to consider. The ﬁrst and obvious part is how a

decrease in execution time leads to lower costs and the second obvious part is using the smallest

instance possible. On-premise solutions would not be bound to cost but merely speed and if the

query can be executed. For cloud computing solutions it gets more complicated. Apart from

Pandas and Polars, where a GPU would be relevant, the important aspects to consider is the

amount of CPU cores, primary memory as well as networking if the data is stored in the object

storage S3, which is the most typical scenario.

Since the cost scales relatively linear with the amount of cores and memory available on the

instance, a simple way to reduce the cost even further is to use a larger instance. According to

ﬁgures 20,21,22,23 the scaling is non-linear in a general case as the dataset grows, which means

using a larger instance will lower the costs. Taking Duckdb out of core as an example, the scaling

was roughly linear at 5x the size of the data, meaning a median of 5x longer walltime to compute.

As the dataset grew by 5x again, the median walltime jumped to around 9x. As the dataset grew

by 6x, the median walltime is at around 22x. It seems like a cost optimal size is a few times larger

than RAM memory before the walltime begins to increase more rapidly.

Mathias Asplin Big data analytics and processing

When looking at available EC2 instances, AWS lists u7inh-32tb.480xlarge which has up to 1920

virtual CPUs and 32.768TB of memory. By doing approximate and rough calculations, assuming

the ratio stays relatively the same, such a machine could handle over one petabytes of data and a

few trillion rows of the same dataset. This is also the size Spark works reliably within [57].

8.5 New perspectives on big data

Much like Fivetran’s study, other analysis on big data systems have concluded the same thing,

the vast majority of use cases are not big data at all and do not require big data platforms [8],

[58]–[60]. Motherduck, a company that provides cloud database services using Duckdb claims the

following based on a paper from the Redshift team at AWS [45].

•95% of databases don’t qualify as Big Data and 99.98% of users never run big data queries.

•When people have big data, 99% of their queries are against smaller tables.

•99.5% of queries over “big data” tables, query “small data” amounts.

•From the above results, you need to use big data tools 0.5% of 1% of 5% of the time, which

is something like the “big data 0.00025%”.

This puts big data in companies into perspective for real world use cases. If companies uses

the same solution and heavier cost for each of these use cases they have a lot of money to save on

understanding their data pipelines.

Motherduck’s deﬁnition of big data diﬀers strongly as they consider 10TB of data ”small”.

This can mean databases with tables that has billions of rows which could require complex joins

with other tables. They do not consider data processing considerations either, as the data could

be stored in data lakes and only reﬁned data is stored in a data warehouse like Redshift.

Healey K. write his opinions on the issue in a blog post [60].

“Both [Snowﬂake, Databricks] of these platforms are charging their customers for Fer-

raris when most companies would be happy with a Toyota Camry with some After Mar-

ket parts. Don’t get me wrong, both platforms are doing impressive things; Snowﬂake’s

Data Cloud is really coming to life with their partnership with NVIDIA and Databricks

is putting runs on the board with the Spark Human API. These are not reasons you

should use their platforms. These are marketing gimmicks to garner the attention of

the decision-makers to loosen the company’s purse strings. This is a fundamental mis-

take in the way businesses operate. Pay to make a problem disappear. We operate as if

these platforms solve problems when fundamentally they do not and often exacerbate

large ineﬃciencies that already exist within an organization. If you want to make a big

impact be Anti-Hype.”

Other blog posts agree with Healey such as Sadeghi A who also refers to Motherduck and the

Redshift team at AWS and writes as a conclusion [45], [58].

“The rise of single-node processing engines represents a pragmatic shift in data ana-

lytics. As hardware capabilities continue to advance and single-node engines become

more sophisticated, the need for distributed processing will likely continue to decrease

for most organisations.

For the vast majority of companies, single-node processing frameworks oﬀer a more

eﬃcient, cost-eﬀective, and manageable solution to their data analytics needs. As we

move forward, the key is not to automatically reach for distributed solutions, but to

carefully evaluate actual workload requirements and choose the right tool for the job.

The future of data processing may well be less about managing clusters and more about

leveraging the impressive capabilities of modern single-node systems.”

Mathias Asplin Big data analytics and processing

Sadeghi goes further in depth regarding the aging of data and workload patterns, stating that

based on Meta and eBay data access patterns, most data is accessed in the ﬁrst 48 hours by ETL

pipelines and 95% of queries on the data are executed within the ﬁrst 30 days. They further

emphasize that.

“This pattern holds remarkably consistent across industries and use cases. Even in

organisations with large datasets, most analytical workloads operate on recent, aggre-

gated data that could easily ﬁt within single-node processing capabilities.”

8.6 Business considerations and dynamic data storage

The reason why companies go for solutions such as Snowﬂake and Databricks depends but it is

clear that costs and complexity can be drastically reduced by leveraging a solution that enables

data storage scaling while also being able to both utilize vertical and horizontal scaling methods for

the data processing and analytics. The downsides of data warehouses is that for lower data loads

there still needs to be an instance running. The downsides of managed platforms like Snowﬂake

and Databricks is being bound to their pricing model and compute engine.

Apache Iceberg is an example of an analytic table format that works as a data lakehouse [28].

It enables separation of storage and compute which essentially means data lake in the back and

data warehouse in the front. Such formats provides the means of using the simple approach of

vertical scaling for most query needs and only going horizontally when absolutely required.

While there are additional data management means that are useful, this further emphasizes

how the perspective on big data engineering is constantly shifting and why a dynamic approach to

data storage and processing is needed. Motherduck refers to trends going back to ”small data”, this

paper proposes that Polars, Duckdb and Daft enables big data analytics and processing everywhere

[59].

It is not just the fast evolution of cloud services and hardware that has made companies slow

to adapt to vertical scaling. Data warehouses are usually locked in solutions or vendor speciﬁc,

making a migration diﬃcult. In addition to this, a ”de facto” standard of Hadoop and Spark has

made companies comfortable with the tools even for smaller scale operations. Horizontal scaling

will deﬁnitely be relevant in the future, but a new mindset of how data is stored and how it can

be accessed by diﬀerent compute engines is needed for vertical scaling big data tools to become

standardized.

8.7 Big data in universities

A ﬁnal aspect to consider is what is taught in universities. M¨alardalen’s University (MDU) has

multiple courses on the topic of big data, either as a complement for another course component

or as a core aspect of the course itself. Out of the courses available on MDU’s website, only two

mentions the tools used for big data and both of those refer to Hadoop and Spark. One of which

even writes “The de facto Big Data processing frameworks: Hadoop and MapReduce”. While

teaching about distributed tools and methods is important, not exploring the single node methods

and pushing them to their limits gives a false view of how big data processing and analytics works

not just in theory but in practice too. This is not unique to MDU either. Uppsala University

(UU) and The Royal Institute of Technology (KTH) also has courses designed in a similar fashion

[61]–[64].

Mathias Asplin Big data analytics and processing

9. Conclusions

9.1 General conclusions

Multiple new tools and concepts have emerged in recent years to improve on parallel and distributed

computing for big data analytics and processing. These tools oﬀers new ways of processing data

by introducing out of core computing concepts which can eﬀectively work with data larger than

memory by combining new ﬁle formats that can both store metadata that helps with analytics

workloads and compress the data to take an average 3x less space compared to the csv format.

By using a streaming or hybrid streaming approach tools such as Polars, Duckdb, Daft and to

some extent Dask are able to analyze and process larger than RAM datasets. On a 4 core, 16GB

RAM machine in a busy environment where around 8GB memory is used by other processes,

Duckdb and Daft managed to do analytics and process a parquet ﬁle with the size of above 100GB

as well as 150 parquet ﬁles totalling about 100GB. These ﬁles contained above 1 billion rows.

9.2 Pandas evolution

Pandas, when using PyArrow backend, managed an above 10x speedup of computations such as

I/O of csv ﬁles and dataframe joins compared to the NumPy backend. For analytics and data

processing operations the PyArrow backend is faster for most operations involving numeric data

types but not string data types. Pandas still can not handle larger than memory datasets however.

9.3 Polars

Polars, a new dataframe tool with both eager and lazy modes, was able to perform better than

the other tools for basic data engineering tasks such as data ingestion, data wrangling and data

I/O. Polars is also able to handle larger than memory datasets but as the dataset grew larger it

was unable to keep up with Duckdb and Daft.

9.4 Duckdb and Daft

Both Duckdb and Daft were able to handle very large datasets, both from single ﬁles and multiple

ﬁles. While Daft performed worse than the other tools for the small datasets, it generally performed

the best for very large datasets. Duckdb proved itself in being very performant both for small and

very large datasets. The tools had their limitations on certain query tasks, with Daft being unable

to perform 4 out of 18 and Duckdb being unable to perform 2 out of 18.

9.5 Paired with cloud computing

With advancements in cloud computing, vendors such as Amazon Web Services now oﬀers single

node machines with up to thousands of cores and tens of terabytes of RAM. In theory, the new

tools and ﬁle format in combination with services such as EC2 and S3 are able to handle datasizes

up to one petabyte and trillions of rows.

Together, tools such as Polars, Duckdb and Daft are able to perform data analytics and process-

ing that previously was only possible in a distributed environment such as Hadoop or Spark. This

can in theory reduce code complexity, environment complexity, computing costs and dependence

on 3rd party services such as SAP, Snowﬂake and Databricks while still maintaining the ability to

handle big data eﬀectively. The tools also allow for a lower skill ceiling enabling data analysts, data

scientists and data engineers without knowledge of distributed systems and solutions to handle big

data.

9.6 Future Work

This paper focused on a general overview of whether vertical scaling and out of core processing

is a possibility for big data analytics and processing. As such, it included tools not designed for

Mathias Asplin Big data analytics and processing

big data processing or vertical scaling such as Pandas and Dask in order to give a fair comparison

to what was available when distributed computing became a defacto standard. In future work,

exploring the three tools that performed the best, Polars, Duckdb and Daft would be beneﬁcial.

A few examples include.

•Comparisons with distributed frameworks such as Spark and Dask on equal resources.

•Comparing Daft’s performance in a distributed environment versus a single node on equal

resources.

•Comparing numeric operations versus string operations.

•Explore these tools ability to process data on larger datasets above 10TB both on smaller

and larger instances.

•Explore diﬀerent operations such as joins on tables above one billion rows.

•Explore the SQL interfaces of Polars and Daft.

Mathias Asplin Big data analytics and processing

References

[1] The world’s most valuable resource is no longer oil, but data, Accessed: 2025-01-05. [Online].

Available: https : / / www . economist . com / leaders / 2017 / 05 / 06 / the - worlds - most -

valuable-resource-is-no-longer-oil-but-data.

[2] A short history of big data, Accessed: 2024-12-22. [Online]. Available: https://www.bigdataframework.

org/knowledge/a-short-history-of-big-data/.

[3] The most common problems companies are facing with their big data analytics, Accessed:

2024-12-22. [Online]. Available: https://barc.com/challenges-big-data-analytics/.

[4] Top 5 challenges for hadoop mapreduce in the enterprise, Accessed: 2024-12-22. [Online].

Available: https://www.bigdatawire.com/2012/10/29/top_5_challenges_for_hadoop_

mapreduce_in_the_enterprise/.

[5] 5 common performance problems in apache spark: How they impact your job execution time,

Accessed: 2024-12-22. [Online]. Available: https://vivekjadhavr.medium.com/5-common-

performance- problems- in- apache- spark- how- they- impact- your- job- execution-

time-850974e9af24.

[6] Key concepts & architecture, Accessed: 2025-01-05. [Online]. Available: https : / / docs .

snowflake.com/en/user-guide/intro-key-concepts.

[7] Supercomputing vs. distributed computing: A government primer, Accessed: 2025-01-05. [On-

line]. Available: https://datainnovation.org/2014/01/supercomputing-vs-distributed-

computing-a-government-primer/.

[8] How do people use snowﬂake and redshift? Accessed: 2025-01-05. [Online]. Available: https:

//www.fivetran.com/blog/how-do-people-use-snowflake-and-redshift.

[9] Amazon ec2 instance types, Accessed: 2024-12-22. [Online]. Available: https://aws.amazon.

com/ec2/instance-types/.

[10] What is big data? Accessed: 2024-09-29. [Online]. Available: https://www.oracle.com/big-

data/what-is-big-data/.

[11] Big data: What it is and what you need to know, Accessed: 2024-09-29. [Online]. Available:

https://www.informatica.com/se/resources/articles/what-is-big-data.html.

[12] What is big data? Accessed: 2024-09-29. [Online]. Available: https://cloud.google.com/

learn/what-is-big-data.

[13] Google’s infringenovation secrets, Accessed: 2025-01-05. [Online]. Available: https://www.

forbes.com/sites/scottcleland/2011/10/03/googles- infringenovation- secrets/

?sh=4add793f30a6.

[14] The history of hadoop, Accessed: 2025-01-05. [Online]. Available: https:// medium.com/

@markobonaci/the-history-of-hadoop-68984a11704.

[15] Apache spark and hadoop working together, Accessed: 2025-01-05. [Online]. Available: https:

//www.databricks.com/blog/2014/01/21/spark-and-hadoop.html.

[16] Apache spark history, Accessed: 2025-01-05. [Online]. Available: https://spark.apache.

org/history.html.

[17] Excel speciﬁcations and limits, Accessed: 2025-01-05. [Online]. Available: https://support.

microsoft.com/en- us/office/excel- specifications- and- limits- 1672b34d- 7043-

467e-8e27-269d656771c3.

[18] What is the python global interpreter lock (gil)? Accessed: 2025-01-05. [Online]. Available:

https://realpython.com/python-gil/.

[19] Pandas, Accessed: 2025-01-06. [Online]. Available: https : / / www . nvidia . com / en - us /

glossary/pandas-python/.

[20] Numpy, Accessed: 2025-01-05. [Online]. Available: https://numpy.org/.

[21] W. McKinney, “Data structures for statistical computing in python,” Jan. 2010, pp. 56–61.

doi:10.25080/Majora-92bf1922-00a.

Mathias Asplin Big data analytics and processing

[22] The 10 leading cloud service providers of 2024, Accessed: 2025-01-06. [Online]. Available:

https://www.cloudoptimo.com/blog/the-10-leading-cloud-service-providers-of-

2024/.

[23] Amazon s3, Accessed: 2025-01-06. [Online]. Available: https://aws.amazon.com/s3/.

[24] M. Zaharia, A. G. 0002, R. Xin, and M. Armbrust, “Lakehouse: A new generation of open

platforms that unify data warehousing and advanced analytics,” in 11th Conference on In-

novative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online

Proceedings, www.cidrdb.org, 2021. [Online]. Available: http : / /cidrdb.org/cidr2021 /

papers/cidr2021_paper17.pdf.

[25] Amazon ec2, Accessed: 2025-01-06. [Online]. Available: https://aws.amazon.com/ec2/.

[26] Aws fargate, Accessed: 2025-01-06. [Online]. Available: https://aws.amazon.com/fargate/.

[27] What’s the diﬀerence between olap and oltp? Accessed: 2025-01-05. [Online]. Available: https:

//aws.amazon.com/compare/the-difference-between-olap-and-oltp/.

[28] Apache iceberg, Accessed: 2025-01-06. [Online]. Available: https://iceberg.apache.org/.

[29] Parallel python fast and easy, Accessed: 2025-01-06. [Online]. Available: https://www.dask.

org/.

[30] Dask, Accessed: 2025-01-06. [Online]. Available: https://github.com/dask/dask.

[31] M. Rocklin, “Dask: Parallel computation with blocked algorithms and task scheduling,” Jan.

2015, pp. 126–132. doi:10.25080/Majora-7b98e3ed-013.

[32] Duckdb, Accessed: 2025-01-06. [Online]. Available: https://duckdb.org/.

[33] Duckdb — what’s the hype about? Accessed: 2025-01-06. [Online]. Available: https : / /

betterprogramming.pub/duckdb-whats-the-hype-about-5d46aaa73196.

[34] M. Raasveldt and H. M¨uhleisen, “Duckdb: An embeddable analytical database,” in Pro-

ceedings of the 2019 International Conference on Management of Data, ser. SIGMOD ’19,

Amsterdam, Netherlands: Association for Computing Machinery, 2019, pp. 1981–1984, isbn:

9781450356435. doi:10.1145/3299869.3320212. [Online]. Available: https://doi.org/

10.1145/3299869.3320212.

[35] Dataframes for the new era, Accessed: 2025-01-06. [Online]. Available: https://pola.rs/.

[36] R. Vink, S. de Gooijer, A. Beedie, et al.,Pola-rs/polars: Python polars 1.19.0, version py-

1.19.0, Jan. 2025. doi:10.5281/zenodo.14597402. [Online]. Available: https://doi.org/

10.5281/zenodo.14597402.

[37] Uniﬁed engine for data engineering, analytics & ml/ai, Accessed: 2025-01-06. [Online]. Avail-

able: https://www.getdaft.io/.

[38] What is parquet? Accessed: 2025-01-06. [Online]. Available: https://www.databricks.com/

glossary/what-is-parquet.

[39] Apache arrow overview, Accessed: 2025-01-06. [Online]. Available: https://arrow.apache.

org/overview/.

[40] Eager evaluation, Accessed: 2025-01-11. [Online]. Available: https : / /www.dremio.com /

wiki/eager-evaluation/.

[41] The history of databases, Accessed: 2025-01-06. [Online]. Available: https://www.thinkautomation.

com/histories/the-history-of-databases.

[42] K. Ayushman, “Faster reading with duckdb and arrow ﬂight on hopsworks: Benchmark

and performance evaluation of oﬄine feature stores,” M.S. thesis, KTH Royal Institute of

Technology, 2023. [Online]. Available: https://urn.kb.se/resolve?urn=urn:nbn:se:

kth:diva-337297.

[43] R. Atwal, P. Boncz, R. Boyd, et al., “Motherduck: Duckdb in the cloud and in the client,”

[Online]. Available: https://www.cidrdb.org/cidr2024/papers/p46-atwal.pdf.

[44] L. Kuiper, M. Raasveldt, and H. M¨uhleisen, “Eﬃcient external sorting in duckdb,” [Online].

Available: https://ceur-ws.org/Vol-3163/BICOD21_paper_9.pdf.

Mathias Asplin Big data analytics and processing

[45] A. van Renen, D. Horn, P. Pfeil, et al., “Why tpc is not enough: An analysis of the amazon

redshift ﬂeet,” in VLDB 2024, 2024. [Online]. Available: https://www.amazon.science/

publications / why - tpc - is - not - enough - an - analysis - of - the - amazon - redshift -

fleet.

[46] F. R. Calkoen, A. P. Luijendijk, K. Vos, E. Kras, and F. Baart, “Enabling coastal analytics

at planetary scale,” Environmental Modelling & Software, vol. 183, p. 106 257, 2025, issn:

1364-8152. doi:https://doi.org/10.1016/j.envsoft.2024.106257. [Online]. Available:

https://www.sciencedirect.com/science/article/pii/S1364815224003189.

[47] M. Bakken and A. Soylu, “Chrontext: Portable sparql queries over contextualised time series

data in industrial settings,” Expert Systems with Applications, vol. 226, p. 120 149, 2023, issn:

0957-4174. doi:https://doi.org/10.1016/j.eswa.2023.120149. [Online]. Available:

https://www.sciencedirect.com/science/article/pii/S0957417423006516.

[48] J. Wang, C. Xu, J. Zhang, and R. Zhong, “Big data analytics for intelligent manufacturing

systems: A review,” Journal of Manufacturing Systems, vol. 62, pp. 738–752, 2022, issn:

0278-6125. doi:https://doi.org/10.1016/j.jmsy.2021.03.005. [Online]. Available:

https://www.sciencedirect.com/science/article/pii/S0278612521000601.

[49] Ibis, Accessed: 2024-12-27. [Online]. Available: https://ibis-project.org/.

[50] Narwhals, Accessed: 2024-12-27. [Online]. Available: https : / / github . com / narwhals -

dev/narwhals.

[51] mcfurland, 10+ m. beatport tracks / spotify audio features, 2023. [Online]. Available: https:

//www.kaggle.com/ datasets / mcfurland / 10- m- beatport- tracks- spotify- audio-

features.

[52] Ritchie vink - polars 1.0 and beyond — pydata amsterdam 2024, Accessed: 2025-01-11. [On-

line]. Available: https://www.youtube.com/watch?v=yYAVrVMGaMY.

[53] Release notes, Accessed: 2025-01-11. [Online]. Available: https:// pandas .pydata.org /

docs/whatsnew/index.html.

[54] Avro vs. parquet, Accessed: 2025-01-11. [Online]. Available: https://www.snowflake.com/

trending/avro-vs-parquet/.

[55] Choosing the right big data ﬁle format: Avro vs. parquet vs. orc, Accessed: 2025-01-11. [On-

line]. Available: https://blog.det.life/choosing-the-right-big-data-file-format-

avro-vs-parquet-vs-orc-c868ffbe5a4e.

[56] Nimble and lance: The parquet killers, Accessed: 2025-01-11. [Online]. Available: https :

//materializedview.io/p/nimble-and-lance-parquet-killers.

[57] Apache spark faq, Accessed: 2025-01-11. [Online]. Available: https://spark.apache.org/

faq.html.

[58] Redshift ﬁles: The hunt for big data, Accessed: 2025-01-12. [Online]. Available: https://

motherduck.com/blog/redshift-files-hunt-for-big-data/.

[59] Big data is dead, Accessed: 2025-01-12. [Online]. Available: https : / / motherduck . com /

blog/big-data-is-dead/.

[60] Cached takes: 80% of companies do not need snowﬂake or databricks, Accessed: 2025-01-12.

[Online]. Available: https://kjhealey.medium.com/cached- takes-80- of-companies-

do-not-need-snowflake-or-databricks-5ebda64c0853.

[61] Course syllabus - machine learning with big data, Accessed: 2025-01-12. [Online]. Available:

https://www.mdu.se/en/malardalen- university/education/course- syllabus?id=

29200.

[62] Course syllabus - smart digital platforms: Cloud computing, security and big data, Accessed:

2025-01-12. [Online]. Available: https : / / www . mdu . se / en / malardalen - university /

education/course-syllabus?id=30017.

[63] Data engineering i, Accessed: 2025-01-12. [Online]. Available: https ://www. uu .se/en/

study/syllabus?query=47724.

Mathias Asplin Big data analytics and processing

[64] Id2221 data-intensive computing 7.5 credits, Accessed: 2025-01-12. [Online]. Available: https:

//www.kth.se/student/kurser/kurs/ID2221?l=en.

A Appendex Python Imports

import time

import duckdb

import daft as ft

import numpy as np

import pandas as pd

import polars as pl

import pyarrow as pa

import dask.dataframe as dd

from datetime import datetime

Listing 1: Python imports

Mathias Asplin Big data analytics and processing

B Appendix Pandas Code

engine = [’python’, ’c’ , ’pyarrow’ ][0]

dtype backend = [’numpy nullable’, ’pyarrow’][0]

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

df track = pd.read csv(

path to sp track, engine=engine, dtype backend=dtype backend

).drop(’updated on’, axis=1)

df artist track = pd.read csv(

path to sp artist track , engine=engine, dtype backend=dtype backend

).drop(’updated on’, axis=1)

df artist = pd.read csv(

path to sp artist , engine=engine, dtype backend=dtype backend

).drop(’updated on’, axis=1)

df release = pd.read csv(

path to sp release , engine=engine, dtype backend=dtype backend

). drop([ ’ release img ’ , ’updated on’], axis=1)

df audio features = pd.read csv(

path to audio features , engine=engine, dtype backend=dtype backend

).drop([’duration ms’, ’updated on’], axis=1)

Listing 2: Pandas Data Ingestion Read Code

# Merge/Join the columns to create one large table

df = df track.merge( df artist track , on=’track id’, how=’left’)

df = df.merge(df artist , on=’ artist id ’ , how=’left’)

df = df.merge(df release , on=’release id ’ , how=’left’)

df = df.merge(df audio features, on=’isrc’ , how=’left’)

# Drop last unnecessary columns

df = df.drop(

[ ’ isrc ’ , ’track number’, ’ release id ’ ,

’disc number’, ’preview url’ , ’ artist id ’ , ’upc’ ], axis=1

)

Listing 3: Pandas Data Ingestion Merge/Join Code

# Turn miliseconds into seconds

df[ ’duration’] = df.duration ms ∗10∗∗−3

df = df.drop(’duration ms’, axis=1)

Listing 4: Pandas Data Wrangling Duration Code

# Turn string into boolean

df[ ’ explicit ’ ] = df. explicit == ’t’

Listing 5: Pandas Data Wrangling Boolean Code

# Fill dates and turn to datetime

df[ ’ release date ’ ] = df. release date .apply(

lambda x: x + ’−01−01 00:00:00’ if x.count(’−’) == 0 else

x + ’−01 00:00:01’ if x.count(’−’) == 1 else

x + ’ 00:00:02’

)

df[ ’ release date ’ ] = pd.to datetime(

Mathias Asplin Big data analytics and processing

df. release date , format=’mixed’, errors=’coerce’

)

Listing 6: Pandas Data Wrangling Datetime Code

# Find all artist names for tracks

# and combine them then join it with the original dataframe

df = (

df .merge(df.groupby(’track id’ )[ ’artist name’ ]

.apply(list ). reset index (name=’artist name all’), on=’track id’)

)

df[ ’ artist name all ’ ] = df. artist name all . str . join(’ , ’ )

Listing 7: Pandas Data Wrangling Artists Code

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/pandas.csv’

path pyarrow = ’path/to/input/pandas pa.parquet’

Listing 8: Pandas Data I/O Paths Code

# Write to CSV

df . to csv(path csv, sep=’; ’)

Listing 9: Pandas Data I/O Write CSV Code

# Write to parquet (pyarrow)

df.to parquet(path pyarrow, engine=’pyarrow’)

Listing 10: Pandas Data I/O Write Parquet Code

# Read from csv

df = pd.read csv(path csv, sep=’;’ , engine=’c’)

Listing 11: Pandas Data I/O Read CSV Code

# Read from parquet (pyarrow)

df = pd.read parquet(path pyarrow, engine=’pyarrow’)

Listing 12: Pandas Data I/O Read Parquet Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

df[(˜df. artist name all . str .contains(( ’ |’ ). join ( artists ), case=True, na=False)) &

(df. release date >= datetime(2010, 1, 1)) &

(df. release date <datetime(2020, 1, 1)) &

(˜df . explicit ) &

(df.duration >= 3∗60) &

((df. danceability >= 0.8) |(df .energy >= 0.8)) &

(df.tempo >= df.tempo.mean()) &

(df. liveness <= 0.2)

]. sort values ( ’popularity’ , ascending=False).head(10)

Listing 13: Pandas Data Analytics/Exploration Task 1 Code

# Find the artists who has made the most tracks

(

df.groupby(’artist name’). size (). reset index (name=’len’)

Mathias Asplin Big data analytics and processing

. sort values ( ’ len ’ , ascending=False)

)

Listing 14: Pandas Data Analytics/Exploration Task 2 Code

# Find the artist that has made the most popular songs on average

(

df.groupby(’artist name’)[ ’popularity’ ]. mean().reset index(name=’most popular’)

. sort values ( ’most popular’, ascending=False)

)

Listing 15: Pandas Data Analytics/Exploration Task 3 Code

# Find the artist who on average makes the most danceable tracks

(

df.groupby(’artist name’)[ ’ danceability ’ ]. mean().reset index(name=’best dance’)

. sort values ( ’best dance’, ascending=False)

)

Listing 16: Pandas Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

(

df.groupby(’artist name’)[ ’duration’ ]. mean().reset index(name=’long song’)

. sort values ( ’long song’ , ascending=False)

)

Listing 17: Pandas Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

(

df.groupby(’artist name’)[ ’ valence’ ]. mean().reset index(name=’happy song’)

. sort values ( ’happy song’, ascending=False)

)

Listing 18: Pandas Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

(

df .groupby(’artist name’ )[ ’ label name’ ]. unique().apply(len)

. reset index (name=’collaborations’).sort values ( ’ collaborations ’ , ascending=False)

)

Listing 19: Pandas Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

(

df[ df. explicit ]. groupby(’artist name’). size ()

. reset index (name=’profanity’).sort values( ’ profanity ’ , ascending=False)

)

Listing 20: Pandas Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

(

df [ ’ year’ ] = df. release date . dt.year

df.groupby(’year’)[ ’popularity’ ]. mean().reset index(name=’most popular year’)

. sort values ( ’most popular year’, ascending=False)

)

Listing 21: Pandas Data Analytics/Exploration Task 9 Code

Mathias Asplin Big data analytics and processing

# Find the artist with the most popular tracks on average per year

df [ ’ year’ ] = df. release date . dt.year

(

df.groupby([’year’ , ’artist name’ ])[ ’popularity’ ]

.mean()

. reset index(name=’most popular artist year’)

. sort values ( ’ most popular artist year ’ , ascending=False)

.drop duplicates(’year’ , keep=’ﬁrst ’ )

. sort values ( ’ most popular artist year ’ , ascending=False)

)

Listing 22: Pandas Data Analytics/Exploration Task 10 Code

# Find the artist with the most track credits

df temp = (

df.merge(df.drop duplicates(’ track id ’ )[ ’ artist name all ’ ]

.str .count(’ , ’ )

. reset index (name = ’ artist credits ’ ),

left on =df.index, right on=’index’

)[[ ’ track title ’ , ’ artist credits ’ ]]

)

df temp[’ artist credits ’ ] = df temp. artist credits + 1

df temp.sort values (’ artist credits ’ , ascending=False).head(10)

Listing 23: Pandas Data Analytics/Exploration Task 11 Code

# Find the artist most active throughout the years

(

df.groupby(’artist name’)[ ’ release date ’ ]. apply(lambda x: x.max() −x.min())

. reset index (name = ’longest date’). sort values ( ’ longest date ’ , ascending=False)

)

Listing 24: Pandas Data Analytics/Exploration Task 12 Code

# Find the album type distribution

df temp = df.groupby(’album type’).size(). reset index (name = ’percent’)

df temp[’percent’ ] = (df temp[’percent’] / df temp[’percent’ ]. sum()) ∗100

df temp

Listing 25: Pandas Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

df.groupby(’album type’)[’popularity’ ]. mean().reset index(name = ’popularity’)

Listing 26: Pandas Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

(

df[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ , ’ speechiness’ ,

’tempo’, ’valence’ , ’duration’]

]. mean().reset index(name = ’mean’)

)

Listing 27: Pandas Data Analytics/Exploration Task 15 Code

# Find the median of diﬀerent measurements

(

df[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ , ’ speechiness’ ,

’tempo’, ’valence’ , ’duration’]

]. median().reset index(name = ’median’)

)

Mathias Asplin Big data analytics and processing

Listing 28: Pandas Data Analytics/Exploration Task 16 Code

# Find the standard deviation of diﬀerent measurements

(

df[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ , ’ speechiness’ ,

’tempo’, ’valence’ , ’duration’]

]. std (). reset index (name = ’std’)

)

Listing 29: Pandas Data Analytics/Exploration Task 17 Code

# Find out what makes a track popular

(

df.groupby(’popularity’)[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ , ’ speechiness’ ,

’tempo’, ’valence’ , ’duration’]

]. mean()

. reset index (names = ’popularity’).sort values (’popularity’ , ascending=False)

)

Listing 30: Pandas Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

C Appendix Polars In Memory Code

engine = [’python’, ’c’ , ’pyarrow’ ][0]

dtype backend = [’numpy nullable’, ’pyarrow’][0]

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/ sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

df track = pl.read csv(path to sp track ). drop(’updated on’)

df artist track = pl.read csv( path to sp artist track ).drop(’updated on’)

df artist = pl.read csv( path to sp artist ).drop(’updated on’)

df release = pl.read csv( path to sp release ). drop([ ’ release img ’ , ’updated on’])

df audio features = pl.read csv(path to audio features ).drop(

[ ’duration ms’, ’updated on’]

)

Listing 31: Polars In Memory Data Ingestion Read Code

# Merge/Join the columns to create one large table

df = df track. join ( df artist track , on=’track id’, how=’left’)

df = df.join( df artist , on=’ artist id ’ , how=’left’)

df = df. join( df release , on=’release id ’ , how=’left’)

df = df.join( df audio features , on=’isrc’ , how=’left’)

# Drop last unnecessary columns

df = df.drop(

[ ’ isrc ’ , ’track number’, ’ release id ’ , ’disc number’,

’preview url’ , ’ artist id ’ , ’upc’]

)

Listing 32: Polars In Memory Data Ingestion Merge/Join Code

# Turn miliseconds into seconds

df = df.with columns(

duration = pl.col( ’duration ms’) ∗10∗∗−3

).drop(’duration ms’)

Listing 33: Polars In Memory Data Wrangling Duration Code

# Turn string into boolean

df = df.with columns(

explicit = pl.col( ’ explicit ’ ). replace strict ({’ t’ : True, ’ f ’ : False})

)

Listing 34: Polars In Memory Data Wrangling Boolean Code

# Fill dates and turn to datetime

df = df.with columns(

release date = pl.when(

pl . col( ’ release date ’ ). str .count matches(’−’) == 0

).then(

pl . col( ’ release date ’ ) + ’−01−01 00:00:00’

).otherwise(

pl .when(

pl . col( ’ release date ’ ). str .count matches(’−’) == 1

).then(

pl . col( ’ release date ’ ) + ’−01 00:00:01’

).otherwise(

pl . col( ’ release date ’ ) + ’ 00:00:02’

)

). str .to datetime(’%Y−%m−%d %H:%M:%S’)

)

Mathias Asplin Big data analytics and processing

Listing 35: Polars In Memory Data Wrangling Datetime Code

# Find all artist names for tracks

# and combine them then join it with the original dataframe

df = df.group by(’track id ’ ). agg(

artist name all = pl.col (’artist name’)

).with columns(

artist name all = pl.col(’ artist name all ’ ). list . join(separator=’,’ )

). join(

df, on=’track id’

)

Listing 36: Polars In Memory Data Wrangling Artists Code

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/PolarsInMemory.csv’

path parquet = ’path/to/input/PolarsInMemory.parquet’

path pyarrow = ’path/to/input/PolarsInMemory pa.parquet’

Listing 37: Polars In Memory Data I/O Paths Code

# Write to csv

df. write csv(path csv, separator=’; ’)

Listing 38: Polars In Memory Data I/O Write CSV Code

# Write to parquet (rust)

df.write parquet(path parquet, compression=’snappy’)

Listing 39: Polars In Memory Data I/O Write Parquet (Rust) Code

# Write to parquet (pyarrow)

df.write parquet(path pyarrow, compression=’snappy’, use pyarrow=True)

Listing 40: Polars In Memory Data I/O Write Parquet (PyArrow) Code

# Read from csv

df = pl.read csv(path csv, separator=’;’ )

Listing 41: Polars In Memory Data I/O Read CSV Code

# Read from parquet (rust)

df = pl.read parquet(path parquet)

Listing 42: Polars In Memory Data I/O Read Parquet (Rust) Code

# Read from parquet (pyarrow)

df = pl.read parquet(path pyarrow, use pyarrow=True)

Listing 43: Polars In Memory Data I/O Read Parquet (PyArrow) Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

df. ﬁlter (

(˜pl. col( ’ artist name all ’ ). str .contains any( artists )) &

Mathias Asplin Big data analytics and processing

(pl . col(’ release date ’ ) >= datetime(2010, 1, 1)) &

(pl . col(’ release date ’ ) <datetime(2020, 1, 1)) &

(˜pl. col( ’ explicit ’ )) &

(pl . col( ’duration’) >= 3∗60) &

((pl . col( ’ danceability ’) >= 0.8) |( pl . col ( ’energy’) >= 0.8)) &

(pl . col( ’tempo’) >= pl.col(’tempo’).mean()) &

(pl . col( ’ liveness ’ ) <= 0.2)

). sort(’popularity’ , descending=True).head(10)

Listing 44: Polars In Memory Data Analytics/Exploration Task 1 Code

# Find the artists who has made the most tracks

df .group by(’artist name’ ). len (). sort (’ len ’ , descending=True)

Listing 45: Polars In Memory Data Analytics/Exploration Task 2 Code

# Find the artist who on average makes the most danceable tracks

df .group by(’artist name’ ). agg(

best dance = pl.col( ’ danceability ’ ). mean()

). sort( ’best dance’, descending=True)

Listing 46: Polars In Memory Data Analytics/Exploration Task 3 Code

# Find the artist that makes the longest tracks on average

df .group by(’artist name’ ). agg(

long song = pl.col (’duration’ ). mean()

). sort( ’ long song’ , descending=True)

Listing 47: Polars In Memory Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

df .group by(’artist name’ ). agg(

long song = pl.col (’duration’ ). mean()

). sort( ’ long song’ , descending=True)

Listing 48: Polars In Memory Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

df .group by(’artist name’ ). agg(

happy song = pl.col(’valence’ ). mean()

). sort( ’happy song’, descending=True)

Listing 49: Polars In Memory Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

df .group by(’artist name’ ). agg(

collaborations = pl.col(’label name’).unique().len ()

). sort( ’ collaborations ’ , descending=True)

Listing 50: Polars In Memory Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

(

df. ﬁlter ( pl . col ( ’ explicit ’ )). group by(’artist name’)

.len (). sort (’ len ’ , descending=True)

)

Listing 51: Polars In Memory Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

df.with columns(

year = pl.col (’ release date ’ ).dt.year()

Mathias Asplin Big data analytics and processing

). group by(’year’ ). agg(

most popular year = pl.col(’popularity’ ). mean()

). sort(

’most popular year’, descending=True

)

Listing 52: Polars In Memory Data Analytics/Exploration Task 9 Code

# Find the artist with the most popular tracks on average per year

df.with columns(

year = pl.col (’ release date ’ ).dt.year()

).group by([’year’ , ’artist name’ ]). agg(

most popular artist year = pl.col( ’popularity’ ).mean()

). sort(

’ most popular artist year ’ , descending=True

).unique(

’year’ , keep=’ﬁrst ’

). sort ( ’ most popular artist year ’ , descending=True)

Listing 53: Polars In Memory Data Analytics/Exploration Task 10 Code

# Find the artist with the most track credits

df.unique(’track id ’ ). select (

pl . col ( ’ track title ’ ),

artist credits = pl.col(’ artist name all ’ ). str .count matches(’,’) + 1

). sort( ’ artist credits ’ , descending=True).head(10)

Listing 54: Polars In Memory Data Analytics/Exploration Task 11 Code

# Find the artist most active throughout the years

df .group by(’artist name’ ). agg(

pl . col( ’ release date ’ ). max() −pl.col(’ release date ’ ). min()

). sort( ’ release date ’ , descending=True)

Listing 55: Polars In Memory Data Analytics/Exploration Task 12 Code

# Find the album type distribution

df.group by(’album type’).len().with columns(

(pl . col( ’ len ’ ) / pl .sum(’len’ )). alias ( ’percent’)

)

Listing 56: Polars In Memory Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

df.group by(’album type’).agg(pl.col( ’popularity’ ). mean())

Listing 57: Polars In Memory Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

df. select (

pl .mean(’danceability’, ’energy’, ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’)

)

Listing 58: Polars In Memory Data Analytics/Exploration Task 15 Code

# Find the median of diﬀerent measurements

df. select (

pl .median(’danceability’ , ’energy’, ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’)

)

Listing 59: Polars In Memory Data Analytics/Exploration Task 16 Code

Mathias Asplin Big data analytics and processing

# Find the standard deviation of diﬀerent measurements

df. select (

pl .std( ’ danceability ’ ),

pl . std( ’ energy’ ),

pl . std( ’ liveness ’ ),

pl . std( ’ speechiness ’ ),

pl . std( ’ tempo’),

pl . std( ’ valence ’ ),

pl . std( ’ duration’)

)

Listing 60: Polars In Memory Data Analytics/Exploration Task 17 Code

# Find out what makes a track popular

df.group by(’popularity’ ). agg(

pl . col( ’ danceability ’ ). mean(),

pl . col ( ’energy’ ). mean(),

pl . col ( ’ liveness ’ ). mean(),

pl . col ( ’ speechiness’ ). mean(),

pl . col ( ’tempo’).mean(),

pl . col ( ’valence’ ). mean(),

pl . col ( ’duration’ ). mean()

). sort(’popularity’ , descending=True)

Listing 61: Polars In Memory Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

D Appendix Polars Out Of Core Code

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/ sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

df track = pl.scan csv(path to sp track ).drop(’updated on’)

df artist track = pl.scan csv( path to sp artist track ). drop(’updated on’)

df artist = pl.scan csv( path to sp artist ).drop(’updated on’)

df release = pl.scan csv( path to sp release ). drop([’ release img ’ , ’ updated on’])

df audio features = pl.scan csv(path to audio features ). drop(

[ ’duration ms’, ’updated on’]

)

# Merge/Join the columns to create one large table

df = df track. join ( df artist track , on=’track id’, how=’left’)

df = df.join( df artist , on=’ artist id ’ , how=’left’)

df = df. join( df release , on=’release id ’ , how=’left’)

df = df.join( df audio features , on=’isrc’ , how=’left’)

# Drop last unnecessary columns

df = df.drop(

[ ’ isrc ’ , ’track number’, ’ release id ’ , ’disc number’,

’preview url’ , ’ artist id ’ , ’upc’]

). collect ()

Listing 62: Polars Out Of Core Data Ingestion All Code

# Turn miliseconds into seconds

df = df.with columns(

duration = pl.col( ’duration ms’) ∗10∗∗−3

).drop(’duration ms’)

# Turn string into boolean

df = df.with columns(

explicit = pl.col( ’ explicit ’ ). replace strict ({’ t’ : True, ’ f ’ : False})

)

# Fill dates and turn to datetime

df = df.with columns(

release date = pl.when(

pl . col( ’ release date ’ ). str .count matches(’−’) == 0

).then(

pl . col( ’ release date ’ ) + ’−01−01 00:00:00’

).otherwise(

pl .when(

pl . col( ’ release date ’ ). str .count matches(’−’) == 1

).then(

pl . col( ’ release date ’ ) + ’−01 00:00:01’

).otherwise(

pl . col( ’ release date ’ ) + ’ 00:00:02’

)

). str .to datetime(’%Y−%m−%d %H:%M:%S’)

)

# Find all artist names for tracks

and combine them then join it with the original dataframe

df = df.group by(’track id ’ ). agg(

artist name all = pl.col (’artist name’)

).with columns(

artist name all = pl.col(’ artist name all ’ ). list . join(separator=’,’ )

). join(

df, on=’track id’

). collect ()

Listing 63: Polars Out Of Core Data Wrangling All Code

Mathias Asplin Big data analytics and processing

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/PolarsOutOfCore.csv’

path pyarrow = ’path/to/input/PolarsOutOfCore.parquet’

Listing 64: Polars Out Of Core Data I/O Paths Code

# Write to csv

df. sink csv(path csv, separator=’;’)

Listing 65: Polars Out Of Core Data I/O Write CSV Code

# Write to parquet (rust)

df.sink parquet(path parquet)

Listing 66: Polars Out Of Core Data I/O Write Parquet Code

# Lazy read from csv

df = pl.scan csv(path csv, separator=’;’ ). collect ()

Listing 67: Polars Out Of Core Data I/O Read CSV Code

# Lazy read from parquet

df = pl.scan parquet(path parquet).collect ()

Listing 68: Polars Out Of Core Data I/O Read Parquet Code

# Lazy read from parquet

path to parquet = ’path/to/parquet/∗.parquet’

df = pl.scan parquet(path to parquet)

Listing 69: Polars Out Of Core Data Analytics/Exploration Lazy Read Parquet Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

df. ﬁlter (

(˜pl. col( ’ artist name all ’ ). str .contains any( artists )) &

(pl . col(’ release date ’ ) >= datetime(2010, 1, 1)) &

(pl . col(’ release date ’ ) <datetime(2020, 1, 1)) &

(˜pl. col( ’ explicit ’ )) &

(pl . col( ’duration’) >= 3∗60) &

((pl . col( ’ danceability ’) >= 0.8) |( pl . col ( ’energy’) >= 0.8)) &

(pl . col( ’tempo’) >= pl.col(’tempo’).mean()) &

(pl . col( ’ liveness ’ ) <= 0.2)

). sort(’popularity’ , descending=True).head(10).collect()

Listing 70: Polars Out Of Core Data Analytics/Exploration Task 1 Code

# Find the artists who has made the most tracks

df .group by(’artist name’ ). len (). sort (’ len ’ , descending=True).collect()

Listing 71: Polars Out Of Core Data Analytics/Exploration Task 2 Code

# Find the artist that has made the most popular songs on average

df .group by(’artist name’ ). agg(

most popular = pl.col(’popularity’ ). mean()

). sort( ’most popular’, descending=True).collect()

Listing 72: Polars Out Of Core Data Analytics/Exploration Task 3 Code

Mathias Asplin Big data analytics and processing

# Find the artist who on average makes the most danceable tracks

df .group by(’artist name’ ). agg(

best dance = pl.col( ’ danceability ’ ). mean()

). sort( ’best dance’, descending=True).collect()

Listing 73: Polars Out Of Core Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

df .group by(’artist name’ ). agg(

long song = pl.col (’duration’ ). mean()

). sort( ’ long song’ , descending=True).collect()

Listing 74: Polars Out Of Core Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

df .group by(’artist name’ ). agg(

happy song = pl.col(’valence’ ). mean()

). sort( ’happy song’, descending=True).collect()

Listing 75: Polars Out Of Core Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

df .group by(’artist name’ ). agg(

collaborations = pl.col(’label name’).unique().len ()

). sort( ’ collaborations ’ , descending=True).collect()

Listing 76: Polars Out Of Core Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

(

df. ﬁlter ( pl . col ( ’ explicit ’ )). group by(’artist name’)

.len (). sort (’ len ’ , descending=True).collect()

)

Listing 77: Polars Out Of Core Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

df.with columns(

year = pl.col (’ release date ’ ).dt.year()

). group by(’year’ ). agg(

most popular year = pl.col(’popularity’ ). mean()

). sort(

’most popular year’, descending=True

). collect ()

Listing 78: Polars Out Of Core Data Analytics/Exploration Task 9 Code

# Find the artist with the most popular tracks on average per year

df.with columns(

year = pl.col (’ release date ’ ).dt.year()

).group by([’year’ , ’artist name’ ]). agg(

most popular artist year = pl.col( ’popularity’ ).mean()

). sort(

’ most popular artist year ’ , descending=True

).unique(

’year’ , keep=’ﬁrst ’

). sort ( ’ most popular artist year ’ , descending=True).collect()

Listing 79: Polars Out Of Core Data Analytics/Exploration Task 10 Code

# Find the artist with the most track credits

df.unique(’track id ’ ). select (

Mathias Asplin Big data analytics and processing

pl . col ( ’ track title ’ ),

artist credits = pl.col(’ artist name all ’ ). str .count matches(’,’) + 1

). sort( ’ artist credits ’ , descending=True).head(10).collect()

Listing 80: Polars Out Of Core Data Analytics/Exploration Task 11 Code

# Find the artist most active throughout the years

df .group by(’artist name’ ). agg(

pl . col( ’ release date ’ ). max() −pl.col(’ release date ’ ). min()

). sort( ’ release date ’ , descending=True).collect()

Listing 81: Polars Out Of Core Data Analytics/Exploration Task 12 Code

# Find the album type distribution

(

df.group by(’album type’).len().with columns(

(pl . col( ’ len ’ ) / pl .sum(’len’ )). alias ( ’percent’)

). collect ()

)

Listing 82: Polars Out Of Core Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

df.group by(’album type’).agg(pl.col( ’popularity’ ). mean()).collect ()

Listing 83: Polars Out Of Core Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

df. select (

pl .mean(’danceability’, ’energy’, ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’)

). collect ()

Listing 84: Polars Out Of Core Data Analytics/Exploration Task 15 Code

# Find the median of diﬀerent measurements

df. select (

pl .median(’danceability’ , ’energy’, ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’)

). collect ()

Listing 85: Polars Out Of Core Data Analytics/Exploration Task 16 Code

# Find the standard deviation of diﬀerent measurements

df. select (

pl .std( ’ danceability ’ ),

pl . std( ’ energy’ ),

pl . std( ’ liveness ’ ),

pl . std( ’ speechiness ’ ),

pl . std( ’ tempo’),

pl . std( ’ valence ’ ),

pl . std( ’ duration’)

). collect ()

Listing 86: Polars Out Of Core Data Analytics/Exploration Task 17 Code

# Find out what makes a track popular

df.group by(’popularity’ ). agg(

pl . col( ’ danceability ’ ). mean(),

pl . col ( ’energy’ ). mean(),

pl . col ( ’ liveness ’ ). mean(),

pl . col ( ’ speechiness’ ). mean(),

pl . col ( ’tempo’).mean(),

Mathias Asplin Big data analytics and processing

pl . col ( ’valence’ ). mean(),

pl . col ( ’duration’ ). mean()

). sort(’popularity’ , descending=True).collect()

Listing 87: Polars Out Of Core Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

E Appendix Duckdb Code

# Connect to the database

con = duckdb.connect()

Listing 88: Duckdb Data Ingestion Read Code

engine = [’python’, ’c’ , ’pyarrow’ ][0]

dtype backend = [’numpy nullable’, ’pyarrow’][0]

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/ sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

con.sql (f”””

CREATE TABLE track AS

SELECT ∗EXCLUDE (updated on) FROM read csv(’{path to sp track}’);

CREATE TABLE artist track AS

SELECT ∗EXCLUDE (updated on) FROM read csv(’{path to sp artist track}’);

CREATE TABLE artist AS

SELECT ∗EXCLUDE (updated on) FROM read csv(’{path to sp artist}’);

CREATE TABLE release AS

SELECT ∗EXCLUDE (release img, updated on) FROM

read csv(’{path to sp release }’);

CREATE TABLE audio features AS

SELECT ∗EXCLUDE (duration ms, updated on) FROM

read csv(’{path to audio features }’);

”””)

Listing 89: Duckdb Data Ingestion Read Code

# Merge/Join the columns to create one large table

con.sql (f”””

CREATE TABLE df AS

(

SELECT ∗FROM track

LEFT JOIN artist track ON track.track id = artist track.track id

LEFT JOIN artist ON artist track.artist id = artist . artist id

LEFT JOIN release ON track.release id = release.release id

LEFT JOIN audio features ON track.isrc = audio features.isrc

);

ALTER TABLE df DROP COLUMN track id 1;

ALTER TABLE df DROP COLUMN artist id 1;

ALTER TABLE df DROP COLUMN release id 1;

ALTER TABLE df DROP COLUMN isrc 1;

ALTER TABLE df DROP COLUMN isrc;

ALTER TABLE df DROP COLUMN track number;

ALTER TABLE df DROP COLUMN release id;

ALTER TABLE df DROP COLUMN disc number;

ALTER TABLE df DROP COLUMN preview url;

ALTER TABLE df DROP COLUMN artist id;

ALTER TABLE df DROP COLUMN upc;

”””)

Listing 90: Duckdb Data Ingestion Merge/Join Code

# Turn miliseconds into seconds

con.sql (f”””

ALTER TABLE df RENAME duration ms TO duration;

ALTER TABLE df ALTER duration TYPE DOUBLE;

UPDATE df SET duration = duration ∗0.001;

”””)

Listing 91: Duckdb Data Wrangling Duration Code

Mathias Asplin Big data analytics and processing

# Turn string into boolean

# Already interpreted as boolean

Listing 92: Duckdb Data Wrangling Boolean Code

# Fill dates and turn to datetime

con.sql (f”””

CREATE OR REPLACE TABLE df AS

(SELECT ∗, (LENGTH(release date) −

LENGTH(replace(release date, ’−’, ’’)))/LENGTH(’−’) AS

Count FROM df);

CREATE OR REPLACE TABLE df AS

(SELECT ∗, CASE WHEN Count = 0 THEN concat(release date, ’−01−01 00:00:00’)

WHEN Count = 1 THEN concat(release date, ’−01 00:00:01’) ELSE

concat(release date , ’ 00:00:02’) END AS release date new FROM df);

ALTER TABLE df ALTER release date new TYPE DATETIME;

ALTER TABLE df DROP COLUMN release date;

ALTER TABLE df DROP COLUMN Count;

ALTER TABLE df RENAME release date new TO release date;

”””)

Listing 93: Duckdb Data Wrangling Datetime Code

# Find all artist names for tracks

# and combine them then join it with the original dataframe

con.sql (f”””

CREATE OR REPLACE TEMP VIEW temp AS

SELECT track id, list string agg( list (artist name)) AS

artist name all FROM df

GROUP BY track id;

CREATE OR REPLACE TABLE df AS

SELECT ∗FROM df

LEFT JOIN

temp ON df.track id = temp.track id;

ALTER TABLE df DROP COLUMN track id 1;

DROP VIEW temp

”””)

Listing 94: Duckdb Data Wrangling Artists Code

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/Duckdb.csv’

path pyarrow = ’path/to/input/Duckdb pa.parquet’

Listing 95: Duckdb Data I/O Paths Code

# Write to CSV

con.sql (f”””

COPY df TO ’{path csv}’ (HEADER, DELIMITER ’;’)

”””)

Listing 96: Duckdb Data I/O Write CSV Code

# Write to parquet (pyarrow)

con.sql (f”””

COPY df TO ’{path pyarrow}’ (COMPRESSION ’snappy’)

”””)

Listing 97: Duckdb Data I/O Write Parquet Code

# Read from csv

con.sql (f”””

CREATE OR REPLACE TABLE df AS

Mathias Asplin Big data analytics and processing

SELECT ∗FROM read csv(’{path csv}’, DELIM = ’;’, HEADER = true)

”””)

Listing 98: Duckdb Data I/O Read CSV Code

# Read from parquet (pyarrow)

con.sql (f”””

CREATE OR REPLACE TABLE df AS

SELECT ∗FROM read parquet(’{path pyarrow}’)

”””)

Listing 99: Duckdb Data I/O Read Parquet Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

print(

con.sql (f”””

SELECT

∗

FROM

WHERE

NOT regexp matches(artist name all, ’{(’|’). join ( artists )}’)

AND release date >= ’2010−1−1’

AND release date <’2020−1−1’

AND explicit

AND duration >= 3∗60

AND (danceability >= 0.8 OR energy >= 0.8)

AND tempo >= (SELECT mean(tempo) FROM df)

AND liveness <= 0.2

ORDER BY

popularity DESC

LIMIT 10

”””).show(max width=125)

)

Listing 100: Duckdb Data Analytics/Exploration Task 1 Code

# Find the artists who has made the most tracks

print(

con.sql (f”””

SELECT

artist name, COUNT(∗) AS Count

FROM

GROUP BY

artist name

ORDER BY

Count DESC

”””)

)

Listing 101: Duckdb Data Analytics/Exploration Task 2 Code

# Find the artist that has made the most popular songs on average

print(

con.sql (f”””

SELECT

artist name, mean(popularity) AS Mean

FROM

GROUP BY

Mathias Asplin Big data analytics and processing

artist name

ORDER BY

Mean DESC

”””)

)

Listing 102: Duckdb Data Analytics/Exploration Task 3 Code

# Find the artist who on average makes the most danceable tracks

print(

con.sql (f”””

SELECT

artist name, mean(danceability) AS Mean

FROM

GROUP BY

artist name

ORDER BY

Mean DESC

”””)

)

Listing 103: Duckdb Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

print(

con.sql (f”””

SELECT

artist name, mean(duration) AS Mean

FROM

GROUP BY

artist name

ORDER BY

Mean DESC

”””)

)

Listing 104: Duckdb Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

print(

con.sql (f”””

SELECT

artist name, mean(valence) AS Mean

FROM

GROUP BY

artist name

ORDER BY

Mean DESC

”””)

)

Listing 105: Duckdb Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

print(

con.sql (f”””

SELECT

artist name, COUNT(DISTINCT label name) AS Collaborations

FROM

GROUP BY

artist name

ORDER BY

Mathias Asplin Big data analytics and processing

Collaborations DESC

”””)

)

Listing 106: Duckdb Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

print(

con.sql (f”””

SELECT

artist name, COUNT(∗) AS Profanity

FROM

WHERE

explicit

GROUP BY

artist name

ORDER BY

Profanity DESC

”””)

)

Listing 107: Duckdb Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

print(

con.sql (f”””

SELECT

date part (’year ’, release date ) AS Year, mean(popularity) AS Mean

FROM

GROUP BY

date part (’year ’, release date )

ORDER BY

Mean DESC

”””)

)

Listing 108: Duckdb Data Analytics/Exploration Task 9 Code

# Find the artist with the most popular tracks on average per year

print(

con.sql (f”””

SELECT

Year, artist name, Popularity

FROM

(SELECT

∗, ROW NUMBER() OVER (PARTITION BY Year

ORDER BY Popularity DESC) AS RN

FROM

(SELECT

date part (’year ’, release date ) AS

Year, artist name, mean(popularity) AS Popularity

FROM

GROUP BY

date part (’year ’, release date ), artist name

ORDER BY

Popularity DESC))

WHERE

RN = 1

ORDER BY

Popularity DESC

”””)

)

Listing 109: Duckdb Data Analytics/Exploration Task 10 Code

Mathias Asplin Big data analytics and processing

# Find the artist with the most track credits

print(

con.sql (f”””

SELECT

track title , (LENGTH(artist name all) −

LENGTH(replace(artist name all, ’,’, ’’)))/ LENGTH(’,’) + 1 AS Count

FROM

(SELECT

∗, ROW NUMBER() OVER (PARTITION BY track id ORDER BY

track id DESC) AS RN

FROM

df)

WHERE

RN = 1

ORDER BY

Count DESC

LIMIT 10

”””).show(max width=125)

)

Listing 110: Duckdb Data Analytics/Exploration Task 11 Code

# Find the artist most active throughout the years

print(

con.sql (f”””

SELECT

artist name, (max(release date) −min(release date)) AS Longest

FROM

GROUP BY

artist name

ORDER BY

Longest DESC

”””)

)

Listing 111: Duckdb Data Analytics/Exploration Task 12 Code

# Find the album type distribution

print(

con.sql (f”””

SELECT

album type, Count/(SELECT COUNT(∗) FROM df) AS Percentage

FROM

(SELECT

album type, COUNT(∗) AS Count

FROM

GROUP BY

album type)

”””)

)

Listing 112: Duckdb Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

print(

con.sql (f”””

SELECT

album type, mean(popularity) AS Popularity

FROM

GROUP BY

album type

”””)

)

Mathias Asplin Big data analytics and processing

Listing 113: Duckdb Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

print(

con.sql (f”””

SELECT

round(mean(danceability), 6) AS Danceability,

round(mean(energy), 6) AS Energy,

round(mean(liveness), 6) AS Liveness,

round(mean(speechiness), 6) AS Speechiness,

round(mean(tempo), 6) AS Tempo,

round(mean(valence), 6) AS Valence,

round(mean(duration), 6) AS Duration

FROM

”””)

)

Listing 114: Duckdb Data Analytics/Exploration Task 15 Code

# Find the median of diﬀerent measurements

print(

con.sql (f”””

SELECT

round(median(danceability), 6) AS Danceability,

round(median(energy), 6) AS Energy,

round(median(liveness), 6) AS Liveness,

round(median(speechiness), 6) AS Speechiness,

round(median(tempo), 6) AS Tempo,

round(median(valence), 6) AS Valence,

round(median(duration), 6) AS Duration

FROM

”””)

)

Listing 115: Duckdb Data Analytics/Exploration Task 16 Code

# Find the standard deviation of diﬀerent measurements

print(

con.sql (f”””

SELECT

round(stddev(danceability), 6) AS Danceability,

round(stddev(energy), 6) AS Energy,

round(stddev(liveness), 6) AS Liveness,

round(stddev(speechiness), 6) AS Speechiness,

round(stddev(tempo), 6) AS Tempo,

round(stddev(valence), 6) AS Valence,

round(stddev(duration), 6) AS Duration

FROM

”””)

)

Listing 116: Duckdb Data Analytics/Exploration Task 17 Code

# Find out what makes a track popular

print(

con.sql (f”””

SELECT

popularity,

round(mean(danceability), 6) AS Danceability,

round(mean(energy), 6) AS Energy,

round(mean(liveness), 6) AS Liveness,

round(mean(speechiness), 6) AS Speechiness,

Mathias Asplin Big data analytics and processing

round(mean(tempo), 6) AS Tempo,

round(mean(valence), 6) AS Valence,

round(mean(duration), 6) AS Duration

FROM

GROUP BY

popularity

ORDER BY

popularity DESC

”””)

)

Listing 117: Duckdb Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

F Appendix Duckdb View Code

# Connect to the database

con = duckdb.connect()

Listing 118: Duckdb Data Ingestion Read Code

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/ sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

# View does not support dropping columns

# So this is not included in the tests

Listing 119: Duckdb View Data Ingestion Read Code

# Merge/Join the columns to create one large table

# View does not support dropping columns

# So this is not included in the tests

Listing 120: Duckdb View Data Ingestion Merge/Join Code

# Turn miliseconds into seconds

# View does not support dropping columns

# So this is not included in the tests

Listing 121: Duckdb View Data Wrangling Duration Code

# Turn string into boolean

# View does not support dropping columns

# So this is not included in the tests

Listing 122: Duckdb View Data Wrangling Boolean Code

# Fill dates and turn to datetime

# View does not support dropping columns

# So this is not included in the tests

Listing 123: Duckdb View Data Wrangling Datetime Code

# Find all artist names for tracks

# and combine them then join it with the original dataframe

# View does not support dropping columns

# So this is not included in the tests

Listing 124: Duckdb View Data Wrangling Artists Code

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/DuckdbView.csv’

path pyarrow = ’path/to/input/DuckdbView pa.parquet’

Listing 125: Duckdb View Data I/O Paths Code

Mathias Asplin Big data analytics and processing

# Write to CSV

# View does not support dropping columns

# So this is not included in the tests

Listing 126: Duckdb View Data I/O Write CSV Code

# Write to parquet (pyarrow)

# View does not support dropping columns

# So this is not included in the tests

Listing 127: Duckdb View Data I/O Write Parquet Code

# Read from csv

# View does not support dropping columns

# So this is not included in the tests

Listing 128: Duckdb View Data I/O Read CSV Code

# Read from parquet (pyarrow)

# View does not support dropping columns

# So this is not included in the tests

Listing 129: Duckdb View Data I/O Read Parquet Code

# Lazy read from parquet

path to parquet = ’path/to/parquet/∗.parquet’

con.sql (f”””

CREATE OR REPLACE VIEW df AS

SELECT ∗FROM read parquet({path to parquet})

”””)

Listing 130: Duckdb View Data Analytics/Exploration Lazy Read Parquet Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

print(

con.sql (f”””

SELECT

∗

FROM

WHERE

NOT regexp matches(artist name all, ’{(’|’). join ( artists )}’)

AND release date >= ’2010−1−1’

AND release date <’2020−1−1’

AND explicit

AND duration >= 3∗60

AND (danceability >= 0.8 OR energy >= 0.8)

AND tempo >= (SELECT mean(tempo) FROM df)

AND liveness <= 0.2

ORDER BY

popularity DESC

LIMIT 10

”””).show(max width=125)

)

Listing 131: Duckdb View Data Analytics/Exploration Task 1 Code

Mathias Asplin Big data analytics and processing

# Find the artists who has made the most tracks

print(

con.sql (f”””

SELECT

artist name, COUNT(∗) AS Count

FROM

GROUP BY

artist name

ORDER BY

Count DESC

”””)

)

Listing 132: Duckdb View Data Analytics/Exploration Task 2 Code

# Find the artist that has made the most popular songs on average

print(

con.sql (f”””

SELECT

artist name, mean(popularity) AS Mean

FROM

GROUP BY

artist name

ORDER BY

Mean DESC

”””)

)

Listing 133: Duckdb View Data Analytics/Exploration Task 3 Code

# Find the artist who on average makes the most danceable tracks

print(

con.sql (f”””

SELECT

artist name, mean(danceability) AS Mean

FROM

GROUP BY

artist name

ORDER BY

Mean DESC

”””)

)

Listing 134: Duckdb View Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

print(

con.sql (f”””

SELECT

artist name, mean(duration) AS Mean

FROM

GROUP BY

artist name

ORDER BY

Mean DESC

”””)

)

Listing 135: Duckdb View Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

print(

Mathias Asplin Big data analytics and processing

con.sql (f”””

SELECT

artist name, mean(valence) AS Mean

FROM

GROUP BY

artist name

ORDER BY

Mean DESC

”””)

)

Listing 136: Duckdb View Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

print(

con.sql (f”””

SELECT

artist name, COUNT(DISTINCT label name) AS Collaborations

FROM

GROUP BY

artist name

ORDER BY

Collaborations DESC

”””)

)

Listing 137: Duckdb View Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

print(

con.sql (f”””

SELECT

artist name, COUNT(∗) AS Profanity

FROM

WHERE

explicit

GROUP BY

artist name

ORDER BY

Profanity DESC

”””)

)

Listing 138: Duckdb View Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

print(

con.sql (f”””

SELECT

date part (’year ’, release date ) AS Year, mean(popularity) AS Mean

FROM

GROUP BY

date part (’year ’, release date )

ORDER BY

Mean DESC

”””)

)

Listing 139: Duckdb View Data Analytics/Exploration Task 9 Code

# Find the artist with the most popular tracks on average per year

df [ ’ year’ ] = df. release date . dt.year

Mathias Asplin Big data analytics and processing

print(

con.sql (f”””

SELECT

Year, artist name, Popularity

FROM

(SELECT

∗, ROW NUMBER() OVER (PARTITION BY Year

ORDER BY Popularity DESC) AS RN

FROM

(SELECT

date part (’year ’, release date ) AS

Year, artist name, mean(popularity) AS Popularity

FROM

GROUP BY

date part (’year ’, release date ), artist name

ORDER BY

Popularity DESC))

WHERE

RN = 1

ORDER BY

Popularity DESC

”””)

)

Listing 140: Duckdb View Data Analytics/Exploration Task 10 Code

# Find the artist with the most track credits

print(

con.sql (f”””

SELECT

track title , (LENGTH(artist name all) −

LENGTH(replace(artist name all, ’,’, ’’)))/ LENGTH(’,’) + 1 AS Count

FROM

(SELECT

∗, ROW NUMBER() OVER (PARTITION BY track id

ORDER BY track id DESC) AS RN

FROM

df)

WHERE

RN = 1

ORDER BY

Count DESC

LIMIT 10

”””).show(max width=125)

)

Listing 141: Duckdb View Data Analytics/Exploration Task 11 Code

# Find the artist most active throughout the years

print(

con.sql (f”””

SELECT

artist name, (max(release date) −min(release date)) AS Longest

FROM

GROUP BY

artist name

ORDER BY

Longest DESC

”””)

)

Listing 142: Duckdb View Data Analytics/Exploration Task 12 Code

# Find the album type distribution

print(

Mathias Asplin Big data analytics and processing

con.sql (f”””

SELECT

album type, Count/(SELECT COUNT(∗) FROM df) AS Percentage

FROM

(SELECT

album type, COUNT(∗) AS Count

FROM

GROUP BY

album type)

”””)

)

Listing 143: Duckdb View Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

print(

con.sql (f”””

SELECT

album type, mean(popularity) AS Popularity

FROM

GROUP BY

album type

”””)

)

Listing 144: Duckdb View Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

print(

con.sql (f”””

SELECT

round(mean(danceability), 6) AS Danceability,

round(mean(energy), 6) AS Energy,

round(mean(liveness), 6) AS Liveness,

round(mean(speechiness), 6) AS Speechiness,

round(mean(tempo), 6) AS Tempo,

round(mean(valence), 6) AS Valence,

round(mean(duration), 6) AS Duration

FROM

”””)

)

Listing 145: Duckdb View Data Analytics/Exploration Task 15 Code

# Find the median of diﬀerent measurements

print(

con.sql (f”””

SELECT

round(median(danceability), 6) AS Danceability,

round(median(energy), 6) AS Energy,

round(median(liveness), 6) AS Liveness,

round(median(speechiness), 6) AS Speechiness,

round(median(tempo), 6) AS Tempo,

round(median(valence), 6) AS Valence,

round(median(duration), 6) AS Duration

FROM

”””)

)

Listing 146: Duckdb View Data Analytics/Exploration Task 16 Code

# Find the standard deviation of diﬀerent measurements

Mathias Asplin Big data analytics and processing

print(

con.sql (f”””

SELECT

round(stddev(danceability), 6) AS Danceability,

round(stddev(energy), 6) AS Energy,

round(stddev(liveness), 6) AS Liveness,

round(stddev(speechiness), 6) AS Speechiness,

round(stddev(tempo), 6) AS Tempo,

round(stddev(valence), 6) AS Valence,

round(stddev(duration), 6) AS Duration

FROM

”””)

)

Listing 147: Duckdb View Data Analytics/Exploration Task 17 Code

# Find out what makes a track popular

print(

con.sql (f”””

SELECT

popularity,

round(mean(danceability), 6) AS Danceability,

round(mean(energy), 6) AS Energy,

round(mean(liveness), 6) AS Liveness,

round(mean(speechiness), 6) AS Speechiness,

round(mean(tempo), 6) AS Tempo,

round(mean(valence), 6) AS Valence,

round(mean(duration), 6) AS Duration

FROM

GROUP BY

popularity

ORDER BY

popularity DESC

”””)

)

Listing 148: Duckdb View Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

G Appendix Daft In Memory Code

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/ sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

df track = ft.read csv(

path to sp track

). exclude(’updated on’). collect ()

df artist track = ft.read csv

( path to sp artist track

). exclude(’updated on’). collect ()

df artist = ft.read csv(

path to sp artist

). exclude(’updated on’). collect ()

df release = ft.read csv(

path to sp release

). exclude(’ release img ’ , ’ updated on’). collect ()

df audio features = ft.read csv(

path to audio features

). exclude(’duration ms’, ’updated on’). collect ()

Listing 149: Daft In Memory Data Ingestion Read Code

# Merge/Join the columns to create one large table

df = df track. join ( df artist track , on=’track id’, how=’left’). collect ()

df = df.join( df artist , on=’ artist id ’ , how=’left’ ). collect ()

df = df. join( df release , on=’release id ’ , how=’left’). collect ()

df = df.join( df audio features , on=’isrc’ , how=’left’). collect ()

# Drop last unnecessary columns

(

df = df.exclude(’ isrc ’ , ’track number’, ’ release id ’ ,

’disc number’, ’preview url’ , ’ artist id ’ , ’upc’). collect ()

)

Listing 150: Daft In Memory Data Ingestion Merge/Join Code

# Turn miliseconds into seconds

df = df.with columns(

{

’duration’: ft . col( ’duration ms’) ∗10∗∗−3

}

).exclude(’duration ms’). collect ()

Listing 151: Daft In Memory Data Wrangling Duration Code

# Turn string into boolean

df = df.with columns(

{

’ explicit ’ : ft . col(’ explicit ’ ) == ’t’

}

). collect ()

Listing 152: Daft In Memory Data Wrangling Boolean Code

# Fill dates and turn to datetime

df = df.with columns(

{

’ release date ’ : ( ft . col( ’ release date ’ ). str .count matches(’−’) == 0)

. if else ( ft . col( ’ release date ’ ) + ’−01−01 00:00:00’,

( ft . col( ’ release date ’ ). str .count matches(’−’) == 1)

Mathias Asplin Big data analytics and processing

. if else ( ft . col( ’ release date ’ ) + ’−01 00:00:01’,

ft . col(’ release date ’ ) + ’ 00:00:02 ’ ))

}

)

df = df.with columns(

{

’ release date ’ : ft . col( ’ release date ’ ). str .to datetime(’%Y−%m−%d %H:%M:%S’)

}

). collect ()

Listing 153: Daft In Memory Data Wrangling Datetime Code

# Find all artist names for tracks

# and combine them then join it with the original dataframe

df = df.groupby(’track id’ ). agg list (

ft . col (’artist name’ ). alias ( ’ artist name all ’ )

).with columns(

{

’ artist name all ’ : ft . col(’ artist name all ’ ). list . join( delimiter=’,’)

}

). join(

df, on=’track id’

). collect ()

Listing 154: Daft In Memory Data Wrangling Artists Code

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/DaftInMemory.csv’

path pyarrow = ’path/to/input/DaftInMemory pa.parquet’

Listing 155: Daft In Memory Data I/O Paths Code

# Write to CSV

df. write csv(path csv)

Listing 156: Daft In Memory Data I/O Write CSV Code

# Write to parquet (pyarrow)

df.write parquet(path pyarrow, compression=’snappy’)

Listing 157: Daft In Memory Data I/O Write Parquet Code

# Read from csv

df = ft.read csv(path csv, delimiter=’,’ ). collect ()

Listing 158: Daft In Memory Data I/O Read CSV Code

# Read from parquet (pyarrow)

df = ft.read parquet(path pyarrow).collect()

Listing 159: Daft In Memory Data I/O Read Parquet Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

# Since Daft is lazy only it is only tested for the lazy version .

Listing 160: Daft In Memory Data Analytics/Exploration Task 1 Code

Mathias Asplin Big data analytics and processing

# Find the artists who has made the most tracks

# Since Daft is lazy only it is only tested for the lazy version .

Listing 161: Daft In Memory Data Analytics/Exploration Task 2 Code

# Find the artist that has made the most popular songs on average

# Since Daft is lazy only it is only tested for the lazy version .

Listing 162: Daft In Memory Data Analytics/Exploration Task 3 Code

# Find the artist who on average makes the most danceable tracks

# Since Daft is lazy only it is only tested for the lazy version .

Listing 163: Daft In Memory Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

# Since Daft is lazy only it is only tested for the lazy version .

Listing 164: Daft In Memory Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

# Since Daft is lazy only it is only tested for the lazy version .

Listing 165: Daft In Memory Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

# Since Daft is lazy only it is only tested for the lazy version .

Listing 166: Daft In Memory Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

# Since Daft is lazy only it is only tested for the lazy version .

Listing 167: Daft In Memory Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

# Since Daft is lazy only it is only tested for the lazy version .

Listing 168: Daft In Memory Data Analytics/Exploration Task 9 Code

# Find the artist with the most popular tracks on average per year

df [ ’ year’ ] = df. release date . dt.year

# Since Daft is lazy only it is only tested for the lazy version .

Listing 169: Daft In Memory Data Analytics/Exploration Task 10 Code

# Find the artist with the most track credits

# Since Daft is lazy only it is only tested for the lazy version .

Listing 170: Daft In Memory Data Analytics/Exploration Task 11 Code

Mathias Asplin Big data analytics and processing

# Find the artist most active throughout the years

# Since Daft is lazy only it is only tested for the lazy version .

Listing 171: Daft In Memory Data Analytics/Exploration Task 12 Code

# Find the album type distribution

# Since Daft is lazy only it is only tested for the lazy version .

Listing 172: Daft In Memory Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

# Since Daft is lazy only it is only tested for the lazy version .

Listing 173: Daft In Memory Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

# Since Daft is lazy only it is only tested for the lazy version .

Listing 174: Daft In Memory Data Analytics/Exploration Task 15 Code

# Find the median of diﬀerent measurements

# Since Daft is lazy only it is only tested for the lazy version .

Listing 175: Daft In Memory Data Analytics/Exploration Task 16 Code

# Find the standard deviation of diﬀerent measurements

# Since Daft is lazy only it is only tested for the lazy version .

Listing 176: Daft In Memory Data Analytics/Exploration Task 17 Code

# Find out what makes a track popular

# Since Daft is lazy only it is only tested for the lazy version .

Listing 177: Daft In Memory Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

H Appendix Daft Out Of Core Code

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/ sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

df track = ft.read csv(

path to sp track

).exclude(’updated on’)

df artist track = ft.read csv(

path to sp artist track

).exclude(’updated on’)

df artist = ft.read csv(

path to sp artist

).exclude(’updated on’)

df release = ft.read csv(

path to sp release

). exclude(’ release img ’ , ’ updated on’)

df audio features = ft.read csv(

path to audio features

).exclude(’duration ms’, ’updated on’)

# Merge/Join the columns to create one large table

df = df track. join ( df artist track , on=’track id’, how=’left’)

df = df.join( df artist , on=’ artist id ’ , how=’left’)

df = df. join( df release , on=’release id ’ , how=’left’)

df = df.join( df audio features , on=’isrc’ , how=’left’)

# Drop last unnecessary columns

(

df = df.exclude(’ isrc ’ , ’track number’, ’ release id ’ ,

’disc number’, ’preview url’ , ’ artist id ’ , ’upc’). collect ()

)

Listing 178: Daft Out Of Core Data Ingestion All Code

# Turn miliseconds into seconds

df = df.with columns(

{

’duration’: ft . col( ’duration ms’) ∗10∗∗−3

}

).exclude(’duration ms’)

# Turn string into boolean

df = df.with columns(

{

’ explicit ’ : ft . col(’ explicit ’ ) == ’t’

}

)

# Fill dates and turn to datetime

df = df.with columns(

{

’ release date ’ : ( ft . col( ’ release date ’ ). str .count matches(’−’) == 0)

. if else ( ft . col( ’ release date ’ ) + ’−01−01 00:00:00’,

( ft . col( ’ release date ’ ). str .count matches(’−’) == 1)

. if else ( ft . col( ’ release date ’ ) + ’−01 00:00:01’,

ft . col(’ release date ’ ) + ’ 00:00:02 ’ ))

}

)

df = df.with columns(

{

’ release date ’ : ft . col( ’ release date ’ ). str .to datetime(’%Y−%m−%d %H:%M:%S’)

}

)

Mathias Asplin Big data analytics and processing

# Find all artist names for tracks

# and combine them then join it with the original dataframe

df = df.groupby(’track id’ ). agg list (

ft . col (’artist name’ ). alias ( ’ artist name all ’ )

).with columns(

{

’ artist name all ’ : ft . col(’ artist name all ’ ). list . join( delimiter=’,’)

}

). join(

df, on=’track id’

). collect ()

Listing 179: Daft Out Of Core Data Wrangling Duration Code

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/DaftOutOfCore’

path pyarrow = ’path/to/input/DaftOutOfCore pa’

Listing 180: Daft Out Of Core Data I/O Paths Code

# Write to CSV

df. write csv(path csv)

Listing 181: Daft Out Of Core Data I/O Write CSV Code

# Write to parquet (pyarrow)

df.write parquet(path pyarrow, compression=’snappy’)

Listing 182: Daft Out Of Core Data I/O Write Parquet Code

# Read from csv

df = ft.read csv(path csv, delimiter=’,’ ). collect ()

Listing 183: Daft Out Of Core Data I/O Read CSV Code

# Read from parquet (pyarrow)

df = ft.read parquet(f’{path pyarrow}/∗.parquet’).collect ()

Listing 184: Daft Out Of Core Data I/O Read Parquet Code

# Lazy read from parquet

path to parquet = ’path/to/parquet/∗.parquet’

df = ft.read parquet(path to parquet)

Listing 185: Daft Out Of Core Data Analytics/Exploration Lazy Read Parquet Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

df. ﬁlter (

(˜ft . col( ’ artist name all ’ ). str .contains(( ’ |’ ). join ( artists ))) &

( ft . col( ’ release date ’ ) >= datetime(2010, 1, 1)) &

( ft . col( ’ release date ’ ) <datetime(2020, 1, 1)) &

(˜ft . col( ’ explicit ’ )) &

( ft . col (’ duration’) >= 3∗60) &

(( ft . col (’ danceability ’) >= 0.8) |( ft . col (’energy’) >= 0.8)) &

( ft . col (’ tempo’) >= df.select(’tempo’).agg(

ft . col( ’tempo’).mean()

). collect (). to pylist ()[0][ ’tempo’]) &

Mathias Asplin Big data analytics and processing

( ft . col (’ liveness ’) <= 0.2)

). sort(’popularity’ , desc=True).limit(10). collect ()

Listing 186: Daft Out Of Core Data Analytics/Exploration Task 1 Code

# Find the artists who has made the most tracks

df.groupby(’artist name’).agg(

ft . col( ’artist name’ ). count(). alias ( ’count’)

). sort( ’count’, desc=True).collect()

Listing 187: Daft Out Of Core Data Analytics/Exploration Task 2 Code

# Find the artist that has made the most popular songs on average

df.groupby(’artist name’).agg(

ft . col( ’popularity’ ). mean().alias( ’most popular’)

). sort( ’most popular’, desc=True).collect()

Listing 188: Daft Out Of Core Data Analytics/Exploration Task 3 Code

# Find the artist who on average makes the most danceable tracks

df.groupby(’artist name’).agg(

ft . col(’ danceability ’ ). mean().alias( ’best dance’)

). sort( ’best dance’, desc=True).collect()

Listing 189: Daft Out Of Core Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

df.groupby(’artist name’).agg(

ft . col (’duration’ ). mean().alias( ’long song’)

). sort( ’ long song’ , desc=True).collect()

Listing 190: Daft Out Of Core Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

df.groupby(’artist name’).agg(

ft . col( ’valence’ ). mean().alias( ’happy song’)

). sort( ’happy song’, desc=True).collect()

Listing 191: Daft Out Of Core Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

df.groupby(’artist name’).agg(

ft . col(’label name’).approx count distinct (). alias (’ collaborations ’)

). sort( ’ collaborations ’ , desc=True).collect()

Listing 192: Daft Out Of Core Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

df. ﬁlter ( ft . col(’ explicit ’ )). groupby(’artist name’).agg(

ft . col( ’artist name’ ). count(). alias ( ’length’ )

). sort(’length’ , desc=True).collect()

Listing 193: Daft Out Of Core Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

df.with columns(

{

’year’ : ft . col(’ release date ’ ). dt.year()

}

).groupby(’year’). agg(

ft . col( ’popularity’ ). mean().alias( ’most popular year’)

). sort(

Mathias Asplin Big data analytics and processing

’most popular year’, desc=True

). collect ()

Listing 194: Daft Out Of Core Data Analytics/Exploration Task 9 Code

# Find the artist with the most popular tracks on average per year

df [ ’ year’ ] = df. release date . dt.year

df.with columns(

{

’year’ : ft . col(’ release date ’ ). dt.year()

}

).groupby(’year’, ’artist name’ ). agg(

ft . col ( ’popularity’ ). mean().alias( ’ most popular artist year ’ )

). sort(

[ ’year’ , ’ most popular artist year ’ ], [ True, True]

). collect ()

Listing 195: Daft Out Of Core Data Analytics/Exploration Task 10 Code

# Find the artist with the most track credits

# Unique column subset is not yet implemented.

# https://github.com/Eventual−Inc/Daft/issues/3181

df. select (

ft . col (’ track title ’ ),

( ft . col (’ artist name all ’ ). str .count matches(’,’) + 1). alias ( ’ artist credits ’ )

). sort( ’ artist credits ’ , desc=True).collect()

Listing 196: Daft Out Of Core Data Analytics/Exploration Task 11 Code

# Find the artist most active throughout the years

df.groupby(’artist name’).agg(

ft . col( ’ release date ’ ). max().alias( ’max’),

ft . col( ’ release date ’ ). min(). alias ( ’min’)

).with columns(

{

’ diﬀ ’ : ft . col(’max’) −ft.col(’min’)

}

). select (’artist name’ , ’ diﬀ ’ ). sort( ’ diﬀ ’ , desc=True).collect()

Listing 197: Daft Out Of Core Data Analytics/Exploration Task 12 Code

# Find the album type distribution

df.groupby(’album type’).agg(

ft . col ( ’album type’).count(). alias ( ’ len ’ )

). select (

ft . col( ’album type’),

ft . col ( ’ len ’ ) / df . select ( ’ album type’)

.count(). collect (). to pylist ()[0][ ’count’]

). collect ()

Listing 198: Daft Out Of Core Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

df.groupby(’album type’).agg(ft.col(’popularity’ ). mean()).collect ()

Listing 199: Daft Out Of Core Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

(

df.mean(’danceability’, ’energy’, ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’ ). collect ()

)

Listing 200: Daft Out Of Core Data Analytics/Exploration Task 15 Code

Mathias Asplin Big data analytics and processing

# Find the median of diﬀerent measurements

# Not implemented

Listing 201: Daft Out Of Core Data Analytics/Exploration Task 16 Code

# Find the standard deviation of diﬀerent measurements

(

df.stddev(’danceability ’ , ’energy’, ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’ ). collect ()

)

Listing 202: Daft Out Of Core Data Analytics/Exploration Task 17 Code

# Find out what makes a track popular

df.groupby(’popularity’). agg(

ft . col(’ danceability ’ ). mean(),

ft . col ( ’energy’ ). mean(),

ft . col( ’ liveness ’ ). mean(),

ft . col( ’speechiness’ ). mean(),

ft . col( ’tempo’).mean(),

ft . col( ’valence’ ). mean(),

ft . col (’duration’ ). mean()

). sort(’popularity’ , desc=True).collect()

Listing 203: Daft Out Of Core Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

I Appendix Dask Code

path to sp track = ’path/to/sp track.csv’

path to sp artist track = ’path/to/ sp artist track .csv’

path to sp artist = ’path/to/sp artist . csv’

path to sp release = ’path/to/sp release.csv’

path to audio features = ’path/to/audio features.csv’

# Read in all of the data and drop unnecessary columns

df track = dd.read csv(

path to sp track, engine=’pyarrow’, dtype backend=’pyarrow’

).drop(’updated on’, axis=1)

df artist track = dd.read csv(

path to sp artist track , engine=’pyarrow’, dtype backend=’pyarrow’

).drop(’updated on’, axis=1)

df artist = dd.read csv(

path to sp artist , engine=’pyarrow’, dtype backend=’pyarrow’

).drop(’updated on’, axis=1)

df release = dd.read csv(

path to sp release , engine=’pyarrow’, dtype backend=’pyarrow’

). drop([ ’ release img ’ , ’updated on’], axis=1)

df audio features = dd.read csv(

path to audio features , engine=’pyarrow’, dtype backend=’pyarrow’

).drop([’duration ms’, ’updated on’], axis=1)

# Merge/Join the columns to create one large table

df = df track.merge( df artist track , on=’track id’, how=’left’)

df = df.merge( df artist , on=’artist id ’ , how=’left’)

df = df.merge(df release , on=’release id ’ , how=’left’)

df = df.merge(df audio features, on=’isrc’ , how=’left’)

# Drop last unnecessary columns

(

df = df.drop([’ isrc ’ , ’track number’, ’ release id ’ ,

’disc number’, ’preview url’ , ’ artist id ’ , ’upc’ ], axis=1)

)

# Collect as a Pandas dataframe

df = df. repartition ( npartitions=1).compute()

Listing 204: Dask Data Ingestion All Code

df = dd.from pandas(df, npartitions=8)

# Turn miliseconds into seconds

df[ ’duration’] = df.duration ms ∗10∗∗−3

df = df.drop(’duration ms’, axis=1)

# Turn string into boolean

df[ ’ explicit ’ ] = df. explicit == ’t’

# Fill dates and turn to datetime

df[ ’ release date ’ ] = df. release date .apply(

lambda x: x + ’−01−01 00:00:00’ if x.count(’−’) == 0 else

x + ’−01 00:00:01’ if x.count(’−’) == 1 else

x + ’ 00:00:02 ’ , meta=(’release date’, ’ str ’ ))

df[ ’ release date ’ ] = dd.to datetime(df.release date , format=’mixed’, errors=’coerce’)

# Find all artist names for tracks

# and combine them then join it with the original dataframe

df g = df.groupby(’track id’ )[ ’artist name’ ]. agg( list ). reset index ()

df g = df g.rename(columns={’artist name’: ’artist name all’})

df = df.merge(df g, on=’track id’)

df[ ’ artist name all ’ ] = df. artist name all . str . join(’ , ’ )

df = df.compute()

Listing 205: Dask Data Wrangling Duration Code

Mathias Asplin Big data analytics and processing

base path = ’path/to/ﬁles/∗.parquet’

path csv = ’path/to/output/Dask’

path pyarrow = ’path/to/input/Dask pa’

Listing 206: Dask Data I/O Paths Code

# Write to CSV

df . to csv(path csv, sep=’; ’)

Listing 207: Dask Data I/O Write CSV Code

# Write to parquet (pyarrow)

df.to parquet(path pyarrow, engine=’pyarrow’)

Listing 208: Dask Data I/O Write Parquet Code

# Read from csv

df = dd.read csv(f’{path csv}/∗.part’, sep=’;’ , assume missing=True).compute()

Listing 209: Dask Data I/O Read CSV Code

# Read from parquet (pyarrow)

df = dd.read parquet(f’{path pyarrow}/∗.parquet’, engine=’pyarrow’).compute()

Listing 210: Dask Data I/O Read Parquet Code

# Lazy read from parquet

path to parquet = ’path/to/parquet/∗.parquet’

df = dd.read parquet(f’{path pyarrow}/∗.parquet’, engine=’pyarrow’)

Listing 211: Dask Data Analytics/Exploration Lazy Read Parquet Code

artists = [’Travis Scott’ , ’Beyonce’, ’Juice WRLD’, ’Tyler, The Creator’]

# Find top 10 popular tracks except the artists above with tracks

# between 2010 and 2020 that does not contain profanity,

# is more than 3 minutes long, is good at dancing to or has good energy,

# with a higher tempo than mean, that is not performed live.

df[(˜df. artist name all . str .contains(( ’ |’ ). join ( artists ), case=True, na=False)) &

(df. release date >= datetime(2010, 1, 1)) &

(df. release date <datetime(2020, 1, 1)) &

(˜df . explicit ) &

(df.duration >= 3∗60) &

((df. danceability >= 0.8) |(df .energy >= 0.8)) &

(df.tempo >= df.tempo.mean()) &

(df. liveness <= 0.2)

]. sort values ( ’popularity’ , ascending=False).head(10)

Listing 212: Dask Data Analytics/Exploration Task 1 Code

# Find the artists who has made the most tracks

df.groupby(

’artist name’ ). size (). reset index (). rename(columns={0: ’len’}

). sort values ( ’ len ’ , ascending=False).compute()

Listing 213: Dask Data Analytics/Exploration Task 2 Code

# Find the artist that has made the most popular songs on average

df.groupby(’artist name’)[ ’popularity’ ]. mean().reset index (). sort values (

’popularity’ , ascending=False

).compute()

Mathias Asplin Big data analytics and processing

Listing 214: Dask Data Analytics/Exploration Task 3 Code

# Find the artist who on average makes the most danceable tracks

df.groupby(’artist name’)[ ’ danceability ’ ]. mean().reset index (). sort values (

’ danceability ’ , ascending=False

).compute()

Listing 215: Dask Data Analytics/Exploration Task 4 Code

# Find the artist that makes the longest tracks on average

df.groupby(’artist name’)[ ’duration’ ]. mean().reset index (). sort values (

’duration’, ascending=False

).compute()

Listing 216: Dask Data Analytics/Exploration Task 5 Code

# Find the artist who makes the happiest tracks on average

df.groupby(’artist name’)[ ’ valence’ ]. mean().reset index (). sort values (

’valence’ , ascending=False

).compute()

Listing 217: Dask Data Analytics/Exploration Task 6 Code

# Find the artist that collaborates with most record labels on average

df .groupby(’artist name’ )[ ’ label name’ ]. unique().apply(

len , meta=(’label name’, ’int64’)

). reset index (). sort values (’label name’, ascending=False).compute()

Listing 218: Dask Data Analytics/Exploration Task 7 Code

# Find the artist that has the most tracks with profanity

df[ df. explicit ]. groupby(’artist name’). size (). reset index (). rename(

columns={0: ’profanity’}

). sort values ( ’ profanity ’ , ascending=False).compute()

Listing 219: Dask Data Analytics/Exploration Task 8 Code

# Find the year with most popular tracks on average

df [ ’ year’ ] = df. release date . dt.year

df.groupby(’year’)[ ’popularity’ ]. mean().reset index (). sort values (

’popularity’ , ascending=False

).compute()

Listing 220: Dask Data Analytics/Exploration Task 9 Code

# Find the artist with the most popular tracks on average per year

df [ ’ year’ ] = df. release date . dt.year

(

df.groupby([’year’ , ’artist name’ ])[ ’popularity’ ]

.mean()

. reset index ()

. sort values ( ’popularity’ , ascending=False)

.drop duplicates(’year’ , keep=’ﬁrst ’ )

. sort values ( ’popularity’ , ascending=False)

).compute()

Listing 221: Dask Data Analytics/Exploration Task 10 Code

# Find the artist with the most track credits

(

Mathias Asplin Big data analytics and processing

df m = df.drop duplicates(’ track id ’ )[ ’ artist name all ’ ]

.str .count(’ , ’ ). reset index ()

)

df m = df m.rename(columns={’artist name all’: ’artist credits ’})

df temp = df.reset index (). merge(df m, on=’index’)[

[ ’ track title ’ , ’ artist credits ’ ]

]

df temp[’ artist credits ’ ] = df temp. artist credits + 1

df temp.sort values (’ artist credits ’ , ascending=False).head(10)

Listing 222: Dask Data Analytics/Exploration Task 11 Code

# Find the artist most active throughout the years

df.groupby(’artist name’)[ ’ release date ’ ]. apply(lambda x: x.max() −x.min(),

meta=(’release date’, ’ f8 ’ )

). reset index (). sort values (’ release date ’ , ascending=False).compute()

Listing 223: Dask Data Analytics/Exploration Task 12 Code

# Find the album type distribution

df temp = df.groupby(’album type’).size(). reset index (). rename(

columns={0: ’percentage’}

)

df temp[’percentage’] = (df temp[’percentage’] / df temp[’percentage’ ]. sum()) ∗100

df temp.compute()

Listing 224: Dask Data Analytics/Exploration Task 13 Code

# Find the album type that is the most popular on average

df.groupby(’album type’)[’popularity’ ]. mean().reset index ().compute()

Listing 225: Dask Data Analytics/Exploration Task 14 Code

# Find the mean of diﬀerent measurements

(

df[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’]

]. mean().reset index().rename(columns={0: ’mean’}).compute()

)

Listing 226: Dask Data Analytics/Exploration Task 15 Code

# Find the median of diﬀerent measurements

(

df[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’]

]. median approximate().reset index().rename(columns={0: ’median’}).compute()

)

Listing 227: Dask Data Analytics/Exploration Task 16 Code

# Find the standard deviation of diﬀerent measurements

(

df[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’]

]. std (). reset index (). rename(columns={0: ’std’}).compute()

)

Listing 228: Dask Data Analytics/Exploration Task 17 Code

Mathias Asplin Big data analytics and processing

# Find out what makes a track popular

(

df.groupby(’popularity’)[

[ ’ danceability ’ , ’energy’ , ’ liveness ’ ,

’speechiness’ , ’tempo’, ’valence’ , ’duration’]

]. mean()

. reset index (). rename(columns={0: ’popularity’}

). sort values ( ’popularity’ , ascending=False)

).compute()

Listing 229: Dask Data Analytics/Exploration Task 18 Code

Mathias Asplin Big data analytics and processing

J Appendix Graphs Code

import time

import duckdb

import daft as ft

import numpy as np

import pandas as pd

import polars as pl

import pyarrow as pa

import dask.dataframe as dd

from datetime import datetime

import matplotlib.pyplot as plt

import seaborn as sns

from functools import reduce

def ﬁll columns ( df tables ):

for index, table in enumerate(df tables):

col name = table.columns[0]. split ( ’ ’ )

for iin range(1, 19):

if f ’{col name[0]} {i} {col name[2]}’not in list (table . columns):

table [ f ’{col name[0]} {i} {col name[2]}’] = np.nan

col order = [f ’{col name[0]} {i} {col name[2]}’for iin range(1, 19)]

df tables [index] = table[ col order ]

return df tables

def ﬁll libraries ( df tables , libraries ):

for index, table in enumerate(df tables):

for library in libraries :

if library not in table .index:

table . loc [ library ] = pd.Series(dtype=’ﬂoat64’)

df tables [index] = table.sort index ()

return df tables

def transform dataframes( df list , libraries , values ):

df tables = [pd.pivot table(df, \

values=values, index=’Library’, columns=’Operation’)

for df in df list ]

if len( df tables ) >1:

df tables = ﬁll columns ( df tables )

df tables = ﬁll libraries ( df tables , libraries )

return df tables

schema = {

’Library’: str ,

’Operation’: str ,

’Category’: str ,

’WallTime’: ﬂoat ,

’CpuTime’: ﬂoat

}

names = list(schema.keys())

df tables = []

# Data ingestion

df dataingestion = pd.read csv( \

’DataIngestion.csv’, sep=’;’ , names=names, dtype=schema, header=None).round(2)

df dataingestion [ ’Ratio’] = df dataingestion.CpuTime / df dataingestion.WallTime

df tables += transform dataframes( \

[ df dataingestion ], df dataingestion .Library.unique(), ’WallTime’)

# Data wrangling

df datawrangling = pd.read csv( \

’DataWrangling.csv’, sep=’;’, names=names, dtype=schema, header=None).round(2)

df datawrangling[’Ratio’] = df datawrangling.CpuTime / df datawrangling.WallTime

df tables += transform dataframes( \

[df datawrangling], df datawrangling.Library.unique(), ’WallTime’)

# Data io

Mathias Asplin Big data analytics and processing

df dataio = pd.read csv( \

’DataIO.csv’, sep=’;’ , names=names, dtype=schema, header=None).round(2)

df dataio [ ’Ratio’] = df dataio.CpuTime / df dataio.WallTime

df tables += transform dataframes( \

[ df dataio ], df dataio .Library.unique(), ’WallTime’)

# Data exploration

df dataexploration = pd.read csv( \

’DataExploration.csv’, sep=’;’ , names=names, dtype=schema, header=None).round(2)

df dataexploration [ ’Ratio’] = df dataexploration.CpuTime / df dataexploration.WallTime

df bigdata = pd.read csv(’BigData.csv’, \

sep=’;’ , names=names, dtype=schema, header=None).round(2)

df bigdata[ ’Ratio’] = df bigdata.CpuTime / df bigdata.WallTime

df medium = df bigdata[df bigdata.Operation.str.contains(’medium’, case=False, regex=True)].copy()

df large = df bigdata[(df bigdata.Operation.str.contains(’ large ’ , case=False, regex=True)) &

(˜df bigdata.Operation.str.contains( ’very’ , case=False, regex=True))].copy()

df verylarge = df bigdata[df bigdata.Operation.str. contains( ’ verylarge ’ , case=False, regex=True)].copy()

df many = df bigdata[df bigdata.Operation.str.contains(’many’, case=False, regex=True)].copy()

df list = [df dataexploration, df medium, df large, df verylarge , df many]

df tables += transform dataframes(df list, df dataexploration.Library.unique(), ’WallTime’)

for iin range(8):

df tables [ i ] = df tables [ i ]. round(3)

df tables [ i ]. columns = df tables[i ]. columns.str. replace( ’ ’ , ’ ’ )

df tables [ i ]. index = df tables[ i ]. index. str .replace( ’ ’ , ’ ’ )

for iin range(3, 8):

df tables [ i ]. columns = [f’Task {i}’for iin range(1, 19)]

df sub table = df tables [0]. loc [[ ’Pandas python’, ’Pandas c’, ’Pandas numpy’]]

df sub table = df sub table. ﬁllna ( df sub table .min())

df dataio = df tables [0]. copy()

df dataio . loc [[ ’Pandas python’, ’Pandas c’, ’Pandas numpy’]] = df sub table

df read merge sum = df dataio.sum(axis=1).drop(

[ ’Pandas c’, ’Pandas python’]

). reset index (name = ’Read Merge’)

df data wrangling sum = df tables[1].sum(axis=1).reset index(name = ’Data Wrangling’)

df read write csv sum = df tables [2][

[ ’Read csv’, ’Write csv’ ]

]. sum(axis=1, min count=1).reset index(name = ’Read Write Csv’)

df read write parquet pa sum = df tables [2]. drop(

[ ’Polars outofcore’ ]

)[

[ ’Read pyarrow parquet’, ’Write pyarrow parquet’]

]. sum(axis=1, min count=1).reset index(name = ’R/W Parquet (PA)’)

df read write parquet r sum = df tables [2]. loc [[ ’Polars inmem’, ’Polars outofcore’ ]][

[ ’Read rust parquet’, ’Write rust parquet’]

]. sum(axis=1).reset index(name = ’R/W Parquet (R)’)

df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),

[

df read merge sum,

df data wrangling sum,

df read write csv sum,

df read write parquet pa sum,

df read write parquet r sum

]). sort values (by = ’Library’)

df tables .append(df merged.set index(’Library’))

df merged = df merged.set index(’Library’)

df largest = df merged.apply(pd.Series.nsmallest, n=1, axis=0).reindex(df merged.index, axis=0)

df merged = df merged.reset index(’Library’)

df largest [˜ df largest . isnull ()] = 1

df tables .append(df largest)

df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),

Mathias Asplin Big data analytics and processing

[

df tables [3]. dropna(how=’all’, axis=0).dropna(axis=1) \

.sum(axis=1).reset index(name=’Small’),

df tables [4]. dropna(how=’all’, axis=0).dropna(axis=1) \

.sum(axis=1).reset index(name=’Medium’),

df tables [5]. dropna(how=’all’, axis=0).dropna(axis=1) \

.sum(axis=1).reset index(name=’Large’),

df tables [6]. dropna(how=’all’, axis=0).dropna(axis=1) \

.sum(axis=1).reset index(name=’Verylarge’),

df tables [7]. dropna(how=’all’, axis=0).dropna(axis=1) \

.sum(axis=1).reset index(name=’Many’)

])

df merged = df merged.set index(’Library’).reindex( df tables [3]. index.unique(), axis=0) \

. reset index (names = ’Library’)

df merged = df merged.sort values(by = ’Library’)

df tables .append(df merged.set index(’Library’))

df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’index’ ], how=’outer’, sort=True),

[

df tables [3]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \

. reset index (name = ’Small’),

df tables [4]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \

. reset index (name = ’Medium’),

df tables [5]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \

. reset index (name = ’Large’),

df tables [6]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \

. reset index (name = ’Verylarge’),

df tables [7]. dropna(how=’all’, axis=0).dropna(axis=1).idxmin().value counts() \

. reset index (name = ’Many’)

])

df merged = df merged.rename(columns={’index’: ’Library’})\

. set index( ’Library’ ). reindex( df tables [3]. index, axis=0)

df merged = df merged.sort values(by = ’Library’)

df tables .append(df merged)

schema = {

’Library’: str ,

’Operation’: str ,

’Category’: str ,

’WallTime’: ﬂoat ,

’CpuTime’: ﬂoat

}

names = list(schema.keys())

df ratio tables = []

# Data ingestion

df dataingestion = pd.read csv(’DataIngestion.csv’, \

sep=’;’ , names=names, dtype=schema, header=None).round(2)

df dataingestion [ ’Ratio’] = df dataingestion.CpuTime / df dataingestion.WallTime

df ratio tables += transform dataframes([df dataingestion], df dataingestion.Library.unique(), ’Ratio’)

# Data wrangling

df datawrangling = pd.read csv(’DataWrangling.csv’, \

sep=’;’ , names=names, dtype=schema, header=None).round(2)

df datawrangling[’Ratio’] = df datawrangling.CpuTime / df datawrangling.WallTime

df ratio tables += transform dataframes([df datawrangling], df datawrangling.Library.unique(), ’Ratio’)

# Data io

df dataio = pd.read csv(’DataIO.csv’, sep=’;’ , \

names=names, dtype=schema, header=None).round(2)

df dataio [ ’Ratio’] = df dataio.CpuTime / df dataio.WallTime

df ratio tables += transform dataframes([df dataio], df dataio.Library.unique(), ’Ratio’)

# Data exploration

df dataexploration = pd.read csv(’DataExploration.csv’, sep=’;’ , names=names, dtype=schema, header=None).round(2)

df dataexploration [ ’Ratio’] = df dataexploration.CpuTime / df dataexploration.WallTime

Mathias Asplin Big data analytics and processing

df bigdata = pd.read csv(’BigData.csv’, sep=’;’ , names=names, dtype=schema, header=None).round(2)

df bigdata[ ’Ratio’] = df bigdata.CpuTime / df bigdata.WallTime

df medium = df bigdata[df bigdata.Operation.str.contains(’medium’, case=False, regex=True)].copy()

df large = df bigdata[(df bigdata.Operation.str.contains(’ large ’ , case=False, regex=True)) &

(˜df bigdata.Operation.str.contains( ’very’ , case=False, regex=True))].copy()

df verylarge = df bigdata[df bigdata.Operation.str. contains( ’ verylarge ’ , case=False, regex=True)].copy()

df many = df bigdata[df bigdata.Operation.str.contains(’many’, case=False, regex=True)].copy()

df list = [df dataexploration, df medium, df large, df verylarge , df many]

df ratio tables += transform dataframes(df list, df dataexploration.Library.unique(), ’Ratio’)

for iin range(8):

df ratio tables [ i ] = df ratio tables [ i ]. round(3)

df ratio tables [ i ]. columns = df ratio tables [ i ]. columns.str. replace( ’ ’ , ’ ’)

df ratio tables [ i ]. index = df ratio tables [ i ]. index.str . replace( ’ ’ , ’ ’ )

for iin range(3, 8):

df ratio tables [ i ]. columns = [f’Task {i}’for iin range(1, 19)]

df sub table = df ratio tables [0]. loc [[ ’Pandas python’, ’Pandas c’, ’Pandas numpy’]]

df sub table = df sub table. ﬁllna ( df sub table .min())

df dataio = df ratio tables [0]. copy()

df dataio . loc [[ ’Pandas python’, ’Pandas c’, ’Pandas numpy’]] = df sub table

df read merge sum = df dataio.mean(axis=1).drop(

[ ’Pandas c’, ’Pandas python’]

). reset index (name = ’Read Merge’)

df data wrangling sum = df ratio tables [1]. mean(axis=1).reset index(name = ’Data Wrangling’)

df read write csv sum = df ratio tables [2][

[ ’Read csv’, ’Write csv’ ]

]. mean(axis=1).reset index(name = ’Read Write Csv’)

df read write parquet pa sum = df ratio tables [2]. drop(

[ ’Polars outofcore’ ]

)[

[ ’Read pyarrow parquet’, ’Write pyarrow parquet’]

]. mean(axis=1).reset index(name = ’R/W Parquet (PA)’)

df read write parquet r sum = df ratio tables [2]. loc [

[ ’Polars inmem’, ’Polars outofcore ’ ]

][

[ ’Read rust parquet’, ’Write rust parquet’]

]. mean(axis=1).reset index(name = ’R/W Parquet (R)’)

df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),

[

df read merge sum,

df data wrangling sum,

df read write csv sum,

df read write parquet pa sum,

df read write parquet r sum

]). sort values (by = ’Library’)

df ratio tables .append(df merged.set index(’Library’))

df merged = df merged.set index(’Library’)

df largest = df merged.apply(pd.Series.nlargest, n=1, axis=0).reindex(df merged.index, axis=0)

df merged = df merged.reset index(’Library’)

df largest [˜ df largest . isnull ()] = 1

df ratio tables .append(df largest)

df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’Library’ ], how=’outer’, sort=True),

[

df ratio tables [3]. dropna(how=’all’, axis=0).dropna(axis=1) \

.mean(axis=1).reset index(name=’Small’),

df ratio tables [4]. dropna(how=’all’, axis=0).dropna(axis=1) \

.mean(axis=1).reset index(name=’Medium’),

df ratio tables [5]. dropna(how=’all’, axis=0).dropna(axis=1) \

.mean(axis=1).reset index(name=’Large’),

df ratio tables [6]. dropna(how=’all’, axis=0).dropna(axis=1) \

.mean(axis=1).reset index(name=’Verylarge’),

Mathias Asplin Big data analytics and processing

df ratio tables [7]. dropna(how=’all’, axis=0).dropna(axis=1) \

.mean(axis=1).reset index(name=’Many’)

])

df merged = df merged.set index(’Library’).reindex( \

df ratio tables [3]. index.unique(), axis=0).reset index(names = ’Library’)

df merged = df merged.sort values(by = ’Library’)

df ratio tables .append(df merged.set index(’Library’))

df merged = reduce(lambda left, right: pd.merge(left, right ,on=[’index’ ], how=’outer’, sort=True),

[

df ratio tables [3]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \

.value counts (). reset index (name = ’Small’),

df ratio tables [4]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \

.value counts (). reset index (name = ’Medium’),

df ratio tables [5]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \

.value counts (). reset index (name = ’Large’),

df ratio tables [6]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \

.value counts (). reset index (name = ’Verylarge’),

df ratio tables [7]. dropna(how=’all’, axis=0).dropna(axis=1).idxmax() \

.value counts (). reset index (name = ’Many’)

])

df merged = df merged.rename(columns={’index’: ’Library’})\

. set index( ’Library’ ). reindex( df ratio tables [3]. index, axis=0)

df merged = df merged.sort values(by = ’Library’)

df ratio tables .append(df merged)

df tables += df ratio tables [8:]

df tables [0] = df tables [0][

[ ’Read single’ , ’Merge single’ , ’Read merge’]

]

df tables [1] = df tables [1][

[ ’ Artists single ’ , ’Boolean single’ , ’Datetime single’ , ’Duration single’ , ’Data wrangling all’ ]

]

df tables += [df tables[4]/ df tables [3], df tables [5]/ df tables [4], \

df tables [6]/ df tables [5], df tables [7]/ df tables [5]]

df tables += [pd.DataFrame({’Library’: [’Pandas’, ’Polars’, ’Polars (R)’, ’Duckdb’, ’Daft’, ’Dask’],

’Parquet’: [515, 515, 623, 725, 515, 742],

’Csv’: [1741, 1728, np.nan, 1674, 2032, 1734]}). set index( ’Library’ )]

table max = [50, 50, 50, 50, 100, 500, 1000, 1000, 100, 1, 5000, 15, 5, 1, 5, 15, 30, 100, 100, 100, 2032]

for table , size max in zip( df tables , table max):

ﬁg = plt. ﬁgure ( ﬁgsize =[i∗6for iin (3,2)])

ax = sns.heatmap(table.T, annot=True, lw=0, cbar=False, fmt=’g’, \

annot kws={’fontsize’: 24}, vmax=size max)

ax.tick params(axis=’both’, labelsize =18)

ax.set (xlabel=None)

ax.set (ylabel=None)

plt . xticks(rotation=30)

plt . yticks(rotation=0)

plt . tight layout ()

plt .show()

Listing 230: Graphs Code

0 views·100 pages

VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF Free Download

VERTICAL SCALING FOR BIG DATA ANALYTICS AND PROCESSING - A CASE STUDY PDF free Download. Think more deeply and widely.

Uploaded by Linda Lewis on 4/10/2026

/100

100%