
The Faculty of Health, Science and Technology

Computer Science

Pontus Sjöberg, Lina Vilhelmsson

Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi

Bachelor’s Project

2020:06


© 2020 The author(s) and Karlstad University

This report is submitted in partial fulfillment of the requirements for the Bachelor’s degree in Computer Science. All material in this report which is not my own work has been identified and no material is included for which a degree has previously been conferred.

Pontus Sjöberg

Lina Vilhelmsson

Approved, June 01, 2020

Advisor: Prof. Andreas Kassler

Examiner: Per Hurtig


Abstract

In the last few years, the popularity of Industrial IoT has grown considerably, and it is expected to have an impact of over 14 trillion USD on the global economy by 2030. One application of Industrial IoT is using data pipelining tools to move raw data from industrial machines to data storage, where the data can be processed by analytical instruments to help optimize the industrial operations.

This thesis analyzes and evaluates a data pipeline setup for Industrial IoT built with the tool Apache NiFi. A data flow setup was designed in NiFi, which connected an SQL database, a file system, and a Kafka topic to a distributed file system.

To evaluate the NiFi data pipeline setup, three tests were conducted to see how the system performed under different workloads. The first test determined which size FlowFiles should be merged into to get the lowest latency. The second test examined whether data from the different data sources should be kept separate or merged together. The third test compared the NiFi setup with an alternative setup, which had a Kafka topic as an intermediary between NiFi and the endpoint.

The first test showed that the lowest latency was achieved when merging FlowFiles into 10 kB files. In the second test, merging FlowFiles from all three sources gave a lower latency than keeping them separate for larger merging sizes. Finally, the third test showed no significant difference between the two test setups.

Acknowledgements

We want to thank our mentor at Karlstad University, Andreas Kassler, for helping us with writing our report and guiding us through the project. We also want to thank Erik Hallin, our mentor at Uddeholm AB, for helping and guiding us with all the different tools and software we used throughout the project, and for giving us insight into how our implementation might be used in an industry context. Lastly, we want to thank Uddeholm AB for letting us do this project for them.

Contents

1 Introduction

2 Background
2.1 Introduction
2.2 Concepts
2.2.1 Industrial Internet of Things
2.2.2 Data Pipelining
2.2.3 Data Streaming
2.3 Apache Kafka
2.3.1 Topics
2.3.2 Cluster
2.3.3 Producers
2.3.4 Consumers
2.4 Apache NiFi
2.4.1 Primary Components
2.4.2 Extensions
2.4.3 Security
2.4.4 Cluster
2.4.5 Compatibility
2.5 NiFi as a Producer and Consumer for Kafka
2.5.1 MiNiFi
2.5.2 NiFi as a Producer
2.5.3 NiFi as a Consumer
2.6 Related Tools
2.6.1 Apache Airflow
2.6.2 Apache Spark
2.6.3 Apache Storm
2.6.4 Azure Data Factory
2.6.5 Logstash

3 Data Pipelining Architecture and Prototype
3.1 Introduction
3.2 Current Setup
3.3 Why Bring in NiFi?
3.4 New Pipelining Setups
3.4.1 New Setup
3.4.2 Alternative New Setup
3.5 NiFi Processors Used
3.5.1 Consuming from Kafka Topic
3.5.2 Getting Files from File System
3.5.3 Getting Data from MariaDB
3.5.4 Other Processors

4 Experimental Setup
4.1 Introduction
4.2 Additional Software Used
4.2.1 Apache Hadoop and HDFS
4.2.2 MariaDB
4.3 Compute Nodes
4.3.1 Node 1
4.3.2 Node 2
4.3.3 Node 3
4.4 Experiment Description
4.4.1 Performance Metrics
4.4.2 Test Descriptions

5 Results & Evaluation
5.1 Introduction
5.2 Results of Test 1
5.3 Results of Test 2
5.4 Results of Test 3
5.5 Conclusion of Results

6 Conclusions
6.1 Project Summary and Evaluation
6.2 Future Work

References

A Appendix
A.1 Python Script for Processing Kafka Messages
A.2 SQL Script for Loading Rows into MariaDB
A.3 Software Download Links
A.4 Raw Data
A.5 Pictures

List of Figures

2.1 A simplified view of the Kafka architecture
2.2 NiFi’s GUI
2.3 A simplified view of the NiFi architecture
2.4 A simplified view of a NiFi Cluster
3.1 The current setup at Uddeholm AB
3.2 New setup with NiFi sending data directly to HDFS
3.3 Alternative new setup with NiFi sending data to HDFS through Kafka
3.4 The processors used in NiFi for the new setup
4.1 The three compute nodes used for the experiment
4.2 NiFi data flow for the second test
4.3 NiFi data flow for alternative new setup used for the third test
5.1 Average FlowFile latency for different merging sizes
5.2 Percentage of the total average FlowFile latency made up by the time between MariaDB and NiFi, before being sent to HDFS
5.3 Average FlowFile latency for different merging sizes
5.4 Latency distribution for different merging sizes
5.5 Average FlowFile latency for different merging sizes, comparing separate and combined merging
5.6 Latency distribution for combined and separate merging
5.7 Average throughput for the two different sources with different amounts of 1 kB sources
5.8 Average FlowFile latency for the two different setups
5.9 Latency distribution for the two different setups
A.1 Full-size version of Figure 3.4
A.2 Full-size version of Figure 4.2
A.3 Full-size version of Figure 4.3

List of Tables

4.1 Intervals for achieving different merging sizes

1 Introduction

The Industrial Internet of Things, or Industrial IoT, is a subset of the Internet of Things (IoT) specific to industrial use, and it covers the machine-to-machine and industrial communication parts of IoT. [1] Industrial IoT has grown considerably in the past few years, and it is expected by some to have an impact on the global economy of over 14 trillion USD by the year 2030. [2] Industrial IoT focuses on integrating and interconnecting already existing devices, whereas “consumer” IoT (e.g. smart devices) focuses more on creating new devices. An example of an Industrial IoT application is collecting large amounts of data from industrial machines and sending this data to various analytical tools, which can then optimize the industrial operations based on how the machines are currently performing. [1, 3] One way this can be done is by the use of data pipelining tools, which are tools for moving data from one place to another. [4]

In this project, we evaluate data pipelining and data flows in the context of Industrial IoT by creating data pipelining setups in the tool Apache NiFi (or simply NiFi), and we try to find the best way to include NiFi in an architecture where data needs to be moved from several starting points into a cloud-based file system. To evaluate this setup, we also test the performance of the data flow setup in NiFi with different configurations and workloads.

Currently, there are very few scientific papers available that test the performance of a NiFi data flow. [5, 6] Therefore, the result of this project is of interest to the task provider Uddeholm AB, as they are looking into using NiFi as part of their data streaming architecture. The specific tests performed in this project are designed with Uddeholm AB in mind, to answer the questions they have about the performance of a NiFi data flow setup.

The disposition of the report is as follows:

Chapter 2 gives some background to the technologies, concepts, and tools used. The technologies and concepts described in this chapter are Industrial IoT, data pipelining, and data streaming. The tools Apache Kafka and Apache NiFi are also described in detail in this chapter, along with shorter descriptions of some alternative data pipelining tools. Chapter 3 describes the data pipelining prototype designed and implemented in the project. The ideas behind the design and the reasons for using Apache NiFi are explained. The new setup is described both in a more abstract view and more specifically by looking at how the NiFi flow was set up. Chapter 4 presents the experiments that were performed to test the NiFi implementation. A more detailed description of the setup is given, explaining the exact software that was used. The setups for the three tests that were performed are described. Chapter 5 presents the results of the tests described in Chapter 4, along with evaluations of the results. Finally, Chapter 6 gives a conclusion of the report. Possible future work is also proposed here.

2 Background

2.1 Introduction

This chapter gives a background to the technologies and the main tools used in the project. The chapter starts with part 2.2, which goes through some of the important concepts and technologies that are relevant to the project. Part 2.3 gives a description of Apache Kafka, and part 2.4 describes Apache NiFi. Part 2.5 looks at the possibilities of combining NiFi and Kafka. Part 2.6 introduces some other related tools.

2.2 Concepts

2.2.1 Industrial Internet of Things

The Industrial Internet of Things, or the Industrial IoT, is a subset of the Internet of Things (IoT), which is an umbrella term for systems of interconnected computing devices and machines that can transfer data to each other without human-to-human or human-to-computer interaction. The popularity of IoT has increased in the last few years, with the rise of “smart” consumer devices such as smartphones, the Apple Watch, and Amazon Echo. [7] This category of devices, while often colloquially called simply “IoT” devices, is only a part of all of IoT, and is classified in [1] as consumer IoT, to distinguish it from Industrial IoT. The Industrial IoT subset covers the machine-to-machine and industrial communication parts of IoT. Industrial IoT is about connecting the domains of operational technology with the domains of information technology (IT). For example, large amounts of data collected from industrial machines can be fed to analytical instruments which can help optimize the industrial operations. While consumer IoT focuses on creating new devices, Industrial IoT focuses more on already existing devices, and how to integrate and interconnect these with each other. There is also a difference in the amount of data generated by consumer IoT and Industrial IoT. Consumer IoT data volumes depend on the application, whereas Industrial IoT data is meant for analytics, which leads to Industrial IoT generating very large amounts of data, up to several terabytes of data every minute. [1, 3]

The architecture of Industrial IoT systems can be visualized as being built up of four layers. The topmost layer is the content layer, which encompasses the user interface of the system, such as a screen. The next layer is the service layer, which holds applications that can perform operations on data which then can be displayed in the content layer. Next is the network layer, which consists of physical network buses and communication protocols, which aggregate data and then transport it to the service layer. The bottom layer is the device layer, which contains the physical (hardware) components of the system, such as sensors, machines, and cyber-physical systems (CPS). [8]

2.2.2 Data Pipelining

Data pipelining is an implementation technique where instructions are carried out in parallel instead of one after the other. In a system that does not use pipelining, each item has to go through all of the instructions in the system before the next item can enter. When using pipelining, each item still has to go through all the necessary steps and instructions, but as soon as the first item moves from the first to the second instruction, the next item in line can go through the first instruction. With pipelining, the latency for each individual item going through the system stays the same, but the throughput of the system increases. [9]
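The effect on latency and throughput can be illustrated with a little arithmetic. The following is a minimal sketch, assuming an idealized pipeline with three stages that each take one time unit; the function names and the numbers are hypothetical and are not taken from the experiments in this report.

```python
def sequential_total_time(num_items: int, stage_times: list[float]) -> float:
    """Total time when an item must finish every stage before the next item may enter."""
    return num_items * sum(stage_times)


def pipelined_total_time(num_items: int, stage_times: list[float]) -> float:
    """Total time when stages overlap: the first item pays the full latency,
    and every following item finishes one slowest-stage interval later."""
    if num_items == 0:
        return 0.0
    return sum(stage_times) + (num_items - 1) * max(stage_times)


stage_times = [1.0, 1.0, 1.0]   # hypothetical: three equally long stages
num_items = 100

latency_per_item = sum(stage_times)                      # same with or without pipelining
sequential = sequential_total_time(num_items, stage_times)
pipelined = pipelined_total_time(num_items, stage_times)

print("latency per item:", latency_per_item)
print("throughput without pipelining:", num_items / sequential)
print("throughput with pipelining:", num_items / pipelined)
```

With equally long stages, each item still needs three time units to pass through the system, but the pipelined system finishes 100 items in 102 time units instead of 300.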

Another aspect of pipelining is that the output of one process (or instruction) is the input of the next process in the pipeline. Just like with a real-life pipe built up out of multiple different segments, the data pipeline brings data from the input through multiple processes which all feed into the next process in line, until the data finally reaches the output point. [4]

The latter aspect of pipelining, where the output of one process becomes the input of the next process, is the main function that a software pipelining tool provides. There are many tools on the market for data pipelining, all with different strengths and weaknesses. [10, 11, 12, 13, 14] This thesis will focus primarily on Apache NiFi and secondarily on Apache Kafka, but some alternative tools are described in part 2.6.

With the use of data pipelining tools, raw data can be moved from one system to another. Usually, this means moving data from sensors, databases, etc., and putting the data into a data lake, the cloud, or some other data storage solution, where the data is stored and analyzed. This analysis can be performed with machine learning algorithms, which can detect patterns and anomalies in the data. This is an example of how data pipelining tools can be used for Industrial IoT. [1, 15]

2.2.3 Data Streaming

Data streaming is the process of data being continuously generated by many different sources, where each data point is usually quite small (on the order of kilobytes). Streamed data can be processed with stream processing techniques, where operations are performed on each piece of data in the stream while the data is being streamed from one point to another. Streaming data is a good technique to use when the goal is to continuously analyze dynamic data and make decisions based on how the streamed data is behaving, for example with machine learning algorithms. [16]

Stream processing can be compared to batch processing, which is a method where jobs or tasks are collected and processed in batches, instead of one at a time as they arrive, as with stream processing. [17]
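The difference between the two processing styles can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical stream of sensor readings; the function names, the threshold, and the batch size are assumptions made for the example.

```python
from typing import Iterable, Iterator


def stream_process(readings: Iterable[float], threshold: float) -> Iterator[tuple[float, bool]]:
    """Handle each reading as soon as it arrives and flag it if it exceeds the threshold."""
    for value in readings:
        yield value, value > threshold


def batch_process(readings: Iterable[float], batch_size: int) -> Iterator[float]:
    """Collect readings into batches and emit one average per batch."""
    batch: list[float] = []
    for value in readings:
        batch.append(value)
        if len(batch) == batch_size:
            yield sum(batch) / len(batch)
            batch = []
    if batch:  # emit the last, possibly incomplete, batch
        yield sum(batch) / len(batch)


readings = [1.0, 5.0, 3.0]  # hypothetical sensor values
print(list(stream_process(readings, threshold=3.0)))
print(list(batch_process(readings, batch_size=2)))
```

In the streaming variant a result is available for every reading immediately, whereas the batch variant only produces a result once a batch has been filled.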

2.3 Apache Kafka

Apache Kafka is a distributed data streaming platform developed by LinkedIn in 2011. [18] Kafka uses publish-subscribe techniques to stream messages. In Kafka, the publishers are called producers, and the subscribers are known as consumers. The producers and consumers are connected through a server, in Kafka terms called a broker, which takes in the messages from the producers and sends them to the appropriate consumer(s). The messages are stored and organized in different topics within the broker. Producers send messages to the topics, and consumers read messages from the appropriate topic. [19, 20]

2.3.1 Topics

Topics are categories into which record streams are published in Kafka. Topics are multi-subscriber, which means that each topic can have zero, one, or more consumers that subscribe to the contents. Each topic has a partitioned log, where each partition is an ordered sequence of records to which new records are appended. This partitioning allows a topic to have a log that is larger than what can fit on a single server. The records in each partition are given sequential IDs called offsets, which uniquely identify each record within a partition. All published records within a topic are stored for a specified amount of time, during which they can be read by consumers. This period of time is called the retention period. After this time is up, the record is discarded in order to free up space in the topic. [19]
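The ideas of an append-only partition, offsets, and a retention period can be sketched as a small in-memory model. This is only a toy illustration and not how Kafka stores data on disk; the class names `Record` and `Partition` and their methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    offset: int        # position of the record within its partition
    timestamp: float   # time at which the record was appended
    value: str


@dataclass
class Partition:
    records: list[Record] = field(default_factory=list)
    next_offset: int = 0

    def append(self, value: str, timestamp: float) -> int:
        """Append a record to the end of the log and return its offset."""
        offset = self.next_offset
        self.records.append(Record(offset, timestamp, value))
        self.next_offset += 1
        return offset

    def read_from(self, offset: int) -> list[Record]:
        """Return all retained records at or after the given offset."""
        return [r for r in self.records if r.offset >= offset]

    def expire(self, now: float, retention_seconds: float) -> None:
        """Discard records that are older than the retention period."""
        self.records = [r for r in self.records if now - r.timestamp < retention_seconds]


partition = Partition()
partition.append("temperature=71", timestamp=0.0)
partition.append("temperature=73", timestamp=10.0)
partition.expire(now=12.0, retention_seconds=5.0)
print([r.offset for r in partition.read_from(0)])
```

Note that offsets are never reused in this model: after the oldest record has expired, the next appended record still receives the next number in the sequence.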

2.3.2 Cluster

One or more brokers make up a Kafka cluster. The basic architecture of Kafka can be seen in figure 2.1. The brokers are managed by ZooKeeper, another Apache product, which also counts as part of the cluster. Apache ZooKeeper is a service for coordinating distributed systems like Apache Kafka. ZooKeeper provides a shared hierarchical name space for coordination between distributed processes. [21]

Figure 2.1: A simpliﬁed view of the Kafka architecture

2.3.3 Producers

A producer in Kafka can publish data to whichever topic(s) it chooses. The producer also chooses which record should be assigned to which partition within the chosen topic. This can be done in different ways, either in a round-robin fashion to distribute the load evenly, or based on some semantic partitioning function (e.g. based on some key in the record). [19]
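The two partitioning strategies can be sketched as follows. This is a simplified illustration, not Kafka's own partitioner: the function names are hypothetical, and CRC32 is used here only as a convenient stand-in for whatever hash function a real client applies to the record key.

```python
import itertools
import zlib
from typing import Callable


def round_robin_partitioner(num_partitions: int) -> Callable[[], int]:
    """Return a function that cycles through the partitions to spread load evenly."""
    counter = itertools.cycle(range(num_partitions))
    return lambda: next(counter)


def key_partitioner(num_partitions: int) -> Callable[[str], int]:
    """Return a function that maps a record key to a partition, so that
    records with the same key always end up in the same partition."""
    def choose(record_key: str) -> int:
        return zlib.crc32(record_key.encode("utf-8")) % num_partitions
    return choose


next_partition = round_robin_partitioner(3)
print([next_partition() for _ in range(5)])

partition_for = key_partitioner(3)
# "machine-7" is a hypothetical record key.
print(partition_for("machine-7") == partition_for("machine-7"))
```

Round-robin balances the load without looking at the record, while key-based partitioning keeps all records for one key (for example one machine) in order within a single partition.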

2.3.4 Consumers

A Kafka consumer can subscribe to one or more topics, and get the data that is published to these topics by the producers. Consumers are divided into consumer groups, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group, so that the records are divided among the members of the group. Commonly, the consumers are grouped together as “logical subscribers”, i.e. all consumers which are subscribed to the same set of topics create a consumer group together. Each group will usually contain many consumer instances, which leads to higher fault tolerance. [19]
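How the work of one topic is shared within a consumer group, and why several instances give fault tolerance, can be sketched with a simple assignment function. This is a toy model assuming a round-robin style assignment; the function name is hypothetical, and real Kafka clients offer several assignment strategies.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Give every partition to exactly one consumer in the group."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment: dict[str, list[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two consumer instances share the partitions of the topic.
print(assign_partitions(partitions, ["consumer-a", "consumer-b"]))

# If consumer-b fails, the remaining instance takes over all partitions.
print(assign_partitions(partitions, ["consumer-a"]))
```

Each partition has one owner within the group at any time, so a record is handled once per group, and the group keeps working as long as at least one instance remains.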

2.4 Apache NiFi

Apache NiFi is an open-source platform for managing data flows and data pipelines through a system. Apache NiFi started out as NiagaraFiles, developed by the United States National Security Agency (NSA) in 2006. NiFi was donated to the Apache Software Foundation (ASF) in 2014, and in 2015 it became a Top Level Project for ASF. The design of the NiFi platform is based on the flow-based programming (FBP) model. Building on this model, NiFi offers features that include the ability to run within clusters, security with TLS encryption, extensibility, and improved usability features. [22] NiFi was created with IoT in mind, and is often used in an Industrial IoT context. [23, 24, 25, 26, 27]

NiFi’s core is a data flow that is built on the main concepts of FlowFiles, FlowFile Processors, and Connections. FlowFiles represent each object moving through the data flow. Each FlowFile has a UUID (Universally Unique Identifier), a file name, a size, and can also have content, although this is not necessary. FlowFile Processors are where the actual work in NiFi is performed. The processors get access to FlowFiles, their attributes, and their content. Processors are the building blocks of NiFi data flows, and they are the most used component in NiFi. Processors can be used, for example, to create, delete, modify, or inspect FlowFiles before sending them to the next processor or to an endpoint outside NiFi. Processors can be grouped together into Process Groups, which are similar to subnets in the FBP model. Each process group is a set of processors and their connections which can receive and send data through input and output ports. By using process groups, new components can be created simply by combining other components. Connections provide linkage between processors, and they act as queues which in turn allow processors to interact at different rates. [23]
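The relationship between FlowFiles, processors, and connections can be sketched as a small in-memory model. This is a conceptual illustration written in Python and is not NiFi's Java API; the class names, the `uppercase_processor` function, and the attribute name `processed.by` are all hypothetical.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlowFile:
    filename: str
    content: bytes = b""                               # content is optional
    attributes: dict[str, str] = field(default_factory=dict)
    flowfile_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def size(self) -> int:
        return len(self.content)


class Connection:
    """A queue between two processors, letting them work at different rates."""

    def __init__(self) -> None:
        self._queue: deque[FlowFile] = deque()

    def put(self, flowfile: FlowFile) -> None:
        self._queue.append(flowfile)

    def take(self) -> Optional[FlowFile]:
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)


def uppercase_processor(incoming: Connection, outgoing: Connection) -> bool:
    """Take one FlowFile, modify its content and attributes, and pass it on.
    Returns False when there was nothing to process."""
    flowfile = incoming.take()
    if flowfile is None:
        return False
    flowfile.content = flowfile.content.upper()
    flowfile.attributes["processed.by"] = "uppercase_processor"
    outgoing.put(flowfile)
    return True


source_queue, sink_queue = Connection(), Connection()
source_queue.put(FlowFile("reading.txt", b"abc"))
uppercase_processor(source_queue, sink_queue)
print(len(source_queue), len(sink_queue))
```

Because the connection buffers FlowFiles, an upstream processor can keep producing even when the downstream processor is temporarily slower.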

NiFi is operated through a graphical user interface (GUI) which makes it easy to visualize the flow components (see figure 2.2), unlike some other similar tools (such as Apache Kafka) which only use a command line interface. The GUI uses drag-and-drop techniques for building up the data flow with processors and connections.

Figure 2.2: NiFi’s GUI

Figure 2.3: A simpliﬁed view of the NiFi architecture

2.4.1 Primary Components

NiFi is a Java program that is run on a Java Virtual Machine (JVM) on the hosting server. NiFi’s primary components on the JVM are the Web Server, the Flow Controller, Extensions (or extension points), the FlowFile Repository, the Content Repository, and the Provenance Repository (see figure 2.3).

The Web Server hosts NiFi’s HTTP-based command and control API.

The Flow Controller is the component that provides threads for extensions to run on. It manages when extensions receive resources and when they execute. The Flow Controller acts as a broker between processors, facilitating the exchange of FlowFiles. The extensions will be described in further detail in part 2.4.2.

The FlowFile Repository is the component that allows NiFi to keep track of the state of what is known about each FlowFile that is currently active in the data flow. It acts as a Write-Ahead Log of the metadata of each FlowFile that currently exists in the system. This metadata includes the attributes of a FlowFile, a pointer to the content of the FlowFile (in the Content Repository, described later in this section), and the state of the FlowFile. Each change to a FlowFile is logged in this repository before it is executed. With this information in the Write-Ahead Log, NiFi is able to handle restarts and unexpected system failures. NiFi resumes where it was stopped by checking the Write-Ahead Log along with the snapshot of the FlowFile in the Provenance Repository. The default approach of the FlowFile Repository is to have the Write-Ahead Log on a specified disk partition, but the repository itself is pluggable (its implementation can be swapped for another).
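The principle of a write-ahead log, where every change is recorded before it is applied so that the state can be rebuilt after a failure, can be sketched as follows. This is a toy model that keeps the log in memory and uses JSON lines; it does not reflect NiFi's actual on-disk format, and the class and function names are hypothetical.

```python
import json


class WriteAheadLog:
    """An append-only list of serialized changes (kept in memory for this sketch)."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def log(self, flowfile_id: str, attributes: dict[str, str]) -> None:
        self.entries.append(json.dumps({"id": flowfile_id, "attributes": attributes}))


class FlowFileStore:
    """Holds the current metadata of each FlowFile and logs every change first."""

    def __init__(self, wal: WriteAheadLog) -> None:
        self.wal = wal
        self.state: dict[str, dict[str, str]] = {}

    def update(self, flowfile_id: str, attributes: dict[str, str]) -> None:
        self.wal.log(flowfile_id, attributes)         # 1. record the change
        self.state[flowfile_id] = dict(attributes)    # 2. only then apply it


def recover(entries: list[str]) -> dict[str, dict[str, str]]:
    """Rebuild the state after a restart by replaying the log in order."""
    state: dict[str, dict[str, str]] = {}
    for line in entries:
        entry = json.loads(line)
        state[entry["id"]] = entry["attributes"]
    return state


wal = WriteAheadLog()
store = FlowFileStore(wal)
store.update("ff-1", {"filename": "a.csv"})
store.update("ff-1", {"filename": "b.csv"})
print(recover(wal.entries) == store.state)
```

Since the log entry is written before the in-memory state changes, replaying the log after a crash never yields a state that is behind what had been applied.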

The actual content of a FlowFile which is currently in the NiFi flow is located in the Content Repository. This content is stored locally on disk and is only read into JVM memory when needed. By doing this, the producer and consumer processors do not need to hold on to objects in memory. When the content of a FlowFile is no longer in use, the content will be deleted or archived. Archiving of content can be enabled or disabled in the nifi.properties file. If the content is archived, it will remain in the Content Repository until a certain amount of time has passed, a maximum archiving time which is set in nifi.properties. If the Content Repository is taking up too much space, content that is archived or marked as no longer in use will be deleted. This limit is also configured in nifi.properties. The Content Repository is also pluggable, but by default the implementation of the Content Repository is a simple mechanism where blocks of data are stored in the file system. To help reduce contention on any single volume, more than one file storage location can be specified to spread the content over different physical partitions.

In the Provenance Repository, NiFi stores all the provenance event data (metadata). Provenance is a record of all that has happened to a FlowFile, i.e. the history of the FlowFile. A new provenance event is created every time an event occurs to a FlowFile. These provenance events give a snapshot of the FlowFile at that time. All of the attributes and pointers to the FlowFile’s content are copied and stored in the provenance event, along with the state of the FlowFile. The state can include information about the FlowFile’s relationship with other provenance events, among other things. The provenance events are stored in the Provenance Repository until they expire. The time until they expire is specified in the nifi.properties file. The implementation of the repository is pluggable, but by default it uses one or more physical disk volumes. Event data is indexed and searchable within each location. [23, 28]

2.4.2 Extensions

As mentioned previously, NiFi has a number of extension points. These extension points give developers the ability to add features to the platform so that NiFi meets their needs. The most common extension points, along with the already described processors, are the ReportingTask, the ControllerService, the FlowFilePrioritizer, and the AuthorityProvider. [29]

The ReportingTask interface is the mechanism that allows NiFi to publish metrics, monitoring information and internal NiFi states to endpoints. Endpoints can be log files, e-mail, or remote web services, for example.

The ControllerService gives shared state and functionality to processors, other ControllerServices and ReportingTasks in a single JVM. This could for example be used when loading a large data set into memory. By using a ControllerService the data set can be loaded once and given to all Processors instead of each Processor individually loading the data set. The ControllerService is also used if there is a need to establish a connection to an external server.

The FlowFilePrioritizer interface provides prioritizing and sorting for FlowFiles that are placed in a queue so that they can be processed in the order that is most effective for that use case.

The AuthorityProvider is used for determining privileges and roles for a given user, if there are any.
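As an illustration of what a prioritizer does to a queue, the following is a minimal sketch in Python, assuming that each queued FlowFile carries a hypothetical `priority` attribute where a lower number means more urgent. It is a conceptual model only and is not NiFi's FlowFilePrioritizer Java interface; the class name `PrioritizedQueue` is an assumption.

```python
import heapq
import itertools
from typing import Callable


class PrioritizedQueue:
    """A queue that hands out items in priority order instead of arrival order."""

    def __init__(self, priority_of: Callable[[dict[str, str]], int]) -> None:
        self._priority_of = priority_of
        self._heap: list[tuple[int, int, dict[str, str]]] = []
        self._arrival = itertools.count()  # tie-breaker: keep arrival order for equal priorities

    def put(self, attributes: dict[str, str]) -> None:
        entry = (self._priority_of(attributes), next(self._arrival), attributes)
        heapq.heappush(self._heap, entry)

    def take(self) -> dict[str, str]:
        return heapq.heappop(self._heap)[2]


# Hypothetical rule: a lower value of the "priority" attribute is served first.
queue = PrioritizedQueue(lambda attributes: int(attributes["priority"]))
queue.put({"filename": "routine.csv", "priority": "5"})
queue.put({"filename": "alarm.csv", "priority": "1"})
queue.put({"filename": "routine2.csv", "priority": "5"})

print([queue.take()["filename"] for _ in range(3)])
```

The rule passed to the queue is what a use case would customize, for example to let alarm data overtake routine measurements.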

2.4.3 Security

NiFi can ensure a secure connection by using SSL, SSH, HTTPS and encryption of content. [30] NiFi uses two-way SSL in the data flows from system to system but also from user to system. In system-to-system cases NiFi uses two-way SSL in every exchange of the data flow by encrypting and decrypting through shared keys on each side for both the sender and the recipient. For user-to-system security, NiFi uses two-way SSL authentication to be able to authorize a user and control the correct level of access for the user (read-only, data flow manager, or admin). [23]

NiFi also supports authenticating users with a login and password instead of the built-in two-way SSL (client certificate) authentication. This can be done by a “Login Identity Provider”, which is a pluggable mechanism for authenticating users with login and password. Which provider to use is configured in the nifi.properties file. The authentication can currently be done through Apache Knox, Kerberos, OpenID Connect, or the Lightweight Directory Access Protocol (LDAP). [31]

NiFi also provides Multi-Tenant Authorization. Multi-Tenant Authorization means that the authority level of a data flow is applied to each component of that data flow. Once authenticated, users are split up into groups of users (tenants) that can command, control or observe the data flow with different levels of authorization. If a user tries to view or modify a NiFi resource, the system will check if the user has privileges to perform that action. The privileges are determined by policies and these policies are managed by an authorizer. The authorizers are configured with two properties in the nifi.properties file, the nifi.authorizer.configuration.file property and the nifi.security.user property.

The first property specifies the configuration file, and it is here that the authorizers are defined. By default, the configuration is set so that the authorizers.xml file is selected. The authorizers.xml file is used to configure and define authorizers. The default authorizer is the StandardManagedAuthorizer. This authorizer contains a UserGroupProvider and an AccessPolicyProvider. These providers are used to load and optionally configure the users, groups and access policies. The StandardManagedAuthorizer will make all of the access decisions based on these policies.

The second property indicates which of the authorizers configured in the authorizers.xml file should be used. [31, 32]
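The policy check described above, where a user's group membership decides whether an action on a resource is allowed, can be sketched as follows. This is a conceptual model in Python and not NiFi's authorizer implementation; the class names, the resource name `/flow`, and the group and user names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    resource: str   # hypothetical resource identifier, e.g. "/flow"
    action: str     # "read" or "write"


class ToyAuthorizer:
    """Decides access from group membership and per-policy grants."""

    def __init__(self) -> None:
        self.group_members: dict[str, set[str]] = {}
        self.grants: dict[Policy, set[str]] = {}

    def add_user(self, user: str, group: str) -> None:
        self.group_members.setdefault(group, set()).add(user)

    def grant(self, policy: Policy, group: str) -> None:
        self.grants.setdefault(policy, set()).add(group)

    def is_authorized(self, user: str, resource: str, action: str) -> bool:
        allowed_groups = self.grants.get(Policy(resource, action), set())
        return any(user in self.group_members.get(group, set()) for group in allowed_groups)


authorizer = ToyAuthorizer()
authorizer.add_user("alice", "operators")
authorizer.add_user("bob", "observers")
authorizer.grant(Policy("/flow", "read"), "operators")
authorizer.grant(Policy("/flow", "read"), "observers")
authorizer.grant(Policy("/flow", "write"), "operators")

print(authorizer.is_authorized("alice", "/flow", "write"))
print(authorizer.is_authorized("bob", "/flow", "write"))
```

In this sketch both groups may observe the flow, but only members of the operators group may modify it; a user without any matching grant is denied by default.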

2.4.4 Cluster

Like Apache Kafka, NiFi can operate within a cluster, and it does this by using a Zero-Master Clustering paradigm. Zero-Master Clustering entails that each node in the cluster performs the same tasks but on different sets of data. NiFi uses an embedded ZooKeeper (see figure 2.4), which elects a node as the Cluster Coordinator. The Cluster Coordinator is in turn responsible for connecting and disconnecting nodes. Each node in the cluster reports its heartbeat and status to the Cluster Coordinator, and the Cluster Coordinator disconnects a node if it does not report any heartbeat for a set amount of time. Each cluster also has a Primary Node, on which isolated processes can be run. While the data flow as a whole runs on every node, a processor can be configured to execute either on all nodes in the cluster or only on the Primary Node. Any potential fail-over is handled automatically by ZooKeeper. [29, 31]

Figure 2.4: A simpliﬁed view of a NiFi Cluster
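The heartbeat mechanism can be sketched as a coordinator that remembers when each node last reported and disconnects nodes that have been silent for too long. This is a toy model and not NiFi's clustering code; the class name, the node names, and the timeout value are hypothetical.

```python
class ToyClusterCoordinator:
    """Tracks the last heartbeat of each node and disconnects silent nodes."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: dict[str, float] = {}

    def heartbeat(self, node_id: str, now: float) -> None:
        """Record that a node reported in at time `now` (this also connects new nodes)."""
        self.last_heartbeat[node_id] = now

    def disconnect_stale(self, now: float) -> list[str]:
        """Disconnect and return every node whose last heartbeat is older than the timeout."""
        stale = sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        )
        for node in stale:
            del self.last_heartbeat[node]
        return stale


coordinator = ToyClusterCoordinator(timeout_seconds=10.0)
coordinator.heartbeat("node1", now=0.0)
coordinator.heartbeat("node2", now=0.0)
coordinator.heartbeat("node1", now=8.0)      # node2 stays silent

print(coordinator.disconnect_stale(now=12.0))
print(sorted(coordinator.last_heartbeat))
```

A node that has been disconnected in this model simply rejoins the cluster the next time it sends a heartbeat.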

2.4.5 Compatibility

Through its processors, NiFi is compatible with many other tools. NiFi can work with several different databases, both SQL (like Apache Hive, InfluxDB, and MariaDB) and NoSQL (MongoDB, Couchbase, DynamoDB, and HBase). NiFi can read from and write to both regular file systems (both remotely through SSH and locally) and the distributed file system HDFS. [23] As described in part 2.5, NiFi can use Kafka as both a sink and a source for data.

2.5 NiFi as a Producer and Consumer for Kafka

NiFi can work as both a producer and a consumer for Kafka by implementing the processors

PublishKafka and ConsumeKafka respectively. The NiFi sub-project MiNiFi can be used

as an alternative to just using PublishKafka. [33]

2.5.1 MiNiFi

MiNiFi is a sub-project of Apache NiFi. Whereas NiFi is implemented in a data center,

MiNiFi is implemented at the "edge", i.e. close to where the data is created at sensors or

an IoT implementation. MiNiFi does not have its own UI; instead, the flows are created in

NiFi and then exported into MiNiFi. MiNiFi is much smaller than NiFi, and takes up less

than 100 MB of space, whereas NiFi needs several GB of space. [34]

2.5.2 NiFi as a Producer

NiFi as a producer will take a FlowFile as input and forward it to a topic at a Kafka broker.

The main way to have NiFi as a Kafka producer is by using the PublishKafka processor.

PublishKafka takes the contents of a FlowFile and puts it into a Kafka topic using the

KafkaProducer API. The contents of the FlowFile are converted to a Kafka message. The

PublishKafka processor includes an optional Message Demarcator. The demarcator is

used to decide where the separation between messages should be, and in this way it can

decide if the contents of a FlowFile should be sent as several messages or as one large

message. One large message will be the default approach if the Message Demarcator is not

set. PublishKafka will distribute the messages to a Kafka topic following the Round Robin

principle between partitions, depending on the number of partitions. If some messages for

a given FlowFile fail to send but some are successfully sent, the entire FlowFile will be

considered a failed FlowFile. The failed FlowFile will have a separate attribute

set which indicates the index of the last message that was successfully ACKed by Kafka. This attribute

will allow PublishKafka to re-send the messages that have not been ACKed by Kafka.

[33, 35]
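The effect of the Message Demarcator and of the round-robin distribution can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not the code of the PublishKafka processor. The function names and the sample content are made up for the illustration, and dropping empty pieces is a choice made in this sketch.

```python
from typing import List, Optional, Tuple


def split_flowfile(content: bytes, demarcator: Optional[bytes]) -> List[bytes]:
    """Return the Kafka messages that one FlowFile would be turned into."""
    if not demarcator:
        return [content]  # no demarcator set: one large message
    # Empty pieces (e.g. after a trailing demarcator) are dropped in this sketch.
    return [part for part in content.split(demarcator) if part]


def assign_round_robin(messages: List[bytes], num_partitions: int) -> List[Tuple[int, bytes]]:
    """Pair each message with a partition number, cycling through the partitions."""
    return [(i % num_partitions, message) for i, message in enumerate(messages)]


flowfile = b"reading-1\nreading-2\nreading-3\nreading-4\n"
single = split_flowfile(flowfile, None)      # whole FlowFile as one message
several = split_flowfile(flowfile, b"\n")    # one message per line
placed = assign_round_robin(several, num_partitions=3)
```

With three partitions, the four messages end up on partitions 0, 1, 2 and 0 again.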

It is also possible to implement the PublishKafka processor in MiNiFi. This can be

used to get data more directly from the source to Kafka, instead of having it go through

the NiFi center.

Lastly, MiNiFi and NiFi can be combined, where MiNiFi delivers the data from the

source, and NiFi then uses the PublishKafka processor to publish messages to a Kafka

topic. [33]

2.5.3 NiFi as a Consumer

NiFi can also act as a Kafka consumer. In this case, the NiFi processor ConsumeKafka replaces

the Kafka consumer and will handle all the data from the chosen Kafka topic(s) and deliver

it to where it needs to go. No code needs to be written to implement this; the

ConsumeKafka processor is simply dragged and dropped into the NiFi workflow. When Kafka sends a

message to NiFi, the ConsumeKafka processor emits a FlowFile where the content of the

FlowFile is the content of the Kafka message. The ConsumeKafka processor also includes

an optional Message Demarcator. Unlike in the PublishKafka case, the demarcator in

the ConsumeKafka processor indicates that all of the messages that are received in a single

poll should be produced as one FlowFile, with the demarcator separating the messages

from each other. If the demarcator property is left blank, ConsumeKafka will produce one

FlowFile for each message received. [33, 36]
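The consumer-side demarcator behaviour can be modelled in the same simplified way. The sketch below only mirrors the rule described above (one FlowFile per poll when a demarcator is set, otherwise one FlowFile per message); the function name is hypothetical and no Kafka connection is involved.

```python
from typing import List, Optional


def flowfiles_from_poll(messages: List[bytes], demarcator: Optional[bytes]) -> List[bytes]:
    """Return the FlowFile contents produced from the messages of a single poll."""
    if not messages:
        return []
    if demarcator:
        # All messages of the poll end up in one FlowFile, separated by the demarcator.
        return [demarcator.join(messages)]
    # Demarcator left blank: one FlowFile per message.
    return list(messages)


poll = [b"a", b"b", b"c"]
merged = flowfiles_from_poll(poll, b"\n")
separate = flowfiles_from_poll(poll, None)
```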

2.6 Related Tools

Besides Apache Kafka, there are several tools that fulfil functions similar to those of

Apache NiFi. This section will describe some of these tools.

2.6.1 Apache Airﬂow

Apache Airﬂow is a platform for creating and managing workﬂows through Python scripts.

Airﬂow displays the workﬂows as Directed Acyclic Graphs (DAGs) of tasks, which can

be easily modified through command line utilities. Since Airflow uses Python as a

programming language, libraries and classes can easily be imported for easier creation and

managing of workﬂows. Airﬂow also has a web UI which provides insight into the logs and

status of tasks in the workﬂow. Airﬂow has many built-in integrations with tools such as

Apache Hive, Spark, HDFS, MySQL, etc. Apache Kafka is not among these integrations,

however there are still ways to connect the two platforms, with a bit of work. Overall, it

is possible to do a lot of things with Airﬂow, but the user is required to write the code

for their workﬂows themselves. Airﬂow is not classiﬁed as a data streaming tool, since the

tasks do not send data between each other. [10]

2.6.2 Apache Spark

Apache Spark is a cluster computing system, which provides high-level APIs in several

diﬀerent programming languages, and supports general execution graphs. Spark provides

many libraries, which allow for work with SQL, machine learning (ML), streaming, and

more. These diﬀerent libraries can also easily be combined in Spark applications. The

Spark API has an extension called Spark Streaming, which is specially created for scalable,

high-throughput, and fault-tolerant stream processing of data streams. Spark Streaming

can ingest data from Kafka, Twitter, TCP sockets, and more, and can push the data to

ﬁle systems such as HDFS, databases and dashboards. Spark Streaming has no UI, and

programs are written in Scala, Java or Python. [11, 37]

2.6.3 Apache Storm

Apache Storm is used for real-time processing of data streams. Storm uses vertices in

the form of "spouts" and "bolts", and edges in the form of "streams" to visualize DAGs.

Spouts are the sources of the streams, and will generally read from any queuing system, like

Apache Kafka. It is also possible to conﬁgure a spout to create its own stream, or read from

something such as a Twitter streaming API. Bolts will then process any number of input

streams and produce output streams. Inside the bolts is where most of the computation

logic happens in Storm. The bolts can communicate with any database system. Storm can

be used with any programming language, and it can reliably process boundless streams

of data. Use cases for Apache Storm include ETL (Extract, Transform, Load), real-time

analytics, and online ML. [12, 38]

2.6.4 Azure Data Factory

Azure Data Factory is a Microsoft service which is designed to integrate diﬀerent data

sources. It is a platform for managing data in the cloud and on-premises. It is used to

integrate data from storage systems to data-driven workﬂows for ETL and ELT (Extract,

Load, Transform). Data Factory is mainly used to load the data to Microsoft’s own Azure

SQL databases. Data Factory does not have any support for writing code to run. It also

lacks the ability to add processes and is very limited to the available tools in Data Factory.

Unlike the other tools presented, no free version of Data Factory is available. The price

of Data Factory depends on how many pipelines are orchestrated and executed, data ﬂow

executions and debugging, and the number of Data Factory operations that are being used.

[13, 39]

2.6.5 Logstash

Logstash is an open-source data processing pipeline from Elastic. Logstash ingests data

from a source (like Kafka, a ﬁle, or GitHub, to mention a few), transforms it in the

way you want, and then transports the data (or events) to a "stash". There are many

different stashes, for example sending the events to a TCP socket, publishing the data

to a websocket, writing events to a Kafka topic, or writing the events to a ﬁle on disk.

Logstash can be used as a pipelining tool but it is mainly used for storing and managing

logs and events. Logstash is run through the terminal and conﬁgurations are written in

bash. It is categorized as a log management tool, whereas NiFi is categorized as a data

streaming tool. [14, 40]

3 Data Pipelining Architecture and Prototype

3.1 Introduction

This chapter will describe how we implemented a prototype for data pipelining in Apache

NiFi. In part 3.2, a setup which does not use NiFi is described. Part 3.3 goes over the

advantages of using NiFi in the pipeline, and part 3.4 shows the experimental setups

which will be implemented in NiFi. Lastly, part 3.5 shows what our data pipeline looks

like in NiFi, and describes the diﬀerent processors used to build up the pipeline.

3.2 Current Setup

The goal for this project is to evaluate a new setup for data pipelining that can be used

in an Industrial IoT context. In this part, the setup that is currently used at the task

provider, Uddeholm AB, is described. This is a setup which only uses Kafka as a tool to

transfer raw data from the source to various data sinks. This setup will from here on be

referred to as the "old setup".

Data collected by sensors in the industrial machines is transferred using either only

PLC4x or a combination of a traditional PLC and the data acquisition software iba (more

speciﬁcally the tool ibaDatCoordinator). This data is sent to the internal Kafka cluster,

which is containerized via Docker and orchestrated with Kubernetes. There are three Kafka

consumers: a web server, a cloud service (AWS), and the production system. The SCADA

system Ignition is used mainly for visualization. A model of the setup can be seen in ﬁgure

3.1.

Figure 3.1: The current setup at Uddeholm AB

PLC - A PLC, or programmable logic controller, is an industrial computer which has

been adapted in order to be used for automation, for example in an assembly line or

in robotic devices. [41]

PLC4x - PLC4x is an open-source tool from Apache. It is a universal protocol adapter

for industrial IoT. PLC4x has many built-in integration possibilities, of which Apache

NiFi and Apache Kafka are two examples. [42]

iba - iba is a system for acquiring and analysing process data. The tool ibaDatCoordinator

is used for automatic processing and managing of measurements. ibaDatCoordinator

provides automatic generation of fault and quality reports, integrated status

monitoring, and it notiﬁes when set thresholds have been met. [43, 44]

Docker - Docker is a product for containerizing apps. Via Docker, it is possible to

download software in containers, and these are hosted on the Docker Engine. The Docker

Engine supports all diﬀerent types of applications, and it works on several diﬀerent

operating systems. [45, 46]

Kubernetes - Kubernetes is an open-source system that can be used to group containers

into logical units to simplify management. Kubernetes can help manage the

containers that run applications, and make sure that downtime is minimized. [47]

Ignition - Ignition is a platform for building and deploying industrial applications. The

Ignition software can act as a hub for all systems on the plant ﬂoor, which allows for

complete system integration. With Ignition, data from databases can be illustrated

in the form of graphs and tables. [48]

3.3 Why Bring in NiFi?

The main diﬀerence between the old setup and the proposed new setup (described in part

3.4) is that NiFi is brought in as the main data ﬂow manager, partly replacing Kafka. In

this part, these two tools will be compared.

Both NiFi and Kafka are able to move data from one node to another; however, NiFi

is classiﬁed as a data pipelining tool, whereas Kafka is classiﬁed as a distributed streaming

platform.

When comparing NiFi and Kafka one of the most obvious diﬀerences is that NiFi uses

a web-based GUI without the need for the user to write any code if they do not want

to. Kafka does not have a GUI and the user will need to write their own code to set up

consumers, producers, and topics. This makes NiFi much more user-friendly, and users

who do not have a lot of programming experience can ﬁnd it much more accessible than

Kafka. Thanks to NiFi’s GUI, it is also easy to oversee the workﬂow, something that is

harder to do in a similar way in Kafka.

Another diﬀerence is that Kafka is made for smaller messages and NiFi is made for

larger messages and streams. Performance-wise, it is faster for NiFi to create a single

FlowFile out of one million messages and send that, instead of sending one million

one-message FlowFiles. [33]

NiFi and Kafka are quite equal when it comes to the level of security they can provide.

As described in section 2.4.3, NiFi has several ways to provide secure connections when

sending data through the pipelines. One of these ways is to use two-way SSL authentication.

Kafka can also provide security in several ways, since the 0.9.0.0 update. These security

measures include securing connections to brokers through SSL, SASL, or SASL/PLAIN

authentication; encryption of data being sent between brokers and clients, tools, or other

brokers through SSL; and authorization of read/write operations done by clients. The

Kafka cluster can also be protected by requiring users to use SSH to connect to the

cluster with a password. These security measures are all optional, which also is the case

with NiFi. [49] Another way to provide security in Kafka, and the way it is done in the

old setup, is by using an Avro schema. The Avro schema can minimize and compress the

data, and the data can then only be read by someone who has the schema. If the schema

is placed at a diﬀerent endpoint than the data, this provides encryption of the data. [50]

In Kafka, all nodes contain metadata about which servers are currently alive and where

the partition leaders of a topic are. They can share this metadata with a producer as an

answer to a request, and the producer in turn can send this data directly to the broker that

is the leader for the partition. [51] NiFi stores all the FlowFile metadata in the FlowFile

Repository and all of the provenance event data in the Provenance Repository, as explained

in section 2.4.1. It is possible to get metadata about sent messages in Kafka, but it is a

bit tricky to do and not as easy as in NiFi. In this aspect, NiFi has a clear advantage over

Kafka.

3.4 New Pipelining Setups

A new setup was designed as an alternative to the old setup, described in part 3.2. The

thought behind this setup is to insert NiFi between Kafka and the sink for the data, and

also to add a database and a regular ﬁle system as data sources. The method for moving

data from machines via PLCs to Kafka will stay the same for the new setup; however, for

the test simulations, fake data will be generated in Kafka. Furthermore, the new design

has only one data sink, instead of the three data sinks as seen in ﬁgure 3.1. This data

sink is a distributed ﬁle system, more speciﬁcally Hadoop Distributed File System, HDFS.

According to Erik Hallin at Uddeholm AB, the advantage of using HDFS is that it is

more scalable than the current data sinks.

An alternative setup was also designed, similar to the new setup but with Kafka inserted

between NiFi and the distributed ﬁle system. However, the main focus of the testing in

this project will be on the new setup, not the alternative new setup.

3.4.1 New Setup

The new setup has NiFi alone for pipelining data from the data sources to the HDFS

endpoint, as seen in ﬁgure 3.2. NiFi will act as a Kafka consumer for a Kafka topic

and forward Kafka messages along with data from a database and ﬁles from a remote

ﬁle system to HDFS. The Kafka producer will publish messages to a topic by use of the

kafka-producer-perf-test.sh ﬁle, which generates messages to a speciﬁed topic, with

a speciﬁed number of records (messages), record size, throughput, and using a producer

conﬁguration ﬁle. This is used to simulate data coming in from PLCs and being pub-

lished to a Kafka topic. To make sure that NiFi can ﬁnd the Kafka broker which is on

a different machine, the variable advertised.listeners in the Kafka configuration file

server.properties is set to PLAINTEXT://[kafka machine ip]:9092. Other than this,

there is no need to configure any files in Kafka in any special way; the default values work

for the purposes of this experiment.
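As an illustration of how the load generator described above can be invoked, the following sketch builds the argument list for kafka-producer-perf-test.sh. The topic name, record count, record size, throughput, and file paths are placeholder values chosen for the example and are not the values used in the tests.

```python
import shlex
from typing import List


def perf_test_command(topic: str, num_records: int, record_size: int,
                      throughput: int, producer_config: str) -> List[str]:
    """Build the argument list for Kafka's producer performance test script."""
    return [
        "bin/kafka-producer-perf-test.sh",   # path relative to the Kafka installation
        "--topic", topic,
        "--num-records", str(num_records),   # number of messages to generate
        "--record-size", str(record_size),   # size of each message in bytes
        "--throughput", str(throughput),     # target messages per second
        "--producer.config", producer_config,
    ]


# Hypothetical values, chosen only for the example.
command = perf_test_command("plc-data", 100000, 1000, 1000, "config/producer.properties")
command_line = shlex.join(command)
```

The resulting list could be passed to subprocess.run on the machine where Kafka is installed; the sketch itself only constructs the command and does not start a producer.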

Figure 3.2: New setup with NiFi sending data directly to HDFS

For Hadoop, the ﬁle core-site.xml is edited to add a conﬁguration of the property

fs.defaultFS to be hdfs://[hdfs machine ip]:9000, and the ﬁle hdfs-site.xml is

edited to conﬁgure the property dfs.replication to be 1 and the property

dfs.permissions.enabled to be false. The latter is done to turn oﬀ permission checks,

which enables NiFi to connect to HDFS. The ﬁles hdfs-site.xml and core-site.xml are

copied and added to the machine which hosts NiFi, since these are needed to connect NiFi

and HDFS.
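The Hadoop settings described above follow the usual name/value layout of Hadoop configuration files. The sketch below generates the two files' contents with the three properties mentioned in the text. The IP address is a placeholder from the documentation address range, and the helper name is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def hadoop_site_xml(properties: Dict[str, str]) -> str:
    """Render properties in the <configuration><property> layout used by Hadoop."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


HDFS_HOST = "192.0.2.30"  # placeholder for [hdfs machine ip]

core_site = hadoop_site_xml({"fs.defaultFS": f"hdfs://{HDFS_HOST}:9000"})
hdfs_site = hadoop_site_xml({
    "dfs.replication": "1",               # a single copy of each block
    "dfs.permissions.enabled": "false",   # permission checks turned off
})
```

Turning off permission checks is convenient in an isolated test environment, but it would normally not be appropriate for a production cluster.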

3.4.2 Alternative New Setup

The alternative new setup is meant to test how it will work to send the data from NiFi

to a second Kafka topic before sending it to the endpoint in HDFS, as seen in ﬁgure 3.3.

This could be interesting if one already has a connection set up between Kafka and Hadoop.

Kafka and HDFS are configured as in the main new setup, and the NiFi flow is built up

the same way, apart from the final step where the data is sent to a Kafka topic instead of

an HDFS directory.

Figure 3.3: Alternative new setup with NiFi sending data to HDFS through Kafka

To send the data from Kafka to HDFS, a Python script is used, which uses the kafka-

python library to read messages from a topic. [52] The script then adds the contents of

each Kafka message along with a timestamp of the current time (expressed in the Unix

time stamp format) for each message to a local ﬁle. This local ﬁle then gets uploaded to

HDFS. This method of sending one larger ﬁle instead of many small ﬁles is used in order to

work around the long time (around 2 seconds) it takes to connect to HDFS for the sending

of each ﬁle. The Python script can be found in Appendix A.1.
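The batching idea described above can be sketched as follows. This is not the script from Appendix A.1; it is a minimal stand-in in which the messages come from a plain Python iterable, and the line layout (payload, space, timestamp in seconds) and the file name are assumptions.

```python
import os
import tempfile
import time
from typing import Iterable


def append_messages(messages: Iterable[bytes], path: str) -> int:
    """Append each message with the current Unix timestamp to a local file."""
    count = 0
    with open(path, "ab") as out:
        for payload in messages:
            stamp = int(time.time())  # Unix time in seconds (assumed resolution)
            out.write(payload + b" " + str(stamp).encode("ascii") + b"\n")
            count += 1
    return count


# Stand-in for messages read from a Kafka topic.
batch_path = os.path.join(tempfile.mkdtemp(), "kafka_batch.txt")
written = append_messages([b"msg-1", b"msg-2", b"msg-3"], batch_path)
```

With the kafka-python library, the iterable could for example be `(record.value for record in KafkaConsumer("topic", bootstrap_servers="host:9092"))`, and the finished file could then be uploaded in one step, for instance with `hdfs dfs -put`, so that the HDFS connection cost is paid once per batch rather than once per message.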

3.5 NiFi Processors Used

The NiFi processors that are used to create the new setups are described here. For some

of the tests the NiFi flow looks slightly different, but all of the processors used in the tests except one

(PublishKafka_2_0) can still be seen in figure 3.4 (full-size picture can be

found in Appendix A.5). The queue sizes for the connections between the processors were

increased to be able to hold 500 000 FlowFiles to try to make sure that the connections

would not inﬂuence the results. If a queue ﬁlls up, back pressure is applied to the previous

processor, which slows down the ﬂow.

Figure 3.4: The processors used in NiFi for the new setup. In the green box are processors

for consuming from Kafka topic, the yellow box is for getting ﬁles from the remote ﬁle

system, the red box is for pulling rows from the database. The blue box is for clearing out

HDFS, and the pink box is for putting ﬁles to the remote ﬁle system.

3.5.1 Consuming from Kafka Topic

ConsumeKafka_2_0 is used to consume messages from a Kafka topic. In this processor’s

properties, the IP address and port of the Kafka broker are set, along with the name

of the topic to consume data from. Using specifically ConsumeKafka_2_0 is necessary

for compatibility with the Kafka system, since it is of version 2.x. For earlier versions

of Kafka, there are other versions of the ConsumeKafka processor.

MergeContent is used to be able to merge several FlowFiles together for diﬀerent setups

and conﬁgurations of the data ﬂow. This processor is used for all three data sources

to test merging of FlowFiles. The properties Minimum Number of Entries and

Maximum Number of Entries are both set to 1 when no merging is to happen, along with

a Minimum Group Size of 0 bytes and no Maximum Group Size set. These values

are changed to allow for merging when this is tested. The Maximum and Minimum

Group Sizes were mainly used to set size intervals for the merged FlowFiles.

PutHDFS is used for all three data sources. This processor needs the directories to the

ﬁles core-site.xml and hdfs-site.xml, which have been copied to the machine on

which NiFi is hosted. PutHDFS also needs the directory in HDFS where the ﬁles

should be put.
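To make the effect of the MergeContent entry and size limits more concrete, the sketch below models a strongly simplified version of the merging. It is not NiFi's actual bin-packing algorithm: it fills one bin at a time, closes the bin when a maximum would be exceeded, and holds back a final bin that does not reach the minimums. The function and parameter names are chosen for the example.

```python
from typing import List, Optional


def merge_flowfiles(flowfiles: List[bytes], min_entries: int, max_entries: int,
                    min_size: int = 0, max_size: Optional[int] = None) -> List[bytes]:
    """Merge FlowFile contents into bins limited by entry count and byte size."""
    merged: List[bytes] = []
    current: List[bytes] = []
    current_size = 0
    for content in flowfiles:
        # Close the current bin first if adding this FlowFile would exceed a maximum.
        too_many = len(current) + 1 > max_entries
        too_big = max_size is not None and current_size + len(content) > max_size
        if current and (too_many or too_big):
            merged.append(b"".join(current))
            current, current_size = [], 0
        current.append(content)
        current_size += len(content)
    # The last bin is only emitted if it satisfies both minimums.
    if current and len(current) >= min_entries and current_size >= min_size:
        merged.append(b"".join(current))
    return merged


files = [b"x" * 1000 for _ in range(10)]                       # ten 1 kB FlowFiles
unmerged = merge_flowfiles(files, min_entries=1, max_entries=1)
bins_of_five = merge_flowfiles(files, min_entries=5, max_entries=5)
```

With both entry limits set to 1, every FlowFile passes through unchanged, which corresponds to the "no merging" configuration; with both set to 5, the ten FlowFiles become two merged FlowFiles of 5 000 bytes each.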

3.5.2 Getting Files from File System

GetSFTP is the processor which gets ﬁles from a remote ﬁle system hosted on a diﬀerent

machine than NiFi is. This processor uses SFTP (SSH File Transfer Protocol) to

connect to the remote ﬁle system. To do this, the processor needs the IP address of

the remote node, the SSH port (22), and a username with a corresponding SSH key

to be able to access the ﬁle system, which is in a deﬁned directory. The run schedule

of this processor is set to 0.01 seconds, meaning that GetSFTP will pull ﬁles from

the ﬁle system 100 times per second. The property Max Selects is set to 5 000, which

means that this is the maximum number of ﬁles to be pulled in a single connection.

MergeContent is configured as described in 3.5.1.

PutHDFS is configured as described in 3.5.1.

3.5.3 Getting Data from MariaDB

When gathering data from MariaDB to HDFS, NiFi needs to use more processors than for

the other data sources, as seen in ﬁgure 3.4. This is because SQL data needs to be handled

in another way than the other data sources. The ﬁrst processor, QueryDatabaseTable,

gathers the data from a chosen SQL table and that data is converted to Avro-format

FlowFiles. To be able to process these Avro FlowFiles, NiFi needs to convert them to

JSON format FlowFiles, hence the need to use more processors. The ﬁles from MariaDB

are the ﬁles used to measure latency in the tests, and in order to do this it is also necessary

to have two ReplaceText processors in this ﬂow.

The processors (and controller service) which are used are:

QueryDatabaseTable is used to incrementally extract data, based on a column, from the

SQL table that the data is gathered from. The processor gathers rows from

the table, and puts each row in a separate FlowFile. The processor needs to be

conﬁgured to know which database it should gather data from, what kind of database

and which table. This is conﬁgured in the properties of the processor. Furthermore,

the DBCPConnectionPool controller service needs to be conﬁgured for the processor

to be able to work and access the database. This controller service is described below.

The output of the QueryDatabaseTable processor is Avro format FlowFiles.

DBCPConnectionPool is the controller service that is used to obtain a connection to a

specified database. It uses a separate MariaDB Connector, which is a driver that

connects Java applications to MariaDB. The controller service needs

to be configured with the database connection URL (which contains the IP address,

the port, and the name of the database), the driver class name from the MariaDB

Connector, the location of the MariaDB Connector, and the login credentials for the

database.

ConvertAvroToJSON is used to convert the Avro format FlowFiles into JSON format

FlowFiles. The output JSON FlowFile is encoded with UTF-8 encoding. The JSON

container options property is set to none.

MergeContent is conﬁgured as described in 3.5.1.

ReplaceText is used to add a timestamp to the FlowFiles. The timestamp is in the

Unix time stamp format, expressed in milliseconds elapsed since midnight on 1st of

January, 1970. This timestamp is appended to the end of the FlowFile, and is used

to check the latency when sending data through the pipeline.

PutHDFS is conﬁgured as described in 3.5.1.

ReplaceText is used a second time to add another timestamp in the FlowFiles. This

timestamp represents the approximate time that a FlowFile gets sent to HDFS, since

the FlowFiles it gets as input are the successful ﬁles from the PutHDFS processor.

PutHDFS is used a second time to store the FlowFiles after the last timestamp is added.

These FlowFiles get sent to a diﬀerent HDFS directory than the other FlowFiles,

for easier extraction of timestamps. Other than this, the processor is conﬁgured the

same way as the other PutHDFS processors.
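The two timestamps added by the ReplaceText processors can be used to compute how much time passed between them. The sketch below assumes, purely for illustration, that each stored FlowFile ends with two 13-digit millisecond timestamps separated by whitespace; the actual layout of the files in the tests may differ.

```python
import re
from typing import List

# 13-digit numbers, i.e. Unix timestamps in milliseconds (assumed file layout).
TIMESTAMP = re.compile(rb"\b\d{13}\b")


def trailing_timestamps(content: bytes, count: int = 2) -> List[int]:
    """Return the last `count` millisecond timestamps found in the content."""
    found = [int(match) for match in TIMESTAMP.findall(content)]
    if len(found) < count:
        raise ValueError("fewer timestamps than expected")
    return found[-count:]


def elapsed_ms(content: bytes) -> int:
    """Difference between the timestamp added before and the one added after PutHDFS."""
    before_put, after_put = trailing_timestamps(content, 2)
    return after_put - before_put


# Hypothetical FlowFile content: a JSON row followed by the two appended timestamps.
sample = b'{"id": 1, "data": "abc"} 1588000000000 1588000000250'
delay = elapsed_ms(sample)
```

In this made-up sample the two timestamps are 250 ms apart.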

3.5.4 Other Processors

GenerateFlowFile is a processor which generates FlowFiles which are to be put to the

remote file system. The size of these files is set to 1 kB. For the purposes of this

experiment, there is no need for unique FlowFiles or a custom text as the content of

the FlowFiles.

PutSFTP takes the FlowFiles from GenerateFlowFile and puts them to the remote ﬁle

system data source. Like with the GetSFTP processor, the IP address of the remote

node, the SSH port, and a username with an SSH key are needed to connect to the file

system, along with the path in the ﬁle system where the ﬁles should be put. These

are the ﬁles that will be fetched in GetSFTP.

GetHDFS is a processor which takes ﬁles from HDFS and puts them into the NiFi ﬂow.

In our ﬂow however, this processor is only used to empty the contents of HDFS

between tests. To make this happen, the property "Keep Source File" is set to

false, and "Ignore Dotted Files" is set to true, to make sure all files get removed.

The processor is also set to automatically terminate all successfully retrieved files, since

they are not needed in the NiFi ﬂow. As in PutHDFS, the paths to core-site.xml

and hdfs-site.xml are speciﬁed, as well as the directory which needs to be emptied.

For the data ﬂow which uses Kafka as an intermediary between NiFi and HDFS (for the

alternative setup), the processor PublishKafka_2_0 is used in the places where PutHDFS

is used in figure 3.4.

PublishKafka_2_0 takes FlowFiles and turns them into messages which are published

to a Kafka topic. As with the processor used to consume from a Kafka topic,

PublishKafka_2_0 needs the IP address and port of the Kafka broker, along with the name

of the topic to which the messages should be published. Since Kafka version 2.x is

used, it is necessary to use the 2.0 version of the PublishKafka processor, and not

one of the earlier versions.

4 Experimental Setup

4.1 Introduction

In the experimentation of this project, a data ﬂow will be implemented in Apache NiFi

and its performance will be measured in a number of different tests. The data flow will be

implemented in a new and alternative new setup, as described in part 3.4, and tested with

the purpose of evaluating the setups in terms of latency and throughput. We will use fake

data for the purpose of evaluating the performance of the NiFi setup, and the contents

of the data are not important, as long as the payload is fixed. NiFi will have the same

three sources for data in all the tests: an SQL database, a ﬁle system, and a Kafka topic.

There will be three diﬀerent tests. For the ﬁrst two tests, the data will be delivered to the

distributed ﬁle system data sink directly from NiFi by using the new setup (as described in

part 3.4.1). In the third test the new setup will be compared to an alternative new setup,

in which data will be sent from NiFi to the distributed ﬁle system via a Kafka topic (as

described in part 3.4.2). These tests are done to answer the following questions:

• What is the FlowFile size, accomplished by merging together smaller FlowFiles, that

gives the lowest latency for the new setup?

• Is it better to combine FlowFiles from different sources before sending them to the

distributed file system, or is it better to keep them separate?

• Is there a difference in latency and throughput between the new NiFi setup and the

alternative setup, which sends data from NiFi to a Kafka topic before the distributed

file system?

The hypotheses we have based on these questions are:

• The best FlowFile size accomplished by merging in our new setup will be somewhere

between doing no merging at all (resulting in 1 kB files) and merging the FlowFiles

together to be 128 MB, which is the default block size for HDFS. We expect the

value to be closer to 128 MB than 1 kB, since we believe the majority of the latency

will come from waiting to be moved to HDFS.

• There will not be a big difference between doing separate and combined merging; the

main difference will be that the output files look different as they contain data from

different sources and not just one.

• There will either not be any major difference in latency and throughput between the

two setups, or the setup which uses Kafka before sending data to the data sink will

be slightly slower, since there are more steps for the data to go through.

Part 4.2 describes the speciﬁc software used for the database and the distributed ﬁle system.

Part 4.3 describes the three diﬀerent compute nodes that are used to make up the data

ﬂow, and how these were conﬁgured. Lastly, part 4.4 explains the setup of the diﬀerent

tests that are performed, along with the metrics of the tests and how these are measured.

4.2 Additional Software Used

In addition to Apache Kafka and Apache NiFi, some other software is also used to build

the setups. There is a need for a cloud-based ﬁle system as a data sink. For this, Hadoop

Distributed File System (HDFS) was used. As an SQL database data source, MariaDB

was used.

4.2.1 Apache Hadoop and HDFS

Apache Hadoop is a software library project which contains several diﬀerent modules for

distributed processing of large sets of data. The Hadoop modules include, among others,

Hadoop YARN for scheduling of jobs and cluster resource management, and Hadoop

Distributed File System. Hadoop Distributed File System is a distributed file system that is

designed to store a smaller number of very large files (rather than many small files). HDFS

runs on a cluster of commodity hardware, and is fault-tolerant through replication of data.

HDFS uses a master-slave architecture, where a system has one NameNode as the master,

and several DataNodes acting as slaves. Usually, there is one DataNode per participant in

a cluster. The DataNodes perform read and write operations on the ﬁle system. [53, 54]

HDFS is designed to support very large files, and its performance will decrease if it needs

to manage many smaller ﬁles. The default block size of HDFS as of version 2.0 is 128

MB. If a ﬁle is bigger than this, it will be split into 128 MB chunks in HDFS. If a ﬁle

is smaller than the block size, it will still occupy a block of its own, which means that many

small files can take up much more space in the NameNode's memory than one large file, even though their

total sizes are the same. The default block size can be changed in the configuration

file hdfs-site.xml.¹ [55]
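The difference between many small files and one large file can be made concrete with a short calculation. The sketch below counts the blocks needed for the same amount of data stored either as 1 kB files or as one merged file. It uses binary units (1 kB taken as 1 024 bytes) to keep the arithmetic exact, which is a simplification made for the example.

```python
import math

MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB  # default HDFS block size in bytes


def blocks_for_file(size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks that a file of the given size is split into."""
    return math.ceil(size_bytes / block_size)


total_bytes = 128 * MIB   # the same amount of data in both cases
small_file = 1024         # one unmerged 1 kB FlowFile

small_file_blocks = (total_bytes // small_file) * blocks_for_file(small_file)
merged_blocks = blocks_for_file(total_bytes)
```

Stored as 1 kB files, the data is spread over 131 072 files with one block each, whereas a single merged file of 128 MB needs only one block. Each file and block has to be tracked by the NameNode, which is why the small-file case is so much more expensive.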

4.2.2 MariaDB

MariaDB is an open-source SQL database initially released in 2009, based on a fork of

the MySQL database system. MariaDB is included in several Linux distributions,

including Ubuntu, Debian, and Fedora. MariaDB includes features like encryption,

authorization, authentication, and logging, to provide security. It can also provide high availability

through replication and clustering. [56, 57]

To set up MariaDB for the tests, a database with one table was created. Each row in

the table contained a unique id attribute, a timestamp of the time the row was added, and

30 more attributes ﬁlled with arbitrary data. By doing so, each row contains roughly 1 000

characters, which is equivalent to approximately 1 kB of data. An SQL script (available in

Appendix A.2) was used to add rows to the database table at the same time as NiFi pulls

rows from the table, which was done to have the timestamps close to the time when the

rows are pulled into NiFi.
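The table layout and the incremental pulling of rows can be illustrated with a small self-contained sketch. It uses SQLite from the Python standard library as a stand-in for MariaDB, so it is not the SQL script from Appendix A.2. The table name, column names, and filler values are assumptions made for the example.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MariaDB database
attribute_columns = ", ".join(f"attr{i} TEXT" for i in range(1, 31))
conn.execute(
    f"CREATE TABLE sensor_data (id INTEGER PRIMARY KEY, created REAL, {attribute_columns})"
)


def add_row(connection: sqlite3.Connection) -> None:
    """Insert one row: auto-generated id, current time, and 30 filler attributes."""
    filler = ["x" * 32] * 30            # 30 * 32 = 960 characters of arbitrary data
    placeholders = ", ".join("?" * 31)  # created + 30 attributes
    connection.execute(
        f"INSERT INTO sensor_data VALUES (NULL, {placeholders})",
        [time.time(), *filler],
    )


def fetch_new_rows(connection: sqlite3.Connection, last_seen_id: int):
    """Return rows with an id above last_seen_id, plus the new highest id seen."""
    rows = connection.execute(
        "SELECT * FROM sensor_data WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last


for _ in range(3):
    add_row(conn)
first, last_id = fetch_new_rows(conn, 0)        # pulls rows 1 to 3
for _ in range(2):
    add_row(conn)
second, last_id2 = fetch_new_rows(conn, last_id)  # pulls only rows 4 and 5
third, _ = fetch_new_rows(conn, last_id2)         # nothing new
row_chars = sum(len(str(value)) for value in first[0])
```

Remembering the highest id that has been seen is what makes each pull incremental: rows that were already fetched are not returned again, while rows added in the meantime are picked up by the next pull.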

¹ This was unfortunately noticed by us very late in the project, when we did not have time to perform

further testing where the block size was changed.

4.3 Compute Nodes

During the experiment, three diﬀerent compute nodes are used to host the diﬀerent software

and systems. Each node is implemented through a Virtual Machine (VM). A visualization

of the nodes can be seen in ﬁgure 4.1. Download links to the software used can be found

in Appendix A.3.

Figure 4.1: The three compute nodes used for the experiment.

4.3.1 Node 1

The first node is a VM which is accessed through SSH. The VM runs Ubuntu version 18.04 as its operating system (OS) and has 8 GB of RAM and 4 CPU cores with a 2.5 GHz clock frequency. Java Development Kit (JDK) version 11.0.6 is installed on the VM so that Kafka can run. This node contains three different data sources that send data to NiFi: a standard file system, a MariaDB SQL database version 10.4.12, and a Kafka broker and producer sending messages to a topic. The Kafka version used is 2.2.0 (the 2.12-2.2.0 release, i.e. the build for Scala 2.12).

4.3.2 Node 2

The second node is the same kind of VM as node 1, with Ubuntu 18.04 and JDK version 1.8.0. This node hosts NiFi version 1.11.4, which is started from a terminal. Accessing NiFi's GUI requires a web browser, which is not available in the VM, so the GUI is accessed by visiting [node 2 ip]:8080/nifi/ in a web browser of choice. NiFi gets data from the sources in node 1, and sends data to the sinks in node 3.

4.3.3 Node 3

The third node is the same kind of VM as nodes 1 and 2, with the same version of Ubuntu. The third node contains a Kafka 2.12-2.2.0 broker and consumer, and Hadoop version 3.2.1, which contains an HDFS module. Due to requirements from Hadoop, the JDK version for node 3 is 1.8.0. Hadoop is set up in pseudo-distributed mode, meaning that it runs on a single node but each Hadoop daemon runs in a separate process, imitating a fully distributed mode. [58] Kafka is only used on this node for the alternative new setup used in the third test.

4.4 Experiment Description

For the evaluation of the new data pipeline setups in NiFi, three different tests were performed to compare and evaluate how these setups perform in terms of latency and throughput under different configurations.

4.4.1 Performance Metrics

The metrics that are measured and compared in the evaluation are latency and throughput.

The latency measured in the experiments is defined as the difference in time between when a chunk of data is sent from its source and when the data arrives at its destination. The latency is measured by comparing timestamps which are stored in the files that are sent from MariaDB. As a row gets added to the table in MariaDB, it gets a timestamp of the current time as one of its attributes. When the row gets turned into a FlowFile in NiFi, the timestamp becomes part of the contents of the FlowFile. Right before the MariaDB FlowFiles get sent to HDFS, or when a FlowFile arrives at Kafka, a second timestamp is added to the contents of the FlowFile.

For the third test, these two timestamps are compared to calculate the latency between creation and sending to HDFS. For the first and second tests, a third timestamp is added after the FlowFiles have been sent to HDFS, to get a more accurate time of when the files actually get to HDFS, and this is the timestamp used to calculate the total latency for these tests. For these tests, the difference between the first and second timestamps is also used to see how the latency is distributed over the data flow, i.e. how much of the total latency is spent between row creation in MariaDB and the point just before the file is sent to HDFS. In all the tests, a sample of 50 random but evenly distributed FlowFiles is collected and used to calculate an average latency and show how the latency is distributed.

The latency results represent the worst-case values for all measurements where FlowFiles are merged together. This means that the timestamp that is logged as the “starting time” is the first timestamp in a file. This represents the creation time of the FlowFile that arrived at the merging queue first and therefore had to wait in it the longest. It was deemed more interesting to look at these times than the time for the FlowFile that arrived last and had to wait the shortest time in the queue. When no merging is done, each file only contains one “starting time”, so this is the time that is used as the first timestamp. All three VMs used were verified to be synchronized with regard to time.

The throughput is measured in bytes received in HDFS per second (bytes/second), which is calculated by dividing the number of received bytes by the number of seconds NiFi was running. The running time is calculated by subtracting the first timestamp of the first file to get to HDFS from the final timestamp of the last file to get to HDFS.
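The two metrics can be summarised in a short sketch. The function names are hypothetical, and the sampling helper picks evenly spaced items as one possible reading of "evenly distributed"; the report does not state exactly how the 50 FlowFiles were selected.

```python
from datetime import datetime, timedelta


def worst_case_latency_ms(creation_times, arrival_time):
    """Latency of a merged file, measured from its earliest creation timestamp."""
    start = min(creation_times)
    return (arrival_time - start) / timedelta(milliseconds=1)


def evenly_spaced_sample(items, sample_size=50):
    """Pick sample_size items spread evenly over the whole sequence."""
    items = list(items)
    if len(items) <= sample_size:
        return items
    step = len(items) / sample_size
    return [items[int(i * step)] for i in range(sample_size)]


def throughput_bytes_per_second(total_bytes, first_timestamp, last_timestamp):
    """Bytes received divided by the time between the first and last timestamp."""
    elapsed = (last_timestamp - first_timestamp).total_seconds()
    if elapsed <= 0:
        raise ValueError("last_timestamp must be later than first_timestamp")
    return total_bytes / elapsed


t0 = datetime(2020, 5, 1, 12, 0, 0)
created = [t0 + timedelta(milliseconds=10), t0]   # two rows merged into one file
arrived = t0 + timedelta(milliseconds=60)
print("latency (ms):", worst_case_latency_ms(created, arrived))
print("throughput (B/s):",
      throughput_bytes_per_second(300_000, t0, t0 + timedelta(seconds=10)))
```

Using the earliest creation timestamp is what makes the latency a worst-case value: it belongs to the row that waited longest before the merged file was sent.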

4.4.2 Test Descriptions

For the evaluation of the data pipeline setups in NiFi, three tests were performed.

The purpose of the first test is to find the merging size of FlowFiles which results in the lowest latency in our new setup in NiFi. Files get pulled from the sources (file system, Kafka topic, and MariaDB table), and are then separately merged with the MergeContent processor, with different merging configurations. The merged sizes compared in this test are 10 kB, 50 kB, 100 kB, 1 000 kB and 2 000 kB. For the MariaDB and file system sources, each file is initially 1 kB in size, which means that 10 files are needed to get a 10 kB FlowFile, 50 files for a 50 kB FlowFile, etc. The default size for Kafka records is set to 100 B, since this is the approximate size of the Kafka records in the old setup. This means that 10 times more Kafka records need to be merged together than files from the other sources. Additionally, one measurement is done with no merging of the FlowFiles. Since not all files are exactly 1 kB (or 100 B for the Kafka records) in size, the MergeContent processor's properties are configured to accept an interval of output FlowFile sizes. If this is not done, FlowFiles can get trapped in the queue before the MergeContent processor, since it might be impossible to create an output FlowFile of the exact required size. The intervals used are presented in table 4.1. These intervals were chosen to make sure that most of the FlowFiles in the data flow could pass through the MergeContent processor without having to spend unnecessary time in a queue.

Table 4.1: Intervals for achieving different merging sizes

Size (kB)    Interval (kB)
10           8-12
50           40-60
100          80-110
1 000        900-1 100
2 000        1 900-2 100
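The reason an interval is needed can be shown with a simplified model. The sketch below is not NiFi's actual binning algorithm; it is a hypothetical greedy grouping in which a bundle is emitted as soon as its total size lies within the accepted interval, and files get stuck when the next file would push the bundle past the maximum. In NiFi, such an interval would presumably be set through the MergeContent properties Minimum Group Size and Maximum Group Size; the report does not name the properties that were used.

```python
def merge_by_size(file_sizes, min_size, max_size):
    """Greedily group file sizes into bundles within [min_size, max_size].

    Returns (bundles, leftover). In this simplified model, files that cannot
    be placed without exceeding max_size stay queued, together with all files
    behind them.
    """
    bundles = []
    current = []
    total = 0
    for index, size in enumerate(file_sizes):
        if total + size > max_size:
            # The bundle is below min_size but cannot accept the next file.
            return bundles, current + list(file_sizes[index:])
        current.append(size)
        total += size
        if total >= min_size:
            bundles.append(current)
            current, total = [], 0
    return bundles, current


# Assumption for illustration: files that are slightly larger than 1 kB.
files = [1100] * 20

exact_bundles, exact_leftover = merge_by_size(files, 10_000, 10_000)
interval_bundles, interval_leftover = merge_by_size(files, 8_000, 12_000)

print("exact 10 kB:  bundles =", len(exact_bundles), "queued =", len(exact_leftover))
print("8-12 kB:      bundles =", len(interval_bundles), "queued =", len(interval_leftover))
```

With an exact target size no bundle can be completed in this model, while the 8-12 kB interval from table 4.1 lets bundles through after eight files.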

The NiFi ﬂow for this test can be seen in ﬁgure 3.4 (full-size version in Appendix A.5).

Figure 4.2: NiFi data ﬂow for the second test.

The second test also consists of merging several small FlowFiles into fewer and bigger FlowFiles. In this test, data from the three sources (MariaDB, Kafka topic, and file system) is combined and merged together in the same MergeContent processor, see figure 4.2 (a full-size version of the figure can be found in Appendix A.5). The purpose of this test is to compare the latency from this test to the one measured in the first test, in which the sources are merged separately. For this test, the sizes of the combined merged FlowFiles are 10 kB, 50 kB and 1 000 kB, with the same intervals of output FlowFile size set in the MergeContent processor as in the first test.

The third test compares the new NiFi setup (as described in part 3.4.1) with the alternative new setup (as described in part 3.4.2), to evaluate the differences in latency and throughput between the two. In this test, the new setup is tested first with only MariaDB as a source, with each row in the database table being 1 kB in size. Kafka, which has now been changed to have a record size of 1 kB, is then added as another source of data packets. Lastly, the file system is added as a third 1 kB data source. In each part of the test, the FlowFiles are separately merged together in 10 kB batches with the same interval as in the first test, to minimize queue times to send data to HDFS. The same procedure is done with the alternative new setup, see the NiFi flow in figure 4.3 (a full-size version of the figure can be found in Appendix A.5). This flow has a PublishKafka_2_0 processor instead of the PutHDFS processor in the new setup, and the ReplaceText processors have also been removed. Instead of setting a timestamp in the FlowFiles as they leave NiFi to be sent to HDFS (which is what happens in the new setup), a second timestamp is added in node 3, after the files have gone through a Kafka topic and before they get sent to HDFS. The latency that is measured in the third test is the time from row creation in MariaDB to when the file is ready to be sent to HDFS. This is because there was some trouble setting up a good and fast connection between Kafka and HDFS, and we did not want our bad configuration to affect the results. In this test, the throughput is also measured and compared.

Figure 4.3: NiFi data flow for alternative new setup used for the third test.

5 Results & Evaluation

5.1 Introduction

This section goes through the results of the experiments which were performed. The raw

data that was used to make the graphs and plots is available in Appendix A.4. Section 5.2

presents and evaluates the results from test 1, section 5.3 goes over the results from test

2, and ﬁnally section 5.4 describes the results from test 3.

5.2 Results of Test 1

Figure 5.1: Average FlowFile latency for diﬀerent merging sizes.

As seen in figure 5.1, there is a vast difference in the latency between doing no merging (which is expressed as merging 1 kB in the graph) and doing any kind of merging. The average latency for sending files from NiFi to HDFS when merging is done lies between 45 and 112 milliseconds. In comparison, when the FlowFiles are not merged, the files need to queue for an average of 106 331 milliseconds, almost 2 minutes, to get through the PutHDFS processor. This large difference between doing no merging and merging to 10 kB FlowFiles was not expected. As can be seen in figure 5.1, the PutHDFS processor performs more than 10 times better when putting 10 kB files to HDFS than 1 kB files. The reason for this is not fully clear, but one possible explanation is HDFS's preference for larger files over small files.

It is only for the ﬁrst measuring point that the latency between NiFi and HDFS (the

diﬀerence between the red and blue line in ﬁgure 5.1) is signiﬁcant. For all merging

measuring points the main part of the latency happens between the starting point in

MariaDB and the point in NiFi before the files are sent to HDFS; the actual sending of

ﬁles to HDFS takes relatively little time. This latency distribution can be seen more clearly

in ﬁgure 5.2. This ﬁgure shows the percentage of the total latency which is spent between

the row creation in MariaDB and before sending ﬁles to HDFS. Figure 5.2 shows more

clearly what can be seen in the diﬀerence between the two lines in ﬁgure 5.1, which is

that it is only for the measuring point for no merging that the majority of the latency for

a FlowFile is spent waiting to be sent to HDFS. Again, this is because the ﬁles in that

case have to wait for almost 2 minutes on average to get through the PutHDFS processor.

However, just because the percentage of time spent between MariaDB and the PutHDFS

queue is lower for no merging than merging, it does not mean that less time is spent at

this stage. As can be seen in ﬁgure 5.1, the actual time spent getting to the PutHDFS

processor is larger for 1 kB ﬁle size than 10 kB. This is likely a symptom of NiFi having

to handle more ﬁles when the ﬁles are not merged together.

Since the results for not merging FlowFiles are signiﬁcantly diﬀerent from, and so much

worse than, the results from merging the ﬁles, these results will be disregarded from now

on. The focus will instead be on the differences between different merging sizes, starting

at a merging size of 10 kB. Based on the latency distribution seen in figure

5.2 and doing similar calculations for the other tests, in the rest of the graphs only the

total latency will be displayed. The majority of the total latency happens between row

creation in MariaDB and getting to the queue before the ﬁrst PutHDFS processor in NiFi.

Figure 5.2: Percentage of the total average FlowFile latency made up by the time between

MariaDB and NiFi, before being sent to HDFS.

Figure 5.3: Average FlowFile latency for different merging sizes.

Figure 5.4: Latency distribution for different merging sizes.

Figure 5.4 shows the spread of values behind the average total latency (from row creation in MariaDB to the FlowFile having been sent to HDFS) displayed in figure 5.3. The values shown in these graphs are based on the same data as in figure 5.1, but without the value for no merging. These graphs show how the latency for the files increases as more kilobytes are merged together before sending the files to HDFS. This is because FlowFiles need to wait in a queue before the MergeContent processor until enough FlowFiles have arrived to be merged together. The larger the merged files are, the longer on average the files will have to wait in the queue before being merged together, as more FlowFiles are needed to get to the merging size. So while it is faster to send few and large files to HDFS, it is slower within the NiFi flow to have to merge together large numbers of FlowFiles. There is also a memory issue in NiFi when trying to merge together very many files. During the testing, an attempt was made to merge together 10 000 kB of FlowFiles, but this made NiFi crash since too much data needed to be held in memory at the same time as the queues filled up. The FlowFile, Content, and Provenance Repositories and the log filled up with large amounts of data, and these needed to be emptied manually before it was possible to start NiFi up again. This led to the conclusion that while it is better for HDFS to get very large files (preferably up to 128 MB when this is set as the block size), it is not feasible to merge together as many 1 kB FlowFiles in NiFi as are needed to get to these sizes.
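The relationship between merging size and queue time can be sketched with a simple model. It assumes that files of equal size arrive at a constant rate, which is an idealisation; the arrival rate of 100 files per second is a hypothetical value and was not measured in the tests.

```python
import math


def first_file_wait_seconds(merge_size_kb, file_size_kb, files_per_second):
    """Time the first file in a bundle waits until the bundle is complete.

    Assumes equally sized files arriving at a constant rate.
    """
    files_needed = math.ceil(merge_size_kb / file_size_kb)
    return (files_needed - 1) / files_per_second


ARRIVAL_RATE = 100  # hypothetical: 1 kB files arriving per second

for merge_size_kb in (10, 50, 100, 1000, 2000):
    wait = first_file_wait_seconds(merge_size_kb, 1, ARRIVAL_RATE)
    print(f"{merge_size_kb:>5} kB merge size: first file waits about {wait:.2f} s")
```

In this model the wait of the first file grows linearly with the merging size, which matches the trend described above, although the measured values also include other delays in the flow.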

Based on the results of this test, 10 kB was chosen as the “default” merging value for

the other tests, since it had the lowest average latency of the tested merging sizes. By

doing this instead of having the default be to do no merging at all, the very high latency

between NiFi and HDFS that happens when no merging is done (as shown in ﬁgure 5.1)

can be avoided.

5.3 Results of Test 2

Figure 5.5: Average FlowFile latency for different merging sizes, comparing separate and combined merging.

Figure 5.6: Latency distribution for combined and separate merging.

The results of the second test are shown in ﬁgures 5.5 and 5.6. Figure 5.5 shows the

diﬀerence in average latency for separate and combined merging and how the average

latency increases as the merge size of the FlowFiles increases.

The diﬀerence in average latency between the separate and combined merging for a

10 kB merging size is about 1 000 ms, where it is faster to do separate merging. For the

other measuring points it is faster to do combined merging than separate merging. When

the merging size is 1 000 kB, the diﬀerence in latency between the merging methods is

about 5 000 ms, with combined merging being faster. This shows that it is faster to use

combined merging than separate merging for larger FlowFile sizes, and for smaller merging

sizes there is less of a diﬀerence.

The deviating value for combined merging at 10 kB can be somewhat explained by

looking at the distribution of latency in ﬁgure 5.6. This ﬁgure shows the distribution of

latency for both combined and separate merging. This shows that the spread of values for

combined merging is much greater for 10 kB compared to separate merging. Most of the

higher latency values happened in the ﬁrst 30 seconds of the test. The reason for this is not

clear, but it is plausible to think that in a real life scenario where the NiFi ﬂow runs over

long periods of time, the latency for combined merging would stabilize at a value close to

the one that is observed with separate merging. The distributions for the other measuring

points are less spread out, which means that it is easier and safer to draw conclusions based

on these values. However, only for the 1 000 kB values is there a significant difference in

latency between separate and combined merging (and combined merging is signiﬁcantly

faster in this case). The results of the tests for 50 kB and 10 kB have overlapping intervals,

which means that the diﬀerences are not statistically signiﬁcant.

The reason that combined merging is faster than separate merging in the case where

the results are signiﬁcantly diﬀerent is that for combined merging, the queue before the

MergeContent processor is shared between the three data sources. This means that the

queue ﬁlls up to the merging size faster, and FlowFiles do not have to wait for as long to

be merged together. When setting the merging size to be large (1 000 kB in this test),

the queue waiting time for a FlowFile is relatively long when sources are being merged

separately. If the sources are merged together however, this queue time can be decreased.

This result also shows what could be seen in the ﬁrst test, that the main part of the latency

happens when waiting to be merged together, and that when this time can be decreased,

the latency gets signiﬁcantly lower.
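The effect of a shared queue can be illustrated with a small calculation. The data rates below are hypothetical and assumed to be constant and equal for all sources, which was not necessarily the case in the tests.

```python
def fill_time_seconds(merge_size_kb, source_rates_kb_per_s):
    """Time until a merge queue fed by the given sources reaches merge_size_kb."""
    return merge_size_kb / sum(source_rates_kb_per_s)


MERGE_SIZE_KB = 1000
RATE_PER_SOURCE = 100  # hypothetical: kB per second from each source

separate = fill_time_seconds(MERGE_SIZE_KB, [RATE_PER_SOURCE])
combined = fill_time_seconds(MERGE_SIZE_KB, [RATE_PER_SOURCE] * 3)

print(f"separate merging: queue fills in {separate:.1f} s")
print(f"combined merging: queue fills in {combined:.1f} s")
```

With three equally fast sources sharing one queue, the queue reaches the merging size in a third of the time, so the first FlowFile in each bundle waits correspondingly less.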

5.4 Results of Test 3

Figure 5.7: Average throughput for the two different setups with different numbers of 1 kB sources.

Figure 5.7 shows the throughput results for the two diﬀerent setups used in the third test.

The throughput is roughly the same for the two setups when it comes to using one and

two sources, where the diﬀerence is less than 1 kB per second for both of these measuring

points. The only major diﬀerence is when all three sources are used, where the alternative

setup (using Kafka before sending data to HDFS) has more than 100 kB per second higher

throughput.

However, the reason for this big difference for three sources is not known. Because of how similar the values for the two setups are for one and two data sources, it is reasonable to think that the throughput for three sources (when the file system is added as a source) should be around 300 kB/s for both setups. When the third test was first performed, the throughput for the alternative setup was around 700 kB/s when using the same processors and file system configurations as when testing the new setup (apart from the PublishKafka_2_0 processor). Some tweaks were made to the content of the files getting pulled from the file system, since some special characters seemed to be taking up more space than 1 byte when being sent through Kafka. After this, the result shown in figure 5.7 was achieved. Since the expected value of about 320 kB/s could not be obtained for the alternative setup, the throughput of about 440 kB/s was decided to be close enough. The reason behind the deviating value is still unclear; it could depend on the implementation of the alternative new setup, or on some other factor that we do not know about. Since this value deviates so much from the rest of the values in the test, and for no clear reason, it will not be regarded as reliable to draw conclusions from.
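One way special characters can take up more than 1 byte is through a multi-byte text encoding. The sketch below assumes UTF-8, which the report does not confirm, and the sample characters are chosen for illustration only; they are not the actual file contents from the test.

```python
def utf8_size(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


ascii_payload = "a" * 1000     # 1 000 plain ASCII characters
special_payload = "å" * 1000   # 1 000 non-ASCII characters

print("ASCII payload bytes:  ", utf8_size(ascii_payload))
print("special payload bytes:", utf8_size(special_payload))
```

Under this assumption, a file that is nominally 1 000 characters long can be noticeably larger than 1 kB on the wire, which would inflate a throughput figure measured in bytes.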

Figure 5.8: Average FlowFile latency for the two different setups.

Figure 5.9: Latency distribution for the two different setups.

The average latency for both setups is shown in figure 5.8. This figure shows the average latency for each respective setup and how it changes for each source added.

When looking at the average latency, the results seem to increase somewhat for each added source for both the new and the alternative setup, and the new setup seems to have slightly lower latency on average.

However, when looking at the latency distribution in figure 5.9, it is clear that the differences in latency are not significant. This is likely due to the fact that the implementations of the two setups are very similar, with only one more step, through Kafka, having to be done for the alternative setup compared to the new setup.

One of the reasons why there was such a small difference in performance between the two setups might be that we did not implement a connection between Kafka and HDFS in the alternative new setup. In this setup, NiFi sends the FlowFiles to a Kafka topic, where they get gathered into a single file by a Python script (available in Appendix A.1). The file then gets sent to HDFS through the terminal. If a connection between Kafka and HDFS had been implemented, the speed of this connection could have been tested against the one between NiFi and HDFS. This needs to be taken into consideration when comparing the performance of both setups.
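The gathering step can be sketched as follows. This is not the script from Appendix A.1; it is a hypothetical illustration that takes any iterable of byte records (in the real setup these would come from a Kafka consumer), appends an arrival timestamp to each record, and writes everything to one local file. The record format and the comma separator are assumptions.

```python
from datetime import datetime, timezone


def gather_records(records, output_path, clock=lambda: datetime.now(timezone.utc)):
    """Append each record plus an arrival timestamp to a single local file.

    records is any iterable of bytes objects; clock returns the current time
    and can be replaced to make the function deterministic.
    """
    count = 0
    with open(output_path, "ab") as out:
        for record in records:
            arrival = clock().isoformat().encode("utf-8")
            out.write(record.rstrip(b"\n") + b"," + arrival + b"\n")
            count += 1
    return count


if __name__ == "__main__":
    # Stand-in for records read from a Kafka topic.
    sample_records = [b"1,2020-05-01T12:00:00,data", b"2,2020-05-01T12:00:01,data"]
    written = gather_records(sample_records, "gathered.txt")
    print("records written:", written)
```

The resulting file could then be uploaded manually, for example with a command such as `hdfs dfs -put`; the report only states that the upload was done through the terminal, so the exact command is an assumption.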

5.5 Conclusion of Results

The results obtained through the various tests were roughly as expected in our hypotheses (presented in section 4.1).

The ﬁrst test gave some slightly unexpected results. The diﬀerence between not merging

FlowFiles in NiFi at all and merging them together into 10 kB FlowFiles was not expected

to be so large, as shown in ﬁgure 5.1. Other than this, it was expected that some merging

would be better than doing no merging at all, and that large merging sizes would lead to

long queues before merging in NiFi. The results showed that the lowest latency results

were found when the merging size was between 10 and 100 kB, and this result was then

used to set the default merging size to 10 kB for the following tests.

The results from the second test show that it is beneﬁcial to combine diﬀerent data

sources before sending the data to HDFS for larger merging sizes. Though it might seem

from the average latency shown in ﬁgure 5.5 that it is better to do separate merging for

small merging sizes, ﬁgure 5.6 shows that the spread of values for combined merging at a

10 kB merging size is very big, and both for a merging size of 10 kB and 50 kB the results

overlap. This means that there is no signiﬁcant diﬀerence in the latency for 10 kB or 50

kB merging size for this implementation. Since combined merging means that the data

sources share a queue for ﬁlling up the merging batches, it is generally preferred to combine

sources before merging with regards to latency, and doing this also gave signiﬁcantly better

results when the merging size was set to 1 000 kB.

The third and ﬁnal test gave mostly expected results. We were not expecting there

to be any signiﬁcant diﬀerence in the latency or throughput for the two setups being

compared, since they are so similar. The box plot in ﬁgure 5.9 shows that there is no

signiﬁcant diﬀerence in the latency for the setups. Looking at ﬁgure 5.7 makes it seem like

the alternative setup has a much higher throughput when using three data sources, but

this is thought to be due to implementation issues. This result is not deemed reliable to

draw conclusions on. Instead, the other measuring points lead to the conclusion that the

throughput increases quite linearly for each added 1 kB data source, and the diﬀerence

between the setups is negligible. When it comes to simplicity of implementation however,

the new setup wins over the alternative setup. The alternative setup needs one more step

than the new setup, and as has already been mentioned, it was hard to set up a good connection

between Kafka and HDFS (something that was much easier to accomplish between NiFi

and HDFS in the new setup).

6 Conclusions

6.1 Project Summary and Evaluation

The purpose of this project was to learn about data flows and data pipelines, and more specifically the tool Apache NiFi. A data pipeline setup was designed in NiFi, which connected an SQL database, a file system, and an Apache Kafka topic to a distributed file system. After successfully setting this data flow up, some tests were performed to see how the system handled different workloads, and when the best results were achieved. The three tests were to determine what size NiFi FlowFiles should be merged into, whether files from different data sources should be kept separate or merged together, and finally how the data flow setup compares with an alternative setup, which has a Kafka topic as an intermediary between NiFi and the distributed file system. The results showed that merging FlowFiles together into 10 kB files gave the lowest latency, and that merging together FlowFiles from all sources gave better latency than keeping them separate. Finally, it was shown that there was no significant difference between the NiFi flow that sent data directly to the distributed file system, and the flow that first sent data to a Kafka topic.

A lot of time for this project, especially in the beginning phases, went into investigating and gathering information about Apache NiFi and the other tools used for this project. This was necessary in order for us to understand how all the tools and systems work, before we started setting up our implementation. At times this information gathering was quite hard, since the information available often left much to be desired. Most of the tools only had a sparse or very complicated “documentation” page on the product website, and maybe a Wikipedia page. Luckily, NiFi itself had several official documentation pages, and there are also some YouTube videos from people in the NiFi project with some good information.

The NiFi implementation phase of this project was relatively easy to handle. NiFi

does not require much in terms of conﬁguration to be able to work and handle data ﬂows.

The main problem in this phase was to make sure that NiFi could establish connections

to the other nodes and software. This was done through conﬁguring the properties of

each processor in NiFi and was a bit of a hurdle, but it was still doable. Our biggest

implementation problem however was to learn how to use and implement Apache Kafka

and HDFS. A large part of our implementation setup went into making sure that Kafka

was working between nodes, and setting up HDFS on one node. By experiencing ourselves

how much easier NiFi is to set up than Kafka is, we deﬁnitely prefer NiFi over Kafka. The

main problem for setting up HDFS was to ﬁnd a way to simulate a distributed ﬁle system

on only one node. After some help from our mentor however, this was solved.

The other tools used for the setup, MariaDB and the (non-distributed) file system, were relatively easy to implement and set up. We had experience working with SQL databases before the project, so setting up the MariaDB database was quite straightforward. Setting up a directory for the file system in one of the VMs was also very straightforward.

While setting up NiFi and MariaDB on their own was quite simple, setting up the connection between the two took some time. The documentation on how to use the controller service, which is needed to create this connection, is quite sparse and unclear at times.

The project has shown that it is fully possible to implement a data pipelining setup

with Apache NiFi getting data from several diﬀerent sources, and sending the data to

an HDFS sink. Even though there have been plenty of bumps along the road, the final

implementation ended up being quite easy to manage and test.

6.2 Future Work

One interesting scenario to test against the two new setups presented in this report would

be a NiFi setup that completely cuts out Kafka from the equation. In this setup, data

from a PLC would be piped directly into NiFi, without going through Kafka ﬁrst. One

way this could be done is by using the NiFi sub-project MiNiFi. This is a setup that was

suggested by Uddeholm AB, but it had to be cut due to time constraints.

Another setup that the task provider suggested was to test NiFi as an ETL (Extract,

Transform, Load) tool. In this setup, NiFi would extract data from a SQL or NoSQL

database, transform it in some way (for example perform a join operation), and then load

the data back to the table. This was decided to be slightly outside the scope of the

report, but it could still be interesting to explore.

In the setup which uses Kafka in between NiFi and HDFS, a good connection between

Kafka and HDFS was not found. Due to time limitations it was decided to instead put

data to HDFS manually (by terminal command) in the alternative setup. Setting up a

good, fast connection between these two tools, and comparing the latency between the

data source and HDFS would give a better view of how the two diﬀerent setups compare.

One thing that would be very interesting to test would be to change the block size in

HDFS. This could be done by for example recreating the ﬁrst test in this project, but with

various block sizes. We unfortunately did not have time to test this since we did not know

it was a possibility to change the block size until very late into the project.

Another expansion of the project that could be made is to perform the same tests, but

take more measuring points. For example, it would be interesting to see what the latency

in the ﬁrst test looks like at some points between no merging and 10 kB merging size.

There are many variables in both the setups we tested and in NiFi itself that could be

changed to possibly increase the performance of a data ﬂow. For example, looking more

at the diﬀerent properties in the nifi.properties ﬁle could be interesting.

Further, it would be interesting to set up the same data flow on several different platforms, such as Apache Kafka, Airflow, Storm, etc. Seeing how these different tools compare in terms of latency, throughput, and simplicity of usage would be very useful for someone who is interested in setting up a data flow, but unsure which tool to use.

Along the same line, an expansion of the project could be to compare the results

achieved from the new setups to the old setup. Since the task provider might change from

the old setup to the new one, it would be of interest to see how these diﬀer in terms of

performance.

Lastly, another interesting project would be to explore working with NiFi's clustering functionality. We decided to only work with a single node of NiFi, but it would be interesting to see how using the distributed mode of NiFi would affect the performance.


A Appendix

All appendix content is uploaded to a GitLab repository, apart from the pictures in Appendix A.5. Specific links are given for each appendix; the link to the full repository is:

https://git.cse.kau.se/linavilh100/appendixapachenifi.git

A.1 Python Script for Processing Kafka Messages

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/pythonscript

The Python script handles messages that originated in the MariaDB table, which are the files that contain a timestamp for when they were created. The other files were processed with a similar script; the only differences are that the topic name was topic3 instead of topic2 and that the output file is called secondfile.txt instead of file.txt. Keeping the two outputs separate made it easier to find the timestamp files when extracting the raw data.
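The linked script is not reproduced here. As a rough illustration of the kind of processing described above, the following is a minimal sketch and not the authors' code. It assumes the kafka-python client, a broker reachable at localhost:9092, and UTF-8 text messages; the function names append_messages and main, the broker address, and the timeout value are hypothetical choices made for this example. Only the topic name topic2 and the output file file.txt come from the description above.

```python
from pathlib import Path


def append_messages(messages, out_path):
    """Append the value of each message to out_path, one message per line.

    `messages` is any iterable of objects with a `.value` attribute,
    which is how kafka-python exposes consumed records.
    """
    count = 0
    with Path(out_path).open("a", encoding="utf-8") as out:
        for message in messages:
            value = message.value
            # Kafka delivers raw bytes unless a deserializer is configured.
            if isinstance(value, bytes):
                value = value.decode("utf-8")
            out.write(value.rstrip("\n") + "\n")
            count += 1
    return count


def main(topic="topic2", out_path="file.txt", servers="localhost:9092"):
    # Imported here so append_messages can be used without a Kafka client.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,      # assumed broker address
        auto_offset_reset="earliest",   # start from the oldest available message
        consumer_timeout_ms=10000,      # stop iterating after 10 s without messages
    )
    written = append_messages(consumer, out_path)
    consumer.close()
    print(f"wrote {written} messages from {topic} to {out_path}")
```

Under these assumptions, calling main() would cover the MariaDB-originated messages, and main(topic="topic3", out_path="secondfile.txt") would correspond to the second script described above.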

A.2 SQL Script for Loading Rows into MariaDB

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/sqlscript

A.3 Software Download Links

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/downloads

A.4 Raw Data

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/tree/master/data

A.5 Pictures

Figure A.1: Full-size version of Figure 3.4

Figure A.2: Full-size version of Figure 4.2

Figure A.3: Full-size version of Figure 4.3
