Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi PDF Free Download

1 / 76
0 views76 pages

Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi PDF Free Download

Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi PDF free Download. Think more deeply and widely.

The Faculty of Health, Science and Technology
Computer Science
Pontus Sj¨oberg, Lina Vilhelmsson
Implementation and Evaluation of a Data
Pipeline for Industrial IoT Using Apache
NiFi
Bachelor’s Project
2020:06
Implementation and Evaluation of a Data
Pipeline for Industrial IoT Using Apache
NiFi
Pontus Sj¨oberg, Lina Vilhelmsson
c
2020 The author(s) and Karlstad University
This report is submitted in partial fulfillment of the requirements
for the Bachelor’s degree in Computer Science. All material in
this report which is not my own work has been identified and
no material is included for which a degree has previously been
conferred.
Pontus Sj¨oberg
Lina Vilhelmsson
Approved, June 01, 2020
Advisor: Prof. Andreas Kassler
Examiner: Per Hurtig
iii
Abstract
In the last few years, the popularity of Industrial IoT has grown a lot, and it is expected to
have an impact of over 14 trillion USD on the global economy by 2030. One application of
Industrial IoT is using data pipelining tools to move raw data from industry machines to
data storage, where the data can be processed by analytical instruments to help optimize
the industrial operations.
This thesis analyzes and evaluates a data pipeline setup for Industrial IoT built with
the tool Apache NiFi. A data flow setup was designed in NiFi, which connected an SQL
database, a file system, and a Kafka topic to a distributed file system.
To evaluate the NiFi data pipeline setup, some tests were conducted to see how the
system performed under different workloads. The first test consisted of determining which
size to merge a FlowFile into to get the lowest latency, the second test if data from the
different data sources should be kept separate or be merged together. The third test
was to compare the NiFi setup with an alternative setup, which had a Kafka topic as an
intermediary between NiFi and the endpoint.
The first test showed that the lowest latency was achieved when merging FlowFiles
together into 10 kB files. In the second test, merging together FlowFiles from all three
sources gave a lower latency than keeping them separate for larger merging sizes. Finally,
it was shown that there was no significant difference between the two test setups.
v
Acknowledgements
We want to thank our mentor at Karlstad University, Andreas Kassler, for helping us with
writing our report and guiding us through the project. We also want to thank Erik Hallin,
our mentor at Uddeholm AB, for helping and guiding us with all the different tools and
software we used throughout the project, and giving us insight in how our implementation
might be used in an industry context. Lastly, we want to thank Uddeholm AB for letting
us do this project for them.
vi
Contents
1 Introduction 1
2 Background 3
2.1 Introduction................................... 3
2.2 Concepts..................................... 3
2.2.1 Industrial Internet of Things . . . . . . . . . . . . . . . . . . . . . . 3
2.2.2 DataPipelining............................. 4
2.2.3 DataStreaming............................. 5
2.3 ApacheKafka.................................. 6
2.3.1 Topics .................................. 6
2.3.2 Cluster.................................. 6
2.3.3 Producers ................................ 7
2.3.4 Consumers................................ 7
2.4 ApacheNiFi................................... 8
2.4.1 PrimaryComponents.......................... 9
2.4.2 Extensions................................ 11
2.4.3 Security ................................. 12
2.4.4 Cluster.................................. 13
2.4.5 Compatibility.............................. 14
2.5 NiFi as a Producer and Consumer for Kafka . . . . . . . . . . . . . . . . . 14
2.5.1 MiNiFi.................................. 14
2.5.2 NiFiasaProducer ........................... 15
2.5.3 NiFiasaConsumer........................... 16
2.6 RelatedTools .................................. 16
2.6.1 ApacheAirow ............................. 16
2.6.2 ApacheSpark.............................. 17
vii
2.6.3 ApacheStorm.............................. 17
2.6.4 AzureDataFactory........................... 18
2.6.5 Logstash................................. 18
3 Data Pipelining Architecture and Prototype 19
3.1 Introduction................................... 19
3.2 CurrentSetup.................................. 19
3.3 WhyBringinNiFi?............................... 21
3.4 NewPipeliningSetups ............................. 23
3.4.1 NewSetup................................ 23
3.4.2 Alternative New Setup . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 NiFiProcessorsUsed.............................. 26
3.5.1 Consuming from Kafka Topic . . . . . . . . . . . . . . . . . . . . . 27
3.5.2 Getting Files from File System . . . . . . . . . . . . . . . . . . . . 27
3.5.3 Getting Data from MariaDB . . . . . . . . . . . . . . . . . . . . . . 28
3.5.4 OtherProcessors ............................ 29
4 Experimental Setup 31
4.1 Introduction................................... 31
4.2 Additional Software Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 Apache Hadoop and HDFS . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 MariaDB ................................ 33
4.3 ComputeNodes................................. 34
4.3.1 Node1.................................. 34
4.3.2 Node2.................................. 35
4.3.3 Node3.................................. 35
4.4 Experiment Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 35
viii
4.4.2 TestDescriptions ............................ 37
5 Results & Evaluation 40
5.1 Introduction................................... 40
5.2 ResultsofTest1 ................................ 40
5.3 ResultsofTest2 ................................ 44
5.4 ResultsofTest3 ................................ 46
5.5 ConclusionofResults.............................. 48
6 Conclusions 50
6.1 Project Summary and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 50
6.2 FutureWork................................... 51
References 54
A Appendix 59
A.1 Python Script for Processing Kafka Messages . . . . . . . . . . . . . . . . . 59
A.2 SQL Script for Loading Rows into MariaDB . . . . . . . . . . . . . . . . . 59
A.3 Software Download Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.4 RawData .................................... 59
A.5 Pictures ..................................... 60
ix
List of Figures
2.1 A simplified view of the Kafka architecture . . . . . . . . . . . . . . . . . . 7
2.2 NiFisGUI.................................... 9
2.3 A simplified view of the NiFi architecture . . . . . . . . . . . . . . . . . . . 9
2.4 A simplified view of a NiFi Cluster . . . . . . . . . . . . . . . . . . . . . . 14
3.1 The current setup at Uddeholm AB . . . . . . . . . . . . . . . . . . . . . . 20
3.2 New setup with NiFi sending data directly to HDFS . . . . . . . . . . . . . 24
3.3 Alternative new setup with NiFi sending data to HDFS through Kafka . . 25
3.4 The processors used in NiFi for the new setup . . . . . . . . . . . . . . . . 26
4.1 The three compute nodes used for the experiment. . . . . . . . . . . . . . . 34
4.2 NiFi data flow for the second test. . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 NiFi data flow for alternative new setup used for the third test. . . . . . . 39
5.1 Average FlowFile latency for different merging sizes. . . . . . . . . . . . . . 40
5.2 Percentage of the total average FlowFile latency made up by the time be-
tween MariaDB and NiFi, before being sent to HDFS. . . . . . . . . . . . . 42
5.3 Average FlowFile latency for different merging sizes. . . . . . . . . . . . . . 42
5.4 Latency distribution for different merging sizes. . . . . . . . . . . . . . . . 42
5.5 Average FlowFile latency for different merging sizes, comparing separate
andcombinedmerging.............................. 44
5.6 Latency distribution for combined and separate merging. . . . . . . . . . . 44
5.7 Average throughput for the two different sources with different amounts of
1kBsources. .................................. 46
5.8 Average FlowFile latency for the two different setups. . . . . . . . . . . . . 47
5.9 Latency distribution for the two different setups. . . . . . . . . . . . . . . . 47
A.1 Full-size version of Figure 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . 60
A.2 Full-size version of Figure 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.3 Full-size version of Figure 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . 62
x
List of Tables
4.1 Intervals for achieving different merging sizes . . . . . . . . . . . . . . . . . 37
xi
1 Introduction
The Industrial Internet of Things, or Industrial IoT, is a subset of Internet of Things (IoT)
specific for an industrial use, and it covers the machine-to-machine and industrial com-
munication parts of IoT. [1] Industrial IoT has grown a lot in the past few years, and it
is expected by some to have an impact on the global economy of over 14 trillion USD by
the year 2030. [2] Industrial IoT focuses on integrating and interconnecting already exist-
ing devices, whereas ”consumer” IoT (e.g. smart devices) focuses more on creating new
devices. An example of an Industrial IoT application is collecting large amounts of data
from industrial machines, and sending this data to various analytical tools which then can
optimize the industrial operations, based on how the machines are performing currently.
[1, 3] One way this can be done is by the use of data pipelining tools, which are tools for
moving data from one place to another. [4]
In this project, we will try to evaluate data pipelining and data flows in the context
of Industrial IoT by creating data pipelining setups in the tool Apache NiFi (or, simply
NiFi) and try to find the best way to include NiFi into an architecture where data needs
to be moved from several starting points into a cloud-based file system. To evaluate this
setup, we will also be testing the performance of the data flow setup in NiFi with different
configurations and workloads.
Currently, there are very few scientific papers available that test the performance of
a NiFi data flow. [5, 6] Therefore, the result of this project is interesting to the task
provider Uddeholm AB, as they are looking into using NiFi as part of their data streaming
architecture. The specific tests performed in this project are designed with Uddeholm AB
in mind, to answer the questions they have about the performance of a NiFi data flow
setup.
1
The disposition of the report is as follows:
In Chapter 2, some background to the technologies, concepts, and tools used is given.
The technologies and concepts described in this chapter are Industrial IoT, data pipelining,
and data streaming. The tools Apache Kafka and Apache NiFi are also described in detail
in this chapter, along with some shorter information about some alternative data pipelining
tools. Chapter 3 describes the data pipelining prototype designed and implemented in the
project. The ideas behind the design and why Apache NiFi is used are explained. The new
setup is described both in a more abstract view, and more specifically by looking at how
the NiFi flow was set up. Chapter 4 presents the experiments that were performed to test
the NiFi implementation. A more specified description of the setup is given, explaining
the exact software that was used. The setups for the three tests that were performed
are described. Chapter 5 presents the results of the tests described in part 4, along with
evaluations of the results. Finally, Chapter 6 gives a conclusion of the report. Possible
future work is also proposed here.
2
2 Background
2.1 Introduction
This chapter gives a background to the technologies and the main tools used in the project.
The chapter starts with part 2.2, which goes through some of the important concepts and
technologies that are relevant to the project. Part 2.3 gives a description of Apache Kafka,
and part 2.4 describes Apache NiFi. Part 2.5 looks at the possibilities of combining NiFi
and Kafka. Part 2.6 introduces some other related tools.
2.2 Concepts
2.2.1 Industrial Internet of Things
The Industrial Internet of Things, or the Industrial IoT, is a subset of the Internet of
Things, IoT, which is an umbrella term for systems of interconnected computing devices
and machines which can transfer data to each other without human-to-human or human-
to-computer interaction. The popularity of IoT has increased in the last few years, with
the rise of ”smart” consumer devices such as smartphones, the Apple Watch, and Amazon
Echo. [7] This category of devices, while often colloquially called simply ”IoT” devices, is
only a part of all of IoT, and is classified in [1] as consumer IoT, to distinguish it from
Industrial IoT. The Industrial IoT subset covers the machine-to-machine and industrial
communication parts of IoT. Industrial IoT is about connecting the domains of operational
technology with information technology (IT) domains. For example, large amounts of data
collected from industrial machines can be connected to analytical instruments which can
help optimize the industrial operations. While consumer IoT focuses on creating new
devices, Industrial IoT focuses more on already existing devices, and how to integrate and
interconnect these with each other. There is also a difference in the amount of generated
data from consumer IoT and Industrial IoT. Consumer IoT data volumes are dependent
3
on the application, whereas Industrial IoT data is meant for analytics, which leads to
Industrial IoT generating very large amounts of data, up to several terabytes of data every
minute. [1, 3]
The architecture of Industrial IoT systems can be visualized as being built up of four
layers. The topmost layer is the content layer, which encompasses the user interface of the
system, such as a screen. The next layer is the service layer, which holds applications that
can perform operations on data which then can be displayed in the content layer. Next is
the network layer, which consists of physical network buses and communication protocols,
which aggregate data and then transport it to the service layer. The bottom layer is the
device layer, which contains the physical (hardware) components of the system, such as
sensors, machines, and cyber-physical systems (CPS). [8]
2.2.2 Data Pipelining
Data pipelining is an implementation technique where instructions are carried out in par-
allel instead of one after the other. In a system which does not use pipelining, each item
will have to go through all of the instructions in the system before the next item can enter
into the system. When using pipelining, each item still has to go through all the necessary
steps and instructions, but as soon as the first item moves from the first to the second
instruction, the next item in line can go through the first instruction. By using pipelining,
the latency for each individual item going through the system will stay the same, however,
the throughput of a system will be increased. [9]
Another aspect of pipelining is that the output of one process (or instruction) is the
input of the next process in the pipeline. Just like with a real-life pipe built up out of
multiple different segments, the data pipeline brings data from the input through multiple
processes which all feed into the next process in line, until the data finally reaches the
output point. [4]
The latter aspect of pipelining, where the output of one process becomes the input of
4
the next process, is the main function that a software pipelining tool provides. There are
many tools on the market for data pipelining, all with different strengths and weaknesses.
[10, 11, 12, 13, 14] This thesis will focus primarily on Apache NiFi and secondarily on
Apache Kafka, but some alternative tools are described in part 2.6.
With the use of data pipelining tools, raw data can be moved from one system to
another. Usually, this means moving data from sensors, databases, etc., and putting the
data into a data lake, the cloud, or some other data storing solution. Here, the data will
be stored and analyzed. This analysis can be performed with machine learning algorithms,
which can notice patterns and anomalies in the data. This is an example of how data
pipelining tools can be used for Industrial IoT. [1, 15]
2.2.3 Data Streaming
Data streaming is the process of data being continuously generated by many different
sources, where each data point usually is quite small (in the order of kilobytes). Streamed
data can be processed with stream processing techniques, where each piece of data in
the stream has operations done to it while the data is being streamed from one point
to another. Streaming data is a good technique to use when the goal is to continuously
analyze dynamic data and make decisions based on how the streamed data is behaving,
for example with machine learning algorithms. [16]
Stream processing can be compared to batch processing, which is a method where jobs
or tasks are processed in batches, instead of one after the other as with stream processing.
[17]
5
2.3 Apache Kafka
Apache Kafka is a distributed data streaming platform developed by LinkedIn in 2011.
[18] Kafka uses publish-subscribe techniques to stream messages. In Kafka, the publishers
are called producers, and the subscribers are known as consumers. The producers and
consumers are connected through a server, in Kafka terms called a broker, which takes in
the messages from the producers and sends them to the appropriate consumer(s). The
messages are stored and organized in different topics within the broker. Producers send
messages to the topics, and consumers read messages from the appropriate topic. [19, 20]
2.3.1 Topics
Topics are categories into which record streams are published in Kafka. Topics are multi-
subscriber, which means that each topic can have zero, one, or more consumers that
subscribe to the contents. Each topic has a partitioned log, which is an ordered sequence
of records to which new records are appended. This partitioning allows a topic to have a
log that is larger than what can fit on a single server. The records in each partition are
given unique sequence id’s called offsets, to be able to uniquely identify each record within
a partition. All published records within a topic are stored for a specified amount of time,
during which it can be read by consumers. This period of time is called the retention
period. After this time is up, the record is discarded in order to free up space in the topic.
[19]
2.3.2 Cluster
One or more brokers build up a Kafka cluster. The basic architecture of Kafka can be
seen in figure 2.1. The brokers are managed by a ZooKeeper, another Apache product,
which also counts as part of the cluster. Apache ZooKeeper is a service for coordinating
distributed systems like Apache Kafka. ZooKeeper provides a shared hierarchical name
space for coordination between distributed processes. [21]
6
Figure 2.1: A simplified view of the Kafka architecture
2.3.3 Producers
A producer in Kafka can publish data to whichever topic(s) they choose. The producer
also chooses which record should be assigned to which partition within the chosen topic.
This can be done in different ways, either using Round Robin to distribute the load evenly,
or based on some semantic partitioning function (e.g. based on some key in the record).
[19]
2.3.4 Consumers
A Kafka consumer can subscribe to one or more topics, and get the data that is published
to these topics by the producers. Consumers are divided into consumer groups, and within
these groups, only one consumer gets the records it subscribed to, and the record will
then be distributed to all members of the consumer group. Commonly, the consumers are
grouped together as ”logical subscribers”, i.e. all consumers which are subscribed to the
same set of topics create a consumer group together. Each group will usually contain many
consumer instances, which leads to higher fault tolerance. [19]
7
2.4 Apache NiFi
Apache NiFi is an open-source platform for managing data flows and data pipelines through
a system. Apache NiFi started out as Niagarafiles, developed by the American National
Security Agency (NSA) in 2006. NiFi was donated to the Apache Software Foundation
(ASF) in 2014, and in 2015 it became a Top Level Project for ASF. The design of the NiFi
platform design is based on the flow-based programming (FBP) model. The FBP model
offers features that include the ability to run within clusters, security with TLS encryption,
extensibility and improved usability features. [22] NiFi was created with IoT in mind, and
is often used in an Industrial IoT context. [23, 24, 25, 26, 27]
NiFi’s core is a data flow that is built on the main concepts of FlowFiles, FlowFile
Processors, and Connections. FlowFiles represent each object moving through the data
flow. Each FlowFile has a UUID (Universally Unique Identifier), a file name, a size, and
can also have content, although this is not necessary. FlowFile Processors is where the
actual work in NiFi is performed. The processors get access to FlowFiles, their attributes
and their content. Processors are building blocks for NiFi data flows, and they are the
most used component in NiFi. Processors can be used, for example, to create, delete,
modify, or inspect FlowFiles before sending them to the next processor or to an endpoint
outside NiFi. Processors can be grouped together into Process Groups, which are similar to
subnets in the FBP model. Each process group is a set of processors and their connections
which can receive and send data through input and output ports. By using process groups,
creation of new components is done simply by combining other components. Connections
provide linkage between processors, and they act as queues which in turn allow processors
to interact at different rates. [23]
NiFi is run through a graphical user interface (GUI) which makes it easy to visualize the
flow components (see figure 2.2), unlike some other similar tools (such as Apache Kafka)
which only use the command line interface. The GUI uses drag-and-drop techniques for
building up the data flow with processors and connections.
8
Figure 2.2: NiFi’s GUI
Figure 2.3: A simplified view of the NiFi architecture
2.4.1 Primary Components
NiFi is a Java program that is run on a Java Virtual Machine (JVM) on the hosting
server. NiFi’s primary components on the JVM are the Web Server, the Flow Controller,
Extensions (or extension points), the FlowFile Repository, the Content Repository, and
the Provenance Repository (see figure 2.3).
The Web Server hosts NiFi’s HTTP-based commands and control API.
The Flow Controller is the component that provides threads for extensions to run on.
9
It manages when extensions receive resources and when they execute. The Flow Controller
acts as a broker between processors, facilitating the exchange of FlowFiles. The extensions
will be described in further detail in part 2.4.2.
The FlowFile Repository is the component that allows NiFi to keep track of the state
of what is known about the FlowFile that is currently active in the data flow. It acts as a
Write-Ahead-Log of the metadata of each FlowFile that currently exists in the system. This
metadata includes the attributes of a FlowFile, a pointer to the content of the FlowFile (in
the Content Repository, more about this repository later in this section) and the state of the
FlowFile. Each change of a FlowFile is logged in this repository before it is executed. With
this information in the Write-Ahead-Log, NiFi is able to handle restarts and unexpected
system failures. NiFi resumes where it was stopped by checking the Write-Ahead-Log along
with the snapshot of the FlowFile in the Provenance Repository. The default approach of
the FlowFile Repository is to have the Write-Ahead Log on a specified disk partition but
the repository itself is pluggable (can be plugged in and out at will).
The actual content of a FlowFile which is currently in the NiFi flow is located in
the Content Repository. This content is stored locally on disk and is only read to the
JVM memory when needed. By doing this, the producer and consumer processors do
not need to hold on to objects in memory. When the content of a FlowFile is no longer
in use, the content will be deleted or archived. Archiving of content can be enabled or
disabled in the nifi.properties file. If the content is archived it will remain in the
Content Repository until a certain amount of time has passed, a maximum archiving
time which is set in nifi.properties. If the Content Repository is taking up too much
space, content that is archived or marked as no longer in use will be deleted, this is also
maintained in nifi.properties. The Content Repository is also pluggable but by default
the implementation of the Content Repository is a simple mechanism where blocks of data
are stored in the file system. To help reduce conflict on any single volume, more than one
file storage location can be specified to get different physical partitions.
10
In the Provenance Repository, NiFi stores all the provenance event data (metadata).
Provenance is a record of all that has happened to a FlowFile, i.e. the history of the
FlowFile. A new provenance event is created every time an event occurs to a FlowFile.
These provenance events give a snapshot of the FlowFile at that time. All of the attributes
and pointers to the FlowFile’s content are copied and stored in the provenance event, along
with the state of the FlowFile. The state can include information about the FlowFile’s
relationship with other provenance events, among other things. The provenance events
are stored in the Provenance Repository until they expire. The time until they expire is
specified in the nifi.properties file. The construct of the repository is pluggable but
by default it is implemented in such a way that it uses one or more physical disk volumes.
Event data is indexed and searchable within each location. [23, 28]
2.4.2 Extensions
As mentioned previously, NiFi has a number of extension points. These extension points
give developers the ability to add features to the platform so that NiFi meets their needs.
The most common extension points, along with the already described processors, are the
ReportingTask, the ControllerService, the FlowFilePrioritizer, and the AuthorityProvider.
[29]
The ReportingTask interface is the mechanism that allows NiFi to publish metrics,
monitoring information and internal NiFi states to endpoints. Endpoints can be log
files, e-mail, or remote web services, for example.
The ControllerService gives shared state and functionality to processors, other Con-
trollerServices and ReportingTasks in a single JVM. This could for example be used
when loading a large data set into memory. By using a ControllerService the data set
can be loaded once and given to all Processors instead of each Processor individually
loading the dataset. The ControllerService is also used if there is a need to establish
a connection to an external server.
11
The FlowFilePrioritizer interface provides prioritizing and sorting for FlowFiles that
are placed in a queue so that they can be processed in the order that is most effective
for that use case.
The AuthorityProvider is used for determining privileges and roles for a given user, if
there are any.
2.4.3 Security
NiFi can ensure a secure connection by using SSL, SSH, HTTPS and encryption of content.
[30] NiFi uses two-way SSL in the data flows from system to system but also from user to
system. In system-to-system cases NiFi uses two-way SSL in every exchange of the data
flow by encrypting and decrypting through shared keys on each side for both the sender
and the recipient. For user-to-system security, NiFi uses two-way SSL authentication to
be able to authorize a user and control the correct level of access for the user (read-only,
data flow manager, or admin). [23]
NiFi also supports authentication of users through client certificates with login and
password instead of the built-in SSL authentication. This can be done by a ”Login Iden-
tity Provider”, which is a pluggable mechanism for authenticating users with login and
password. Which Provider to use is configured in the nifi.properties file. The au-
thentication can currently be done through Apache Knox, Kerberos, OpenID connect, or
Lightweight Directory Access Protocol. [31]
NiFi also provides Multi-Tenant Authorization. Multi-Tenant Authorization means
that the authority level of a data flow is applied to each component of that data flow.
Once authenticated, users are split up into groups of users (tenants) that can command,
control or observe the data flow with different levels of authorization. If a user tries to view
or modify a NiFi resource, the system will check if the user has privileges to perform that
action. The privileges are determined by policies and these policies are managed by an
authorizer. The authorizers are configured with two properties in the nifi.properties
12
file, the nifi.authorizer.configuration.file property and the nifi.security.user
property.
The first property specifies the configuration file and it is here that the authorizers are
defined. By default the configuration is set so that the authorizers.xml file is selected.
The authorizers.xml file is used to configure and define authorizers, the default authorizer
is the StandardManagedAuthorizer. This authorizer contains a UserGroupProvider and a
AccessPolicyProvider. These providers are used to load and optionally configure the users,
groups and access policies. The StandardManagedAuthorizer will make all of the access
decisions based on the these policies.
The second property indicates which of the authorizers that are configured in the
authorizers.xml file should be used. [31, 32]
2.4.4 Cluster
Like Apache Kafka, NiFi can operate within a cluster, and NiFi does this by using a Zero-
Master Clustering paradigm. Zero-Master Clustering entails that each node in the cluster
performs the same tasks but on different sets of data. NiFi uses an embedded ZooKeeper,
see figure 2.4, which chooses a node as the Cluster Coordinator. The Cluster Coordinator
is in turn responsible for connecting and disconnecting nodes. Each node in the cluster
reports heartbeat and status of the node to the Cluster Coordinator, and the Cluster
Coordinator disconnects a node if it does not report any heartbeat status for a set amount
of time. Each cluster also has a Primary Node. The Primary Node is able to run isolated
processes. Since the rest of the nodes run all of the tasks in the data flow, the Primary
Node can run a process in isolation. It is possible to configure in a processor whether to
execute on all nodes in the cluster, or only on the Primary Node. Any potential fail-over
is handled automatically by the ZooKeeper. [29, 31]
13
Figure 2.4: A simplified view of a NiFi Cluster
2.4.5 Compatibility
Through its processors, NiFi is compatible with many other tools. NiFi can work against
several different databases, both SQL (like Apache Hive, InfluxDB, and MariaDB) and
NoSQL (MongoDB, Couchbase, DynamoDB, and HBase). NiFi can read from and write
to both regular file systems (both remotely through SSH and locally) and the distributed
file system HDFS. [23] As described in part 2.5, NiFi can use Kafka as both a sink and
source for data.
2.5 NiFi as a Producer and Consumer for Kafka
NiFi can work as both a producer and a consumer for Kafka by implementing the processors
PublishKafka and ConsumeKafka respectively. The NiFi sub-project MiNiFi can be used
as an alternative to just using PublishKafka. [33]
2.5.1 MiNiFi
MiNiFi is a sub-project of Apache NiFi. Whereas NiFi is implemented in a data center,
MiNiFi is implemented at the ”edge”, i.e. close to where the data is created at sensors or
an IoT implementation. MiNiFi does not have its own UI, instead the flows are created in
14
NiFi and then exported into MiNiFi. MiNiFi is much smaller than NiFi, and takes up less
than 100 MB of space, whereas NiFi needs several GB of space. [34]
2.5.2 NiFi as a Producer
NiFi as a producer will take a FlowFile as input and forward it to a topic at a Kafka broker.
The main way to have NiFi as a Kafka producer is by using the PublishKafka processor.
PublishKafka takes the contents of a FlowFile and puts it into a Kafka topic using the
KafkaProducer API. The contents of the FlowFile are converted to a Kafka message. The
PublishKafka processor includes an optional Message Demarcator. The demarcator is
used to decide where the separation between messages should be, and in this way it can
decide if the contents of a FlowFile should be sent as several messages or as one large
message. One large message will be the default approach if the Message Demarcator is not
set. PublishKafka will distribute the messages to a Kafka topic following the Round Robin
principle between partitions, depending on the number of partitions. If some messages for
a given FlowFile fail to send but some are successfully sent, the entire FlowFile will be
considered a failed FlowFile. The messages that fail to send will have a separate attribute
set which index the last message that was successfully ACKed by Kafka. This attribute
will allow PublishKafka to re-send the messages that have not been ACKed by Kafka.
[33, 35]
It is also possible to implement the PublishKafka processor in MiNiFi. This can be
used to get data more directly from the source to Kafka, instead of having it go through
the NiFi center.
Lastly, MiNiFi and NiFi can be combined, where MiNiFi delivers the data from the
source, and NiFi then uses the PublishKafka processor to publish messages to a Kafka
topic. [33]
15
2.5.3 NiFi as a Consumer
NiFi can also act as a Kafka consumer. In this case, the NiFi process ConsumeKafka replaces
the Kafka consumer and will handle all the data from the chosen Kafka topic(s) and deliver
it to where it needs to go. No code needs to be written to implement this, you drag and
drop the consumer process ConsumeKafka into the NiFi workflow. When Kafka sends a
message to NiFi, the ConsumeKafka process emits a FlowFile where the content of the
FlowFile is the content of the Kafka message. The ConsumeKafka processor also includes
an optional Message Demarcator. Unlike in the PublishKafka case, the demarcator in
the ConsumeKafka processor indicates that all of the messages that are received in a single
poll should be produced as one FlowFile, with the demarcator separating the messages
from each other. If the demarcator property is left blank, ConsumeKafka will produce one
FlowFile for each message received. [33, 36]
2.6 Related Tools
There are several different tools that fill similar functions as Apache NiFi, other than
Apache Kafka. This section will describe some of these tools.
2.6.1 Apache Airflow
Apache Airflow is a platform for creating and managing workflows through Python scripts.
Airflow displays the workflows as Directed Acyclic Graphs (DAGs) of tasks, which can
be easily modified through command line utilities. Since Airflow uses Python as a pro-
gramming language, libraries and classes can easily be imported for easier creation and
managing of workflows. Airflow also has a web UI which provides insight into the logs and
status of tasks in the workflow. Airflow has many built-in integrations with tools such as
Apache Hive, Spark, HDFS, MySQL, etc. Apache Kafka is not among these integrations,
however there are still ways to connect the two platforms, with a bit of work. Overall, it
16
is possible to do a lot of things with Airflow, but the user is required to write the code
for their workflows themselves. Airflow is not classified as a data streaming tool, since the
tasks do not send data between each other. [10]
2.6.2 Apache Spark
Apache Spark is a cluster computing system, which provides high-level API:s in several
different programming languages, and supports general execution graphs. Spark provides
many libraries, which allow for work with SQL, machine learning (ML), streaming, and
more. These different libraries can also easily be combined in Spark applications. The
Spark API has an extension called Spark Streaming, which is specially created for scalable,
high-throughput, and fault-tolerant stream processing of data streams. Spark Streaming
can ingest data from Kafka, Twitter, TCP sockets, and more, and can push the data to
file systems such as HDFS, databases and dashboards. Spark Streaming has no UI, and
programs are written in Scala, Java or Python. [11, 37]
2.6.3 Apache Storm
Apache Storm is used for real-time processing of data streams. Storm uses vertices in
the form of ”spouts” and ”bolts”, and edges in the form of ”streams” to visualize DAGs.
Spouts are the sources of the streams, and will generally read from any queuing system, like
Apache Kafka. It is also possible to configure a spout to create its own stream, or read from
something such as a Twitter streaming API. Bolts will then process any number of input
streams and produce output streams. Inside the bolts is where most of the computation
logic happens in Storm. The bolts can communicate with any database system. Storm can
be used with any programming language, and it can reliably process boundless streams
of data. Use cases for Apache Storm includes ETL (Extract, Transform, Load), real-time
analytics, and online ML. [12, 38]
17
2.6.4 Azure Data Factory
Azure Data Factory is a Microsoft service which is designed to integrate different data
sources. It is a platform for managing data in the cloud and on-premises. It is used to
integrate data from storage systems to data-driven workflows for ETL and ELT (Extract,
Load, Transform). Data Factory is mainly used to load the data to Microsoft’s own Azure
SQL databases. Data Factory does not have any support for writing code to run. It also
lacks the ability to add processes and is very limited to the available tools in Data Factory.
Unlike the other tools presented, no free version of Data Factory is available. The price
of Data Factory depends on how many pipelines are orchestrated and executed, data flow
executions and debugging, and the number of Data Factory operations that are being used.
[13, 39]
2.6.5 Logstash
Logstash is an open-source data processing pipeline from Elastic. Logstash ingests data
from a source (like Kafka, a file, or GitHub, to mention a few), transforms it in the
way you want, and then transports the data (or events) to a ”stash”. There are many
different stashes, for examples sending the events to a TCP socket, publishing the data
to a websocket, writing events to a Kafka topic, or writing the events to a file on disk.
Logstash can be used as a pipelining tool but it is mainly used for storing and managing
logs and events. Logstash is run through the terminal and configurations are written in
bash. It is categorized as a log management tool, whereas NiFi is categorized as a data
streaming tool [14, 40].
18
3 Data Pipelining Architecture and Prototype
3.1 Introduction
This chapter will describe how we implemented a prototype for data pipelining in Apache
NiFi. In part 3.2, a setup which does not use NiFi is described. Part 3.3 goes over the
advantages with using NiFi in the pipeline, and part 3.4 shows the experimental setups
which will be implemented in NiFi. Lastly, part 3.5 shows what our data pipeline looks
like in NiFi, and describes the different processors used to build up the pipeline.
3.2 Current Setup
The goal for this project is to evaluate a new setup for data pipelining that can be used
in an Industrial IoT context. In this part, the setup that is currently used at the task
provider, Uddeholm AB, is described. This is a setup which only uses Kafka as a tool to
transfer raw data from the source to various data sinks. This setup will from here on be
referred to as the ”old setup”.
Data collected by sensors in the industrial machines is transferred using either only
PLC4x or a combination of a traditional PLC and the data acquisition software iba (more
specifically the tool ibaDatCoordinator). This data is sent to the internal Kafka cluster,
which is containerized via Docker and orchestrated with Kubernetes. There are three Kafka
consumers: a web server, a cloud service (AWS), and the production system. The SCADA
system Ignition is used mainly for visualization. A model of the setup can be seen in figure
3.1.
19
Figure 3.1: The current setup at Uddeholm AB
PLC - A PLC, or programmable logic controller, is an industrial computer which has
been adapted in order to be used for automation, for example in an assembly line or
in robotic devices. [41]
PLC4x - PLC4x is an open-source tool from Apache. It is a universal protocol adapter
for industrial IoT. PLC4x has many built-in integration possibilities, of which Apache
NiFi and Apache Kafka are two examples. [42]
iba - iba is a system for acquiring and analysing process data. The tool ibaDatCoordinator
is used for automatic processing and managing of measurements. ibaDatCoordina-
tor provides an automatic generation of fault and quality reports, integrated status
monitoring, and it notifies when set thresholds have been met. [43, 44]
Docker - Docker is a product for containerizing apps. Via Docker, it is possible to down-
load software in containers, and these are hosted on the Docker Engine. The Docker
Engine supports all different types of applications, and it works on several different
operating systems. [45, 46]
20
Kubernetes - Kubernetes is an open-source system that can be used to group containers
into logical units to simplify management. Kubernetes can help manage the contain-
ers that run applications, and make sure that downtime is minimized. [47]
Ignition - Ignition is a platform for building and deploying industrial applications. The
Ignition software can act as a hub for all systems on the plant floor, which allows for
complete system integration. With Ignition, data from databases can be illustrated
in the form of graphs and tables. [48]
3.3 Why Bring in NiFi?
The main difference between the old setup and the proposed new setup (described in part
3.4) is that NiFi is brought in as the main data flow manager, partly replacing Kafka. In
this part, these two tools will be compared.
Both NiFi and Kafka are able to move data from one node to another, however NiFi
is classified as a data pipelining tool, whereas Kafka is classified as a distributed streaming
platform.
When comparing NiFi and Kafka one of the most obvious differences is that NiFi uses
a web-based GUI without the need for the user to write any code if they do not want
to. Kafka does not have a GUI and the user will need to write their own code to set up
consumers, producers, and topics. This makes NiFi much more user-friendly, and users
who do not have a lot of programming experience can find it much more accessible than
Kafka. Thanks to NiFi’s GUI, it is also easy to oversee the workflow, something that is
harder to do in a similar way in Kafka.
Another difference is that Kafka is made for smaller messages and NiFi is made for
larger messages and streams. Performance wise it is faster for NiFi to create a single
FlowFile out of one million messages and sending that, instead of sending one million
21
one-message FlowFiles. [33]
NiFi and Kafka are quite equal when it comes to the level of security they can provide.
As described in section 2.4.3, NiFi has several ways to provide secure connections when
sending data through the pipelines. One of these ways is to use two-way SSL authentication.
Kafka can also provide security in several ways, since the 0.9.0.0 update. These security
measures include securing connections to brokers through SSL, SASL, or SASL/PLAIN
authentication; encryption of data being sent between brokers and clients, tools, or other
brokers through SSL; and authorization of read/write operations done by clients. The
Kafka cluster can be protected by making so that users have to use SSH to connect to the
cluster with a password. These security measures are all optional, which also is the case
with NiFi. [49] Another way to provide security in Kafka, and the way it is done in the
old setup, is by using an Avro schema. The Avro schema can minimize and compress the
data, and the data can then only be read by someone who has the schema. If the schema
is placed at a different endpoint than the data, this provides encryption of the data. [50]
In Kafka, all nodes contain metadata about which servers are currently alive and where
the partition leaders of a topic are. They can share this metadata with a producer as an
answer to a request, and the producer in turn can send this data directly to the broker that
is the leader for the partition. [51] NiFi stores all the FlowFile metadata in the FlowFile
Repository and all of the provenance event data in the Provenance Repository, as explained
in section 2.4.1. It is possible to get metadata about sent messages in Kafka, but it is a
bit tricky to do and not as easy as in NiFi. In this aspect, NiFi has a clear advantage over
Kafka.
22
3.4 New Pipelining Setups
A new setup was designed as an alternative to the old setup, described in part 3.2. The
thought behind this setup is to insert NiFi between Kafka and the sink for the data, and
also to add a database and a regular file system as data sources. The method for moving
data from machines via PLC’s to Kafka will stay the same for the new setup, however for
the test simulations, fake data will be generated in Kafka. Furthermore, the new design
has only one data sink, instead of the three data sinks as seen in figure 3.1. This data
sink is a distributed file system, more specifically Hadoop Distributed File System, HDFS.
According to Erik Hallin at Uddeholm AB, the advantage with using HDFS is that it is
more scalable than the current data sinks.
An alternative setup was also designed, similar to the new setup but with Kafka inserted
between NiFi and the distributed file system. However, the main focus of the testing in
this project will be on the new setup, not the alternative new setup.
3.4.1 New Setup
The new setup has NiFi alone for pipelining data from the data sources to the HDFS
endpoint, as seen in figure 3.2. NiFi will act as a Kafka consumer for a Kafka topic
and forward Kafka messages along with data from a database and files from a remote
file system to HDFS. The Kafka producer will publish messages to a topic by use of the
kafka-producer-perf-test.sh file, which generates messages to a specified topic, with
a specified number of records (messages), record size, throughput, and using a producer
configuration file. This is used to simulate data coming in from PLCs and being pub-
lished to a Kafka topic. To make sure that NiFi can find the Kafka broker which is on
a different machine, the variable advertised.listeners in the NiFi configuration file
server.properties is set to PLAINTEXT://[kafka machine ip]:9092. Other than this,
there is no need to configure any files in Kafka in any special way, the default values work
for the purposes of this experiment.
23
Figure 3.2: New setup with NiFi sending data directly to HDFS
For Hadoop, the file core-site.xml is edited to add a configuration of the property
fs.defaultFS to be hdfs://[hdfs machine ip]:9000, and the file hdfs-site.xml is
edited to configure the property dfs.replication to be 1 and the property
dfs.permissions.enabled to be false. The latter is done to turn off permission checks,
which enables NiFi to connect to HDFS. The files hdfs-site.xml and core-site.xml are
copied and added to the machine which hosts NiFi, since these are needed to connect NiFi
and HDFS.
3.4.2 Alternative New Setup
The alternative new setup is meant to test how it will work to send the data from NiFi
to a second Kafka topic before sending it to the endpoint in HDFS, as seen in figure 3.3.
This could be interesting if one already has a connection setup between Kafka and Hadoop.
Kafka and HDFS are configured as in the main new setup, and the NiFi flow is built up
the same way, apart from the final step where the data is sent to a Kafka topic instead of
a HDFS directory.
24
Figure 3.3: Alternative new setup with NiFi sending data to HDFS through Kafka
To send the data from Kafka to HDFS, a Python script is used, which uses the kafka-
python library to read messages from a topic. [52] The script then adds the contents of
each Kafka message along with a timestamp of the current time (expressed in the Unix
time stamp format) for each message to a local file. This local file then gets uploaded to
HDFS. This method of sending one larger file instead of many small files is used in order to
work around the long time (around 2 seconds) it takes to connect to HDFS for the sending
of each file. The Python script can be found in Appendix A.1.
25
3.5 NiFi Processors Used
The NiFi processors that are used to create the new setups are described here. For some
of the tests the NiFi flow looks slightly different, all but one of the processors (Pub-
lishKafka 2 0) used for all of tests can still be seen in figure 3.4 (full-size picture can be
found in Appendix A.5). The queue sizes for the connections between the processors were
increased to be able to hold 500 000 FlowFiles to try to make sure that the connections
would not influence the results. If a queue fills up, back pressure is applied to the previous
processor, which slows down the flow.
Figure 3.4: The processors used in NiFi for the new setup. In the green box are processors
for consuming from Kafka topic, the yellow box is for getting files from the remote file
system, the red box is for pulling rows from the database. The blue box is for clearing out
HDFS, and the pink box is for putting files to the remote file system.
26
3.5.1 Consuming from Kafka Topic
ConsumeKafka 2 0 is used to consume messages from a Kafka topic. In this processor’s
properties, the IP address and port of the Kafka broker is set, along with the name
of the topic to consume data from. Using specifically ConsumeKafka 2 0 is necessary
for compatibility with the Kafka system, since it is of version 2.x. For earlier versions
of Kafka, there are other versions of the ConsumeKafka processor.
MergeContent is used to be able to merge several FlowFiles together for different setups
and configurations of the data flow. This processor is used for all three data sources
to test merging of FlowFiles. The properties Minimum Number of Entries and Max-
imum Number of Entries are both set to 1 when no merging is to happen, along with
a Minimum Group Size of 0 bytes and no Maximum Group Size set. These values
are changed to allow for merging when this is tested. The Maximum and Minimum
Group Sizes were mainly used to set size intervals for the merged FlowFiles.
PutHDFS is used for all three data sources. This processor needs the directories to the
files core-site.xml and hdfs-site.xml, which have been copied to the machine on
which NiFi is hosted. PutHDFS also needs the directory in HDFS where the files
should be put.
3.5.2 Getting Files from File System
GetSFTP is the processor which gets files from a remote file system hosted on a different
machine than NiFi is. This processor uses SFTP (SSH File Transfer Protocol) to
connect to the remote file system. To do this, the processor needs the IP address of
the remote node, the SSH port (22), and a username with a corresponding SSH key
to be able to access the file system, which is in a defined directory. The run schedule
of this processor is set to 0.01 seconds, meaning that GetSFTP will pull files from
the file system 100 times per second. The property Max Selects is set to 5 000, which
27
means that this is the maximum number of files to be pulled in a single connection.
MergeContent is configured as described in 3.5.1
PutHDFS is configured as described in 3.5.1
3.5.3 Getting Data from MariaDB
When gathering data from MariaDB to HDFS, NiFi needs to use more processors than for
the other data sources, as seen in figure 3.4. This is because SQL data needs to be handled
in another way than the other data sources. The first processor, QueryDatabaseTable,
gathers the data from a chosen SQL table and that data is converted to Avro-format
FlowFiles. To be able to process these Avro FlowFiles, NiFi needs to convert them to
JSON format FlowFiles, hence the need to use more processors. The files from MariaDB
are the files used to measure latency in the tests, and in order to do this it is also necessary
to have two ReplaceText processors in this flow.
The processors (and controller service) which are used are:
QueryDatabaseTable is used to extract incremental data based on a column from the
SQL table where the data is being gathered from. The processor gathers rows from
the table, and puts each row in a separate FlowFile. The processor needs to be
configured to know which database it should gather data from, what kind of database
and which table. This is configured in the properties of the processor. Furthermore,
the DBCPConnectionPool controller service needs to be configured for the processor
to be able to work and access the database. This controller service is described below.
The output of the QueryDatabaseTable processor is Avro format FlowFiles.
DBCPConnectionPool is the controller service that is used to obtain a connection to a
specified database. It uses a separate MariaDB Connector, which is a driver which
is used to connect Java applications with MariaDB. The controller service needs
28
this driver and the controller service needs to be configured with the database con-
nection URL, which contains the IP address along with the port and the name of
the database, the class driver name from the MariaDB Connector, location of the
MariaDB Connector along with the login credentials for the database.
ConvertAvroToJSON is used to convert the Avro format FlowFiles into JSON format
FlowFiles. The output JSON FlowFile is encoded with UTF-8 encoding. The JSON
container options property is set to none.
MergeContent is configured as described in 3.5.1.
ReplaceText is used to add a timestamp to the FlowFiles. The timestamp is in the
Unix time stamp format, expressed in milliseconds elapsed since midnight on 1st of
January, 1970. This timestamp is appended to the end of the FlowFile, and is used
to check the latency when sending data through the pipeline.
PutHDFS is configured as described in 3.5.1.
ReplaceText is used a second time to add another timestamp in the FlowFiles. This
timestamp represents the approximate time that a FlowFile gets sent to HDFS, since
the FlowFiles it gets as input are the successful files from the PutHDFS processor.
PutHDFS is used a second time to store the FlowFiles after the last timestamp is added.
These FlowFiles get sent to a different HDFS directory than the other FlowFiles,
for easier extraction of timestamps. Other than this, the processor is configured the
same way as the other PutHDFS processors.
3.5.4 Other Processors
GenerateFlowFile is a processor which generates FlowFiles which are to be put to the
remote file system. The size of these files are set to 1 kB. For the purposes of this
29
experiment, there is no need for unique FlowFiles or a custom text as the content of
the FlowFiles.
PutSFTP takes the FlowFiles from GenerateFlowFile and puts them to the remote file
system data source. Like with the GetSFTP processor, the IP address of the remote
node, the SSH port, and a username with a SSH key are needed to connect to the file
system, along with the path in the file system where the files should be put. These
are the files that will be fetched in GetSFTP.
GetHDFS is a processor which takes files from HDFS and puts them into the NiFi flow.
In our flow however, this processor is only used to empty the contents of HDFS
between testing. To make this happen, the property ”Keep Source File” is set to
false, and ”Ignore Dotted Files” is set to true, to make sure all files get removed.
The process is also set to automatically terminate all successfully retrieved files, since
they are not needed in the NiFi flow. As in PutHDFS, the paths to core-site.xml
and hdfs-site.xml are specified, as well as the directory which needs to be emptied.
For the data flow which uses Kafka as an intermediary between NiFi and HDFS (for the
alternative setup), the processor PublishKafka 2 0 is used in the places where PutHDFS
is used in figure 3.4.
PublishKafka 2 0 takes FlowFiles and turns them into messages which are published
to a Kafka topic. As with the processor used to consume from a Kafka topic, Pub-
lishKafka 2 0 needs the IP address and port of the Kafka broker, along with the name
of the topic to which the messages should be published. Since Kafka version 2.x is
used, it is necessary to use the 2.0 version of the PublishKafka processor, and not
one of the earlier versions.
30
4 Experimental Setup
4.1 Introduction
In the experimentation of this project, a data flow will be implemented in Apache NiFi
and its performance will be measured in a number different tests. The data flow will be
implemented in a new and alternative new setup, as described in part 3.4, and tested with
the purpose of evaluating the setups in terms of latency and throughput. We will use fake
data for the purpose of evaluating the performance of the NiFi setup, and the contents
of the data is not important, as long as the payload is fixed. NiFi will have the same
three sources for data in all the tests: an SQL database, a file system, and a Kafka topic.
There will be three different tests. For the first two tests, the data will be delivered to the
distributed file system data sink directly from NiFi by using the new setup (as described in
part 3.4.1). In the third test the new setup will be compared to an alternative new setup,
in which data will be sent from NiFi to the distributed file system via a Kafka topic (as
described in part 3.4.2). These tests are done to answer the following questions:
What is the FlowFile size, accomplished by merging together smaller FlowFiles, that
gives the lowest latency for the new setup?
Is it better to combine FlowFiles from different sources before sending them to the
distributed file system, or is it better to keep them separate?
Is there a difference in latency and throughput between the new NiFi setup and the
alternative setup, which sends data from NiFi to a Kafka topic before the distributed
file system?
The hypotheses we have based on these questions are:
The best FlowFile size accomplished by merging in our new setup will be somewhere
between doing no merging at all (resulting in 1 kB files) and merging the FlowFiles
31
together to be 128 MB, which is the default block size for HDFS. We expect the
value to be closer to 128 MB than 1 kB, since we believe the majority of the latency
will come from waiting to be moved to HDFS.
There will not be a big difference between doing separate and combined merging, the
main difference will be that the output files look different as they contain data from
different sources and not just one.
There will either not be any major difference in latency and throughput between the
two setups, or the setup which uses Kafka before sending data to the data sink will
be slightly slower, since there are more steps for the data to go through.
Part 4.2 describes the specific software used for the database and the distributed file system.
Part 4.3 describes the three different compute nodes that are used to make up the data
flow, and how these were configured. Lastly, part 4.4 explains the setup of the different
tests that are performed, along with the metrics of the tests and how these are measured.
4.2 Additional Software Used
In addition to Apache Kafka and Apache NiFi, some other software is also used to build
the setups. There is a need for a cloud-based file system as a data sink. For this, Hadoop
Distributed File System (HDFS) was used. As an SQL database data source, MariaDB
was used.
4.2.1 Apache Hadoop and HDFS
Apache Hadoop is a software library project which contains several different modules for
distributed processing of large sets of data. The Hadoop modules include, among others,
Hadoop YARN for scheduling of jobs and cluster resource management, and Hadoop Dis-
tributed File System. Hadoop Distributed File System is a distributed file system that is
designed to store a smaller amount of very large files (rather than many small files). HDFS
32
runs on a cluster of commodity hardware, and is fault-tolerant through replication of data.
HDFS uses a master-slave architecture, where a system has one NameNode as the master,
and several DataNodes acting as slaves. Usually, there is one DataNode per participant in
a cluster. The DataNodes perform read and write operations on the file system. [53, 54]
HDFS is designed to support very large files, and its performance will decrease if needs
to manage many smaller files. The default block size of HDFS as of version 2.0 is 128
MB. If a file is bigger than this, it will be split into 128 MB chunks in HDFS. If a file
is smaller than the block size, it will still take up a full block, which means that many
small files can take up much more space in memory than one large file, even through their
initial total sizes were the same. This default value can be changed in the configuration
file hdfs-site.xml.1[55]
4.2.2 MariaDB
MariaDB is an open-source SQL database initially released in 2009, based on a fork of
the MySQL database system. MariaDB is included in several Linux distributions, includ-
ing Ubuntu, Debian, and Fedora. MariaDB includes features like encryption, authoriza-
tion, authentication, and logging, to provide security. It can also provide high availability
through replication and clustering. [56, 57]
To set up MariaDB for the tests, a database with one table was created. Each row in
the table contained a unique id attribute, a timestamp of the time the row was added, and
30 more attributes filled with arbitrary data. By doing so, each row contains roughly 1 000
characters, which is equivalent to approximately 1 kB of data. An SQL script (available in
Appendix A.2) was used to add rows to the database table at the same time as NiFi pulls
rows from the table, which was done to have the timestamps close to the time when the
rows are pulled into NiFi.
1This was unfortunately noticed by us very late in the project, when we did not have time to perform
further testing where the block size was changed.
33
4.3 Compute Nodes
During the experiment, three different compute nodes are used to host the different software
and systems. Each node is implemented through a Virtual Machine (VM). A visualization
of the nodes can be seen in figure 4.1. Download links to the software used can be found
in Appendix A.3.
Figure 4.1: The three compute nodes used for the experiment.
4.3.1 Node 1
The first node is a VM which is accessed through SSH. The VM runs with Ubuntu version
18.4 as its operating system (OS), 8 GB of RAM, and 4 CPU kernels with 2.5GHz clock
frequency. Java Development Kit (JDK) version 11.0.6 is downloaded on the VM for Kafka
to be able to run. This node contains three different data sources that send data to NiFi.
These data sources are: a standard file system, a MariaDB SQL database version 10.4.12,
and a Kafka broker and producer sending messages to a topic. The Kafka version that is
used is 2.12-2.2.0.
34
4.3.2 Node 2
The second node is the same kind of VM as node 1, with Ubuntu 18.04 and JDK version
1.8.0. This node hosts NiFi version 1.11.4 in a terminal, however to access NiFi’s GUI it
is necessary to use a web browser, which is not available in the VM. The GUI is accessed
through visiting [node 1 ip]:8080/nifi/ in a web browser of choice. NiFi gets data from
the sources in node 1, and sends data to the sinks in node 3.
4.3.3 Node 3
The third node is the same kind of VM as on node 1 and 2, with the same versions of Ubuntu
as node 1. The third node contains a Kafka version 2.12-2.2.0 broker and consumer, and
Hadoop version 3.2.1, which contains a HDFS module. Due to requirements from Hadoop,
the JDK version for node 3 is 1.8.0. Hadoop is set up in a pseudo-distributed mode,
meaning that it runs on a single node but each Hadoop daemon runs in a separate process,
imitating a fully distributed mode. [58] Kafka is only used on this node for the alternative
new setup used in the third test.
4.4 Experiment Description
For the evaluation of the new data pipeline setups in NiFi, three different tests were
performed to compare and evaluate how these setups in NiFi performs in terms of latency
and throughput under different configurations.
4.4.1 Performance Metrics
The metrics that are measured and compared in the evaluation are latency and throughput.
The latency measured in the experiments is defined as the difference in time between
when a chunk of data is sent from its source and when the data arrives at its destination.
The latency is measured by comparing timestamps which are stored in the files that are
35
sent from MariaDB. As a row gets added to the table in MariaDB, it gets a timestamp of
the current time as one of its attributes. When the row gets turned into a FlowFile in NiFi,
the timestamp becomes part of the contents of the FlowFile. Right before the MariaDB
FlowFiles get sent to HDFS or when a FlowFile arrives at Kafka, a second timestamp is
added to the contents of the FlowFile. For the third test, these timestamps are compared
to calculate latency between creation and sending to HDFS. For the first and second test,
a third timestamp is added after the FlowFiles have been sent to HDFS, to get a more
accurate time of when the files actually get to HDFS, and this is the timestamp used to
calculate the total latency for the first and second tests. For these tests, the difference
between the first and second timestamps is also used to see how the latency is distributed
over the data flow, i.e. how much time is spent between row creation in MariaDB and
before being sent to HDFS, compared to the total latency. In all the tests, a sample of
50 random but evenly distributed FlowFiles are collected and used to calculate an average
latency and show how the latency is distributed. The latency results are representative
of the worst-case values for all of the measurements where FlowFiles are merged together.
This means that the timestamp that is logged as the ”starting time” is the first timestamp
in a file. This represents the creation time of the FlowFile that had to wait in the merging
queue the longest, the one that arrived at the queue first. It was deemed more interesting
to look at these times than the time for the FlowFile that arrived last and had to wait the
shortest time in the queue. For the case when no merging is done, each file only contains
one ”starting time”, so this is the time that is used as the first timestamp in this case. All
three VM’s used were verified to be synchronized with regards to time.
The throughput is measured in bytes received in HDFS per second (bytes/second),
which is calculated by dividing the number of received bytes by the amount of seconds
NiFi was running. Calculating how long NiFi was running is done through subtracting the
first timestamp of the first file to get to HDFS from the final timestamp of the last file to
get to HDFS.
36
4.4.2 Test Descriptions
For the evaluation of the data pipeline setups in NiFi, three tests will be performed.
The purpose of the first test is to find the merging size of FlowFiles which results in
the lowest latency in our new setup in NiFi. Files get pulled from the sources (file system,
Kafka topic, and MariaDB table), and are then separately merged with the MergeContent
processor, with different merging configurations. The merged sizes being compared for this
test are 10 kB, 50 kB, 100 kB, 1 000 kB and 2 000 kB. For the MariaDB and file system
sources, each file is initially 1 kB large which means that 10 files are needed to get a 10 kB
FlowFile, 50 files for a 50 kB FlowFile, etc. The default size for Kafka records is set to be
100 B, since this is the approximate size of the Kafka records in the old setup. This means
that 10 times more Kafka files need to be merged together than for the other sources.
Additionally, one measurement is done when there is no merging done to the FlowFiles.
Since not all files are exactly 1 kB (or 100 B for the Kafka records) large, the MergeContent
processor’s properties are configured to accept an interval of output FlowFile sizes. If this
is not done, FlowFiles can get trapped in the queue before the MergeContent processor,
since it might be impossible to create an output FlowFile of the exact required size. The
intervals used are presented in table 4.1. These intervals were chosen to make sure that
most of the FlowFiles in the data flow could pass through the merge content processor
without having to spend unnecessary time in a queue.
Table 4.1: Intervals for achieving different merging sizes
Size (kB) Interval (kB)
10 8-12
50 40-60
100 80-110
1 000 900-1 100
2 000 1 900-2 100
The NiFi flow for this test can be seen in figure 3.4 (full-size version in Appendix A.5).
37
Figure 4.2: NiFi data flow for the second test.
The second test also consists of merging several small FlowFiles into fewer and bigger
FlowFiles. In this test, data from the three sources (MariaDB, Kafka topic, and file system),
are combined and merged together in the same MergeContent processor, see figure 4.2 (a
full-size version of the figure can be found in Appendix A.5). The purpose of this test is
to compare the latency from this test to the one measured in the first test, in which the
sources are merged separately. For this test the size of the combined merged FlowFiles are
10 kB, 50 kB and 1 000 MB, with the same intervals of output FlowFile size set in the
MergeContent processor as in the first test.
The third test compares the new NiFi setup (as described in part 3.4.1) with the
alternative new setup (as described in part 3.4.2) , to try to evaluate the differences in
latency and throughput between the two. In this test, the new setup is tested first with
only MariaDB as a source, with each row in the database table being 1 kB large. Kafka,
which has now been changed to have the record size 1 kB, is then added as another source
of data packets. Lastly, the file system is added as a third 1 kB data source. In each
part of the test, the FlowFiles are separately merged together in 10 kB batches with the
same interval as the first test, to minimize queue times to send data to HDFS. The same
38
Figure 4.3: NiFi data flow for alternative new setup used for the third test.
procedure is done with the alternative new setup, see NiFi flow in figure 4.3 (full-size version
of the figure can be found in Appendix A.5). This flow has a PublishKafka 2 0 processor
instead of the PutHDFS processor in the new setup, and the ReplaceText processors have
also been removed. Instead of setting a timestamp in the FlowFiles as they leave NiFi to be
sent to HDFS (which is what happens in the new setup), a second timestamp is added in
node 3, after the files have gone through a Kafka topic and before they get sent to HDFS.
The latency that is measured in the third test is the time from row creation in MariaDB
to when the file is ready to be sent to HDFS. This is because there was some trouble
setting up a good and fast connection between Kafka and HDFS, and we did not want our
bad configuration to affect the results. In this test, the throughput is also measured and
compared.
39
5 Results & Evaluation
5.1 Introduction
This section goes through the results of the experiments which were performed. The raw
data that was used to make the graphs and plots is available in Appendix A.4. Section 5.2
presents and evaluates the results from test 1, section 5.3 goes over the results from test
2, and finally section 5.4 describes the results from test 3.
5.2 Results of Test 1
Figure 5.1: Average FlowFile latency for different merging sizes.
As seen in figure 5.1, there is a vast difference in the latency between doing no merging
(which is expressed as merging 1 kB in the graph) and doing any kind of merging. The
average latency for sending files from NiFi to HDFS when merging is done lies between 45
and 112 milliseconds. In comparison, in the case where the FlowFiles are not merged the
files need to queue for an average of 106 331 milliseconds, almost 2 minutes, to get through
the PutHDFS processor. This large difference between doing no merging and merging to
40
10 kB FlowFiles was not expected. As can be seen in figure 5.1, the PutHDFS processor
works much better when putting 10 kB files to HDFS than 1 kB files, more than 10 times
better. The reason for this is not fully clear, but a speculation is that it is a sign of HDFS’s
preference for larger files over small files.
It is only for the first measuring point that the latency between NiFi and HDFS (the
difference between the red and blue line in figure 5.1) is significant. For all merging
measuring points the main part of the latency happens between the starting point in
MariaDB and the point in NiFi before the files are sent to HDFS, the actual sending of
files to HDFS takes relatively little time. This latency distribution can be seen more clearly
in figure 5.2. This figure shows the percentage of the total latency which is spent between
the row creation in MariaDB and before sending files to HDFS. Figure 5.2 shows more
clearly what can be seen in the difference between the two lines in figure 5.1, which is
that it is only for the measuring point for no merging that the majority of the latency for
a FlowFile is spent waiting to be sent to HDFS. Again, this is because the files in that
case have to wait for almost 2 minutes on average to get through the PutHDFS processor.
However, just because the percentage of time spent between MariaDB and the PutHDFS
queue is lower for no merging than merging, it does not mean that less time is spent at
this stage. As can be seen in figure 5.1, the actual time spent getting to the PutHDFS
processor is larger for 1 kB file size than 10 kB. This is likely a symptom of NiFi having
to handle more files when the files are not merged together.
Since the results for not merging FlowFiles are significantly different from, and so much
worse than, the results from merging the files, these results will be disregarded from now
on. The focus will instead be on the differences between different sizes of merging, starting
on merging together 10 kB of FlowFiles. Based on the latency distribution seen in figure
5.2 and doing similar calculations for the other tests, in the rest of the graphs only the
total latency will be displayed. The majority of the total latency happens between row
creation in MariaDB and getting to the queue before the first PutHDFS processor in NiFi.
41
Figure 5.2: Percentage of the total average FlowFile latency made up by the time between
MariaDB and NiFi, before being sent to HDFS.
Figure 5.3: Average FlowFile la-
tency for different merging sizes.
Figure 5.4: Latency distribution for
different merging sizes.
Figure 5.4 shows the spread of values that created the average total latency (from row
creation in MariaDB to FlowFile having been sent to HDFS) displayed in figure 5.3. The
values showed in these graphs are based on the same data as in figure 5.1, but without the
value for no merging. These graphs show how the latency for the files increase as more
42
kilobytes are merged together before sending the files to HDFS. This is because FlowFiles
need to wait in a queue before the MergeContent processor until enough FlowFiles have
arrived to be merged together. The larger the merged files are, the longer on average
the files will have to wait in the queue before being merged together, as more FlowFiles
are needed to get to the merging size. So while it is faster to send few and large files
to HDFS, it is slower within the NiFi flow to have to merge together large numbers of
FlowFiles. There is also a memory issue in NiFi when trying to merge together very many
files. During the testing, an attempt was made to merge together 10 000 kB of FlowFiles,
but this made NiFi crash since too much data needed to be held in the memory at the same
time as the queues filled up. The FlowFile, Content, and Provenance Repositories and the
log filled up with large amounts of data, and these needed to be emptied manually before
it was possible to start NiFi up again. This lead to the conclusion that while it is better
for HDFS to get very large files (preferably up to 128 MB when this is set as the block
size), it is not feasible to merge together as many 1 kB FlowFiles in NiFi as are needed to
get to these sizes.
Based on the results of this test, 10 kB was chosen as the ”default” merging value for
the other tests, since it had the lowest average latency of the tested merging sizes. By
doing this instead of having the default be to do no merging at all, the very high latency
between NiFi and HDFS that happens when no merging is done (as shown in figure 5.1)
can be avoided.
43
5.3 Results of Test 2
Figure 5.5: Average FlowFile la-
tency for different merging sizes,
comparing separate and combined
merging.
Figure 5.6: Latency distribution for
combined and separate merging.
The results of the second test are shown in figures 5.5 and 5.6. Figure 5.5 shows the
difference in average latency for separate and combined merging and how the average
latency increases as the merge size of FlowFile increases.
The difference in average latency between the separate and combined merging for a
10 kB merging size is about 1000 ms, where it is faster to do separate merging. For the
other measuring points it is faster to do combined merging than separate merging. When
the merging size is 1 000 kB, the difference in latency between the merging methods is
about 5 000 ms, with combined merging being faster. This shows that it is faster to use
combined merging than separate merging for larger FlowFile sizes, and for smaller merging
sizes there is less of a difference.
The deviating value for combined merging at 10 kB can be somewhat explained by
looking at the distribution of latency in figure 5.6. This figure shows the distribution of
latency for both combined and separate merging. This shows that the spread of values for
combined merging is much greater for 10 kB compared to separate merging. Most of the
44
higher latency values happened in the first 30 seconds of the test. The reason for this is not
clear, but it is plausible to think that in a real life scenario where the NiFi flow runs over
large periods of time, the latency for combined merging would stabilize at a value close to
the one that is observed with separate merging. The distributions for the other measuring
points are less spread out, which means that it is easier and safer to draw conclusions based
on these values. However, only for the 1000 kB values is there a significant difference in
latency between separate and combined merging (and combined merging is significantly
faster in this case). The results of the tests for 50 kB and 10 kB have overlapping intervals,
which means that the differences are not statistically significant.
The reason that combined merging is faster than separate merging in the case where
the results are significantly different is that for combined merging, the queue before the
MergeContent processor is shared between the three data sources. This means that the
queue fills up to the merging size faster, and FlowFiles do not have to wait for as long to
be merged together. When setting the merging size to be large (1 000 kB in this test),
the queue waiting time for a FlowFile is relatively long when sources are being merged
separately. If the sources are merged together however, this queue time can be decreased.
This result also shows what could be seen in the first test, that the main part of the latency
happens when waiting to be merged together, and that when this time can be decreased,
the latency gets significantly lower.
45
5.4 Results of Test 3
Figure 5.7: Average throughput for the two different sources with different amounts of 1
kB sources.
Figure 5.7 shows the throughput results for the two different setups used in the third test.
The throughput is roughly the same for the two setups when it comes to using one and
two sources, where the difference is less than 1 kB per second for both of these measuring
points. The only major difference is when all three sources are used, where the alternative
setup (using Kafka before sending data to HDFS) has more than 100 kB per second higher
throughput.
However, the reason for this big difference for three sources is not known. Because of
how similar the values for the two setups are for one and two data sources, it is reason-
able to think that the throughput for three sources (when the file system is added as a
source) should be around 300 kB/s for both setups. When the third test was first per-
formed, the throughput for the alternative setup was around 700 kB/s when using the same
processors and file systems configurations as when testing the new setup (apart from the
PublishKafka 2 0 processor). Some tweaks were made to the content of the files getting
pulled from the file system, since some special characters seemed to be taking up more
46
space than 1 byte when being sent through Kafka. After this, the result shown in figure
5.7 was achieved. Since the expected value of about 320 kB/s could not be obtained for the
alternative setup, the throughput of about 440 kB/s was decided to be close enough. The
reason behind the deviating value is still unclear, it could depend on the implementation
of the alternative new setup, or some other impact or involvement that we do not know
about. Since this value is so deviating from the rest of the values in the test, and for no
clear reason, it will not be regarded as reliable to draw conclusions on.
Figure 5.8: Average FlowFile la-
tency for the two different setups. Figure 5.9: Latency distribution for
the two different setups.
The average latency for both setups are shown in 5.8. This figure shows the average
latency for each respective setup and how it changes for each source added.
When looking at the average latency, the results seem to increase some for each added
source for both the new and the alternative setup, and the new setup seems to have slightly
lower latency on average.
However, when looking at the latency distribution in figure 5.9, it is clear that the
differences in latency are not significant. This is likely due to the fact that the implemen-
tations of the two setups are very similar, with only one more step through Kafka has to
be done for the alternative setup compared to the new setup.
47
One of the reasons for why there was such a small difference in performance between the
two setups might be that we did not implement a connection between Kafka and HDFS
in the alternative new setup. In this setup, NiFi sends the FlowFiles to a Kafka topic,
where they get gathered to a single file by a Python script (available in Appendix A.1).
The file then gets sent to HDFS through the terminal. If a connection between Kafka and
HDFS had been implemented, the speed of this connection could be tested against the one
between NiFi and HDFS. This needs to be taken into consideration when comparing the
performance of both setups.
5.5 Conclusion of Results
The results that were obtained through the various tests were somewhat as expected in
our hypotheses (presented in section 4.1).
The first test gave some slightly unexpected results. The difference between not merging
FlowFiles in NiFi at all and merging them together into 10 kB FlowFiles was not expected
to be so large, as shown in figure 5.1. Other than this, it was expected that some merging
would be better than doing no merging at all, and that large merging sizes would lead to
long queues before merging in NiFi. The results showed that the lowest latency results
were found when the merging size was between 10 and 100 kB, and this result was then
used to set the default merging size to 10 kB for the following tests.
The results from the second test show that it is beneficial to combine different data
sources before sending the data to HDFS for larger merging sizes. Though it might seem
from the average latency shown in figure 5.5 that it is better to do separate merging for
small merging sizes, figure 5.6 shows that the spread of values for combined merging at a
10 kB merging size is very big, and both for a merging size of 10 kB and 50 kB the results
overlap. This means that there is no significant difference in the latency for 10 kB or 50
kB merging size for this implementation. Since combined merging means that the data
sources share a queue for filling up the merging batches, it is generally preferred to combine
48
sources before merging with regards to latency, and doing this also gave significantly better
results when the merging size was set to 1 000 kB.
The third and final test gave mostly expected results. We were not expecting there
to be any significant difference in the latency or throughput for the two setups being
compared, since they are so similar. The box plot in figure 5.9 shows that there is no
significant difference in the latency for the setups. Looking at figure 5.7 makes it seem like
the alternative setup has a much higher throughput when using three data sources, but
this is thought to be due to implementation issues. This result is not deemed reliable to
draw conclusions on. Instead, the other measuring points lead to the conclusion that the
throughput increases quite linearly for each added 1 kB data source, and the difference
between the setups is negligible. When it comes to simplicity of implementation however,
the new setup wins over the alternative setup. The alternative setup needs one more step
than the new setup, and as been already mentioned, it was hard to set up a good connection
between Kafka and HDFS (something that was much easier to accomplish between NiFi
and HDFS in the new setup).
49
6 Conclusions
6.1 Project Summary and Evaluation
The purpose of this project was to learn about data flows and data pipelines, and more
specifically the tool Apache NiFi. A data pipeline setup was designed in NiFi, which
connected an SQL database, a file system, and an Apache Kafka topic to a distributed
file system. After successfully setting this data flow up, some tests were performed to see
how the system handled different workloads, and when the best results were achieved. The
three tests were to determine what size NiFi FlowFiles should be merged into, if files from
different data sources should be kept separate or merged together, and finally a test to
compare the data flow setup with an alternative setup. The last test has a Kafka topic
as an intermediary between NiFi and the distributed file system. The results showed that
merging FlowFiles together into 10 kB files gave the lowest latency, and merging together
FlowFiles from all sources gives better latency than keeping them separate. Finally, it was
shown that there was no significant difference between the NiFi flow that sent data directly
to the distributed file system, and the flow that first sent data to a Kafka topic.
A lot of time for this project, especially in the beginning phases, went into investigating
and gathering information about Apache NiFi and the other tools used for this project.
This was necessary in order for us to understand how all the tools and systems work, before
we started setting up our implementation. At times this information gathering was quite
hard, since the information available often left much to be desired. Most of the tools only
had a too sparse or very complicated ”documentation” page on the product website, and
maybe a Wikipedia page. Luckily, NiFi itself had several official documentation pages, and
there are also some YouTube videos from official NiFi people with some good information.
The NiFi implementation phase of this project was relatively easy to handle. NiFi
does not require much in terms of configuration to be able to work and handle data flows.
The main problem in this phase was to make sure that NiFi could establish connections
50
to the other nodes and software. This was done through configuring the properties of
each processor in NiFi and was a bit of hurdle, but it was still doable. Our biggest
implementation problem however was to learn how to use and implement Apache Kafka
and HDFS. A large part of our implementation setup went into making sure that Kafka
was working between nodes, and setting up HDFS on one node. By experiencing ourselves
how much easier NiFi is to set up than Kafka is, we definitely prefer NiFi over Kafka. The
main problem for setting up HDFS was to find a way to simulate a distributed file system
on only one node. After some help from our mentor however, this was solved.
The other tools used for the setup, MariaDB and the (non-distributed) file system were
relatively easy to implement and set up. We had experience working with SQL databases
before the project, so setting up the MariaDB database was quite straight-forward. Setting
up a directory for the file system in one of the VM’s was also very straight-forward.
While setting up NiFi and MariaDB on their own was quite simple, setting up the con-
nection between the two took some time. The documentation on how to use the controller
service, which is needed to create this connection, is quite sparse and unclear at times.
The project has shown that it is fully possible to implement a data pipelining setup
with Apache NiFi getting data from several different sources, and sending the data to
a HDFS sink. Even though there have been plenty of bumps along the road, the final
implementation ended up being quite easy to manage and test.
6.2 Future Work
One interesting scenario to test against the two new setups presented in this report would
be a NiFi setup that completely cuts out Kafka from the equation. In this setup, data
from a PLC would be piped directly into NiFi, without going through Kafka first. One
way this could be done is by using the NiFi sub-project MiNiFi. This is a setup that was
suggested by Uddeholm AB, but it had to be cut out due to time.
Another setup that the task provider suggested was to test NiFi as an ETL (Extract,
51
Transform, Load) tool. In this setup, NiFi would extract data from a SQL or NoSQL
database, transform it in some way (for example perform a join operation), and then load
the data back to the table. This was decided to be slightly too out of the scope for the
report, but it could still be interesting to explore.
In the setup which uses Kafka in between NiFi and HDFS, a good connection between
Kafka and HDFS was not found. Due to time limitations it was decided to instead put
data to HDFS manually (by terminal command) in the alternative setup. Setting up a
good, fast connection between these two tools, and comparing the latency between the
data source and HDFS would give a better view of how the two different setups compare.
One thing that would be very interesting to test would be to change the block size in
HDFS. This could be done by for example recreating the first test in this project, but with
various block sizes. We unfortunately did not have time to test this since we did not know
it was a possibility to change the block size until very late into the project.
Another expansion of the project that could be made is to perform the same tests, but
take more measuring points. For example, it would be interesting to see what the latency
in the first test looks like at some points between no merging and 10 kB merging size.
There are many variables in both the setups we tested and in NiFi itself that could be
changed to possibly increase the performance of a data flow. For example, looking more
at the different properties in the nifi.properties file could be interesting.
Further, it would be interesting to set up the same data flow in several different plat-
forms, such as Apache Kafka, Airflow, Storm, etc. Seeing how these different tools compare
in terms of latency, throughput, and simplicity of usage would be very good for someone
who is interested in setting up a data flow, but unsure which tool to use.
Along the same line, an expansion of the project could be to compare the results
achieved from the new setups to the old setup. Since the task provider might change from
the old setup to the new one, it would be of interest to see how these differ in terms of
performance.
52
Lastly, another interesting project would be to explore working with NiFi’s cluster-
ing functionality. We decided to only work with a single node of NiFi, but it would be
interesting to see how using the distributed mode of NiFi would affect the performance.
53
References
[1] M. Gidlund, S. Han, U. Jennehag, E. Sisinni, and A. Saifullah,
“Industrial Internet of Things: Challenges, Opportunities, and Direc-
tions,” IEEE Transactions on Industrial Informatics, vol. 14, no. 11,
pp. 4724–4734, Nov. 2018. [Online]. Available: https://ieeexplore-ieee-
org.bibproxy.kau.se/stamp/stamp.jsp?tp=&arnumber=8401919&tag=1
[2] Louis Columbus, “10 Charts That Will Challenge Your Perspec-
tive Of IoT’s Growth,” 2018, [Accessed: 2020-05-12]. [Online]. Avail-
able: https://www.forbes.com/sites/louiscolumbus/2018/06/06/10-charts-that-will-
challenge-your-perspective-of-iots-growth/#6b67b71a3ecc
[3] Cisco, “Cisco Global Cloud Index: Forecast and Methodology, 2013–2018,”
2014, [Accessed: 2020-04-15]. [Online]. Available: https://www.terena.org/mail-
archives/storage/pdfVVqL9tLHLH.pdf
[4] Various, “Pipeline (computing),” 2020, [Accessed: 2020-03-31]. [Online]. Available:
https://en.wikipedia.org/wiki/Pipeline (computing)
[5] A. at˘acut
,˘a and C. Popa, “Big Data Analytics: Analysis of Features and
Performance of Big Data Ingestion Tools,” Informatica Economic˘a, vol. 22, no. 2,
pp. 25–34, 2018. [Online]. Available: http://revistaie.ase.ro/content/86/03%20-
%20matacuta,%20popa.pdf
[6] J. onnegren and S. Nystr¨om, “Processing data sources with
big data frameworks,” 2016. [Online]. Available: http://www.diva-
portal.org/smash/get/diva2:934359/FULLTEXT01.pdf
[7] Various, “Internet of Things,” 2020, [Accessed: 2020-04-14]. [Online]. Available:
https://en.wikipedia.org/wiki/Internet of things
[8] ——, “Industrial Internet of Things,” 2020, [Accessed: 2020-04-15]. [Online].
Available: https://en.wikipedia.org/wiki/Industrial internet of things
[9] D. A. Patterson and J. L. Hennessy, Computer Organization and Design - The Hard-
ware/Software Interface, 5th ed. Waltham, MA, US: Morgan Kaufmann, 2014.
[10] The Apache Software Foundation, “Apache Airflow,” 2020, [Accessed: 2020-02-18].
[Online]. Available: https://airflow.apache.org/
[11] ——, “Apache Spark,” 2020, [Accessed: 2020-02-18]. [Online]. Available:
https://spark.apache.org/
54
[12] ——, “Apache Storm,” 2020, [Accessed: 2020-02-18]. [Online]. Available:
http://storm.apache.org/index.html
[13] Microsoft, “Data Factory,” 2020, [Accessed: 2020-02-18]. [Online]. Available:
https://azure.microsoft.com/sv-se/services/data-factory/
[14] Elasticsearch B.V., “Logstash Introduction,” 2020, [Accessed: 2020-02-18]. [Online].
Available: https://www.elastic.co/logstash
[15] Evan Parker, “What is a Data Pipeline,” 2019, [Accessed: 2020-05-06]. [Online].
Available: https://www.xplenty.com/blog/what-is-a-data-pipeline/
[16] Amazon Web Services, Inc., “What is Streaming Data?” 2020, [Accessed:
2020-04-23]. [Online]. Available: https://aws.amazon.com/streaming-data/
[17] Java Platform, “Introduction to Batch Processing,” 2017, [Accessed: 2020-04-16].
[Online]. Available: https://javaee.github.io/tutorial/batch-processing001.html
[18] Various, “Apache Kafka,” 2020, [Accessed: 2020-02-10]. [Online]. Available:
https://https://en.wikipedia.org/wiki/Apache Kafka
[19] The Apache Software Foundation, “Apache Kafka - Introduction,” 2020, [Accessed:
2020-02-06]. [Online]. Available: https://kafka.apache.org/intro
[20] K. M. M. Thein, “Apache Kafka: Next Generation Distributed Messag-
ing System,” International Journal of Scientific Engineering and Technology
Research, vol. 03, no. 47, pp. 9478–9483, Dec. 2014. [Online]. Available:
http://ijsetr.com/uploads/436215IJSETR3636-621.pdf
[21] Benjamin Reed, “ProjectDescription,” 2012, [Ac-
cessed: 2020-02-10]. [Online]. Available:
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription
[22] Various, “Apache Nifi,” 2019, [Accessed: 2020-02-06]. [Online]. Available:
https://https://en.wikipedia.org/wiki/Apache NiFi
[23] Apache NiFi Team, “Apache NiFi Overview,” 2020, [Accessed: 2020-02-10]. [Online].
Available: https://nifi.apache.org/docs.html
[24] Mastercard Incorporated, “Using NIFI to simplify data flow & streaming
use cases @ Mastercard,” 2018, [Accessed: 2020-02-19]. [Online]. Available:
https://www.youtube.com/watch?v=JjjjtgZIK6I
[25] Comcast Corporation, “Data Ingest Self Service and Management us-
ing Nifi and Kafta,” 2017, [Accessed: 2020-02-19]. [Online]. Available:
https://www.youtube.com/watch?v=YGo7Ggvaguc
55
[26] Hashmap Incorporated, “Powered by Apache NiFi,” 2020, [Accessed: 2020-02-19].
[Online]. Available: http://nifi.apache.org/powered-by-nifi.html
[27] Groupe Renault, “Best practices and lessons learnt from Running Apache
NiFi at Renault,” 2018, [Accessed: 2020-02-19]. [Online]. Available:
https://www.youtube.com/watch?v=rF7FV8cCYIc
[28] Apache NiFi Team, “Apache NiFi In Depth,” 2020, [Accessed: 2020-02-25]. [Online].
Available: http://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
[29] ——, “NiFi Developer’s Guide,” 2020, [Accessed: 2020-02-06]. [Online]. Available:
http://nifi.apache.org/developer-guide.html
[30] The Apache Software Foundation, “Apache Nifi - Features,” 2020, [Accessed:
2020-02-03]. [Online]. Available: https://nifi.apache.org/index.html
[31] Apache NiFi Team, “NiFi System Administrator’s Guide,” 2020, [Accessed: 2020-02-
12]. [Online]. Available: https://nifi.apache.org/docs/nifi-docs/html/administration-
guide.html
[32] Hortonworks, “Apache NiFi Security Reference,” 2019, [Accessed: 2020-02-
26]. [Online]. Available: https://docs.cloudera.com/HDPDocuments/HDF3/HDF-
3.4.0/nifi-security/hdf-nifi-security.pdf
[33] Bryan Bende, “Integrating Apache NiFi and Apache Kafka,” 2016, [Accessed: 2020-02-
19]. [Online]. Available: https://bryanbende.com/development/2016/09/15/apache-
nifi-and-apache-kafka
[34] DataWorks Summit, “Intelligently collecting data at the edge—intro
to Apache MiNiFi,” 2018, [Accessed: 2020-03-07]. [Online]. Available:
https://youtu.be/4m3Uuz3RpLg
[35] The Apache Software Foundation, “PublishKafka,”
2020, [Accessed: 2020-02-19]. [Online]. Available:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-0-
nar/1.9.2/org.apache.nifi.processors.kafka.pubsub.PublishKafka 2 0/
additionalDetails.html
[36] ——, “ConsumeKafka,” 2020, [Accessed: 2020-02-19]. [Online]. Avail-
able: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-
2-0-nar/1.9.2/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka 2 0/
additionalDetails.html
[37] ——, “Spark Streaming Programming Guide,” 2020, [Accessed: 2020-04-07]. [Online].
Available: https://spark.apache.org/docs/latest/streaming-programming-guide.html
56
[38] ——, “Apache Storm - Simple API,” 2019, [Accessed: 2020-04-07]. [Online].
Available: http://storm.apache.org/about/simple-api.html
[39] Microsoft, “Data Pipeline Pricing,” 2020, [Accessed: 2020-04-07]. [Online]. Available:
https://azure.microsoft.com/en-us/pricing/details/data-factory/data-pipeline/
[40] Elasticsearch B.V., “Logstash,” 2020, [Accessed: 2020-04-07]. [Online]. Available:
https://www.elastic.co/guide/en/logstash/current/introduction.html
[41] Various, “Programmable logic controller,” 2020, [Accessed: 2020-02-10]. [Online].
Available: https://en.wikipedia.org/wiki/Programmable logic controller
[42] The Apache Software Foundation, “PLC4x,” 2020, [Accessed: 2020-02-10]. [Online].
Available: https://plc4x.apache.org/
[43] iba AG, “iba system,” 2020, [Accessed: 2020-02-11]. [Online]. Available:
https://www.iba-ag.com/en/iba-system/
[44] ——, “ibaDatCoordinator,” 2020, [Accessed: 2020-02-11]. [Online]. Available:
https://www.iba-ag.com/en/ibadatcoordinator/
[45] Docker Inc., “Why Docker?” 2020, [Accessed: 2020-02-11]. [Online]. Available:
https://www.docker.com/why-docker
[46] ——, “The Industry-Leading Container Runtime,” 2020, [Accessed: 2020-02-11].
[Online]. Available: https://www.docker.com/products/container-runtime
[47] The Kubernetes Authors, “What is Kubernetes,” 2020, [Accessed: 2020-
02-11]. [Online]. Available: https://kubernetes.io/docs/concepts/overview/what-is-
kubernetes/
[48] Inductive Automation, “Ignition,” 2020, [Accessed: 2020-02-11]. [Online]. Available:
https://inductiveautomation.com/ignition/
[49] Confluent Incorporated, “Kafka Security,” 2020, [Accessed: 2020-02-25]. [Online].
Available: https://docs.confluent.io/3.0.0/kafka/security.html
[50] The Apache Software Foundation, “Apache Avro,” 2020, [Accessed: 2020-02-25].
[Online]. Available: https://avro.apache.org/docs/current/
[51] ——, “Apache Kafka - Documentation,” 2020, [Accessed: 2020-02-26]. [Online].
Available: https://kafka.apache.org/documentation/#design
[52] Python Software Foundation, “kafka-python 2.0.1,” 2020, [Accessed: 2020-04-06].
[Online]. Available: https://pypi.org/project/kafka-python/
57
[53] Dataflair team, “HDFS Tutorial A Complete Hadoop HDFS Overview,” 2020,
[Accessed: 2020-03-12]. [Online]. Available: https://data-flair.training/blogs/hadoop-
hdfs-tutorial/
[54] The Apache Software Foundation, “Apache Hadoop,” 2020, [Accessed: 2020-04-01].
[Online]. Available: http://hadoop.apache.org/
[55] Szele Balint, “The Small Files Problem,” 2009, [Accessed: 2020-04-08]. [Online].
Available: https://blog.cloudera.com/the-small-files-problem/
[56] Various, “MariaDB,” 2020, [Accessed: 2020-03-31]. [Online]. Available:
https://en.wikipedia.org/wiki/MariaDB
[57] MariaDB, “MariaDB Enterprise Server,” 2020, [Accessed: 2020-03-31]. [Online].
Available: https://mariadb.com/docs/features/mariadb-enterprise-server/
[58] The Apache Software Foundation, “Hadoop: Setting up a
single node cluster.” 2020, [Accessed: 2020-04-07]. [Online].
Available: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-
common/SingleCluster.html#Pseudo-Distributed Operation
58
A Appendix
All appendix content is uploaded to a GitLab repository, apart from the pictures in Ap-
pendix A.5. Specific links are given to each appendix, link to full repository is:
https://git.cse.kau.se/linavilh100/appendixapachenifi.git
A.1 Python Script for Processing Kafka Messages
https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/pythonscript
The Python code is for messages that originated in the MariaDB table, which are the
files that contain a timestamp for when they were created. The other files were processed
with a similar script, where the only difference is that the topic name was topic3 instead
of topic2, and the file that is written to is called secondfile.txt instead of file.txt.
This was done to facilitate finding the timestamp files when extracting the raw data.
A.2 SQL Script for Loading Rows into MariaDB
https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/sqlscript
A.3 Software Download Links
https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/downloads
A.4 Raw Data
https://git.cse.kau.se/linavilh100/appendixapachenifi/-/tree/master/data
59
A.5 Pictures
Figure A.1: Full-size version of Figure 3.4
60
Figure A.2: Full-size version of Figure 4.2
61
Figure A.3: Full-size version of Figure 4.3
62