Tanzu Greenplum
Connector for Apache NiFi
Tanzu Greenplum Connector for Apache NiFi 1.1
You can find the most up-to-date technical documentation on the VMware by Broadcom website at:
https://techdocs.broadcom.com/
VMware by Broadcom
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com
Copyright © 2025 Broadcom. All Rights Reserved. The term “Broadcom” refers to Broadcom Inc. and/or its
subsidiaries. For more information, go to https://www.broadcom.com. All trademarks, trade names, service marks,
and logos referenced herein belong to their respective companies.
Contents
VMware Greenplum Connector for Apache NiFi 1.1 Documentation
VMware Greenplum Connector for Apache NiFi 1.x Release Notes
  Supported Platforms
  Upgrading to Version 1.x
  Release 1.1.1
    Resolved Issues
  Release 1.1.0
    Changes
    Resolved Issues
  Release 1.0.1
    Changes
  Release 1.0.0
    Features
  Known Issues and Limitations
Real-Time Data Ingestion with Greenplum Database and Apache NiFi
  Next Steps
Installing the Connector
  Prerequisites
  Downloading the Connector Package
  Registering the Connector with Apache NiFi
Upgrading the Connector
  Prerequisites
  Downloading the Connector Package
  Registering the Connector with Apache NiFi
Using the Apache NiFi User Interface
  Launching the Apache NiFi User Interface
  Creating a DataFlow
    Adding a Processor to the Canvas
    About the Context Menu
    Connecting Processors in a DataFlow
  Starting the DataFlow
  Additional Apache NiFi References
Loading Data with the Connector
  Prerequisites
  Configuring the Greenplum Adapter Controller Service
  Identifying the Input Data Format and Schema
    About the Data Schema
  Configuring the Record Reader Controller Service
  Creating the Target Greenplum Table
  Building the DataFlow
  Starting the DataFlow
  Checking the Load Operation Results
Configuring the Connector
  PutGreenplumRecord Settings
  PutGreenplumRecord Schedule
  PutGreenplumRecord Properties
    About the Insert, Merge, and Update Properties
    Specifying Field and Column Mapping Properties
    Specifying Failure Rollback Behavior
    Choosing a Maximum Record Batch Size
  About the Record Reader
  About the Greenplum Adapter
Example - Loading CSV Data from the File System
  Prerequisites
  Process
    Step 1: Prepare the Example Environment
    Step 2: Add and Configure the GetFile Processor
    Step 3: Configure a GreenplumGPSSAdapter Controller Service
    Step 4: Identify the Input Data Source, Format, and Schema
    Step 5: Configure a Record Reader Controller Service
    Step 6: Add and Configure the PutGreenplumRecord Processor
    Step 7: Connect and Start the Processors
    Step 8: Create the Greenplum Database and Table
    Step 9: Trigger the Flow and Check Results
VMware Greenplum Connector for Apache NiFi 1.1 Documentation
This documentation describes how to install, configure, and use the VMware Greenplum Connector for
Apache NiFi.
Key topics in the VMware Greenplum Connector for Apache NiFi Documentation include:
Release Notes
Overview of the Connector
Installing the Connector
Using the Apache NiFi User Interface
Loading Data with the Connector
Configuring the Connector
Example: Loading CSV Data from the File System
VMware Greenplum Connector for Apache NiFi 1.x Release Notes
You can use the VMware Greenplum Connector for Apache NiFi in an Apache NiFi dataflow to load
record-oriented data from any source into VMware Greenplum. The Connector uses the VMware Greenplum
Streaming Server to load the data in parallel.
Supported Platforms
The following table identifies the supported component versions for the VMware Greenplum Connector for
Apache NiFi version 1.x:

Connector Version   Apache NiFi Version      Greenplum Version   Greenplum Streaming Server Version
1.1.0, 1.0.1        1.10.x, 1.11.x, 1.12.x   6.x                 1.4.1+ (1)
1.0.0               1.10.x, 1.11.x, 1.12.x   6.x                 1.4.1 - 1.6.0

(1) Refer to Known Issues and Limitations for specific version caveats.
Component documentation references:
Apache NiFi
VMware Greenplum
VMware Greenplum Streaming Server
Upgrading to Version 1.x
If you are currently using the VMware Greenplum Connector for Apache NiFi, you may be required to
perform upgrade actions for this release. Review Upgrading the Connector to plan your upgrade.
Release 1.1.1
Release Date: December 20, 2023
Resolved Issues
VMware Greenplum Connector for Apache NiFi 1.1.1 resolves these issues:
N/A
Resolves CVE-2022-42889 by ceasing shipment of Apache commons-text library.
N/A
Resolves CVE-2023-1428, CVE-2023-32731, and CVE-2023-32731 by updating gRPC libraries (and
their transitive dependencies) to version 1.53.0.
Release 1.1.0
Release Date: September 9, 2022
Changes
When a column in the target Greenplum Database table has no matching field in an incoming NiFi record
and the PutGreenplumRecord processor Unmatched Column Behavior configuration setting specifies
either Ignore Unmatched Columns or Warn on Unmatched Columns, the VMware Greenplum Connector for
Apache NiFi sends no column value for the record to GPSS. If the Greenplum table definition includes a
default value for the column, Greenplum Database writes the tuple with the default column value; otherwise
Greenplum writes the tuple with a NULL column value.
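The outcome described above can be modeled in a few lines of Python. This is only an illustrative sketch of the result that ends up in the table, not Connector or GPSS code; the function name, the NO_DEFAULT sentinel, and the sample table are hypothetical.

```python
from typing import Any, Dict, Optional

# Sentinel meaning "the table definition declares no default for this column".
NO_DEFAULT = object()


def resolve_row(record: Dict[str, Any],
                column_defaults: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Return the row stored for one record when unmatched columns are ignored.

    column_defaults maps every table column to its default value, or to
    NO_DEFAULT when the column has no default.
    """
    row: Dict[str, Optional[Any]] = {}
    for column, default in column_defaults.items():
        if column in record:
            row[column] = record[column]   # value supplied by the NiFi record
        elif default is not NO_DEFAULT:
            row[column] = default          # no value sent; the default applies
        else:
            row[column] = None             # no value and no default: NULL
    return row


# Hypothetical table: "status" has a default value, "note" does not.
columns = {"id": NO_DEFAULT, "status": "new", "note": NO_DEFAULT}
print(resolve_row({"id": 7}, columns))
```

In this model the record supplies only id, so status takes its default and note is stored as NULL (None in Python).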
Resolved Issues
VMware Greenplum Connector for Apache NiFi 1.1.0 resolves these issues:
32345
Resolves an issue where a default column value was not applied when a NiFi record had no
matching field; see Changes above.
Release 1.0.1
Release Date: June 14, 2022
Changes
VMware Greenplum Connector for Apache NiFi version 1.0.1 adds support for Greenplum Streaming Server
versions 1.6.1+.
Release 1.0.0
Release Date: December 16, 2020
Features
The VMware Greenplum Connector for Apache NiFi 1.0.0 release supports inserting, merging, and updating
record-oriented data in Greenplum Database. You can set up an Apache NiFi dataflow that sends Avro,
CSV, Parquet, JSON, or XML data to the Connector PutGreenplumRecord processor to write to a
Greenplum table.
Known Issues and Limitations
The VMware Greenplum Connector for Apache NiFi 1.0.x has these known issues and limitations:
Using the Connector with Greenplum Streaming Server versions prior to v1.7.1 when loading
timestamp without timezone type data into Greenplum Database may expose a data type
conversion issue. This issue is resolved in Greenplum Streaming Server version 1.7.1; be sure to
use the VMware Greenplum Connector for Apache NiFi version 1.0.1+ should you encounter this
scenario.
If you run the Connector against Greenplum Streaming Server version 1.4.x or older, you must
specify "ReuseTables": false in the gpss.json server configuration file that you use to start the
Greenplum Streaming Server.
The Connector does not support SSL-encrypted connections to the Greenplum Streaming Server.
A dataflow that you create that routes Avro data from the QueryDatabaseTableRecord processor to
the PutGreenplumRecord processor must set the Use Avro Logical Types property to true to
send date and time data in the format supported by the Connector.
Apache NiFi bug NIFI-7817 causes the built-in ParquetReader record reader to fail when you use it
with Apache NiFi version 1.12.
Decimal types may lose precision when you use the Connector with Apache NiFi versions 1.10 and
1.11.
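For the ReuseTables limitation listed above, a quick check of the server configuration file can be sketched as follows. Because this section does not state which object inside gpss.json holds the key, the sketch searches the whole parsed document. The sample configuration text and the "Section" name are hypothetical and far smaller than a real file.

```python
import json
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in parsed JSON."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def reuse_tables_disabled(config_text: str) -> bool:
    """True only if ReuseTables is present and every occurrence is false."""
    values = list(find_key(json.loads(config_text), "ReuseTables"))
    return bool(values) and all(value is False for value in values)


# Hypothetical, heavily trimmed configuration text for illustration only.
sample = '{"Section": {"ReuseTables": false}}'
print(reuse_tables_disabled(sample))
```

To check a real file, read its text (for example with pathlib.Path.read_text) and pass it to reuse_tables_disabled.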
Real-Time Data Ingestion with Greenplum Database and Apache NiFi
VMware Greenplum is a massively parallel processing database server specially designed to manage large
scale analytic data warehouses and business intelligence workloads. Apache NiFi is a framework that
provides an interactive user interface through which you create and manage automated dataflows between
systems. The VMware Greenplum Connector for Apache NiFi provides organizations a fast and simple way
to build data ingestion pipelines for Greenplum Database, code-free.
You can use the web-based Apache NiFi user interface and built-in NiFi processors to set up a data pipeline
that employs the Connector’s PutGreenplumRecord processor to load record-oriented data into Greenplum
Database for subsequent analytics.
The Connector:
Utilizes the drag-and-drop-based Apache NiFi user interface for component and data pipeline
configuration.
Supports CSV, Avro, Parquet, JSON, and XML input data formats using built-in NiFi Record
Readers.
Converts NiFi records into Greenplum tuples.
Loads the tuples into Greenplum Database.
The VMware Greenplum Connector for Apache NiFi uses the Greenplum Streaming Server to load data in
parallel into Greenplum Database. This facilitates higher concurrency and throughput during data ingestion
compared to a JDBC-based NiFi processor, with less load on the Greenplum Database master host.
Next Steps
Install the Connector and register it with Apache NiFi.
Review an introduction to the Apache NiFi user interface in Using the Apache NiFi User Interface.
Examine the data load procedure described in Loading Data with the Connector.
Try out the Loading CSV Data from the File System example.
Installing the Connector
The VMware Greenplum Connector for Apache NiFi is available as a separate download for VMware
Greenplum 6.x from the Broadcom Support Portal.
Prerequisites
You can run a single instance of Apache NiFi, or run NiFi in a clustered environment. Before installing the
Connector, ensure that you meet the following prerequisites:
Apache NiFi requires Java version 8 or 11. Install Java 8 or 11 on your NiFi host(s).
You have installed Apache NiFi on your single or cluster host(s). Refer to Downloading and
Installing NiFi in the Apache NiFi documentation for instructions.
You have administrative access to the NiFi host(s).
Downloading the Connector Package
The Connector is available as a separate download for Greenplum Database 6.x from the Broadcom Support
Portal. The Connector download package is a .tar.gz file; it includes the Apache NiFi NAR files for the
Connector and an installation script.
Perform these steps to download the Connector package:
1. Navigate to the Greenplum Database product on the Broadcom Support Portal, then select
Greenplum Connector for Apache NiFi 1.1.0 under the desired Greenplum release.
The format of the Connector download file name is
greenplum-connector-apache-nifi-<version>.tar.gz. For example:
greenplum-connector-apache-nifi-1.1.0.tar.gz
2. Make note of the directory to which the file was downloaded.
Note: This documentation uses the environment variable $NIFI_HOME to identify
the base directory of the Apache NiFi installation on the system. You may choose
to set this environment variable in your shell login start up script on the host(s).
For example: export NIFI_HOME=/usr/local/nifi.
For more information about download prerequisites, troubleshooting, and
instructions, see Download Broadcom products and software.
3. Extract the Connector download package. For example:
$ mkdir gpnifi_work
$ cd gpnifi_work
$ tar xzf downloadir/greenplum-connector-apache-nifi-1.1.0.tar.gz
This command extracts the following files and directories to the current working directory:
File/Directory   Description
commit.sha       The commit identifier for this Connector release.
install.sh       A Connector install script that installs to $NIFI_HOME/lib.
nars/            The directory containing the Connector NAR files.
version          The version of this Connector release.
4. If you are running an Apache NiFi cluster, copy the Connector download package to all NiFi hosts,
and extract as described.
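After extracting the package on a host, you may want to confirm that the expected entries are present before you register the Connector. The following is a minimal sketch that checks a work directory against the four entries listed in the table above; the function name and the gpnifi_work path in the usage comment are illustrative only.

```python
from pathlib import Path
from typing import List

# Entries that the extracted package is documented to contain.
EXPECTED_ENTRIES = ("commit.sha", "install.sh", "nars", "version")


def missing_entries(work_dir: Path) -> List[str]:
    """Return the expected package entries that are absent from work_dir."""
    return [name for name in EXPECTED_ENTRIES
            if not (work_dir / name).exists()]


# Example usage (not executed here):
#   missing = missing_entries(Path("gpnifi_work"))
#   if missing:
#       raise SystemExit(f"incomplete extraction, missing: {missing}")
```

On a cluster, the same check can be run on each NiFi host after you copy and extract the package there.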
Registering the Connector with Apache NiFi
You register the Connector with Apache NiFi by copying the NAR files in nars/ to the Apache NiFi
installation on the host(s). Choose one of these options for registration:
Run the install.sh script to copy the NAR files to $NIFI_HOME/lib. If you choose this option,
you must restart Apache NiFi after you run the script, and you may be required to re-register the
Connector NAR file after you upgrade Apache NiFi.
1. Ensure that you have set the $NIFI_HOME environment variable, that it identifies your
Apache NiFi installation directory, and that you have permission to write to this directory.
2. Run the install script:
$ ./install.sh
Removing old Greenplum NiFi Connector artifacts ...
Installing new Greenplum NiFi Connector (version 1.1.0) ...
...
Successfully installed Greenplum NiFi Connector (version 1.1.0) into /usr/local/nifi/lib
The script removes any previously-installed Connector artifacts in $NIFI_HOME/lib before
it copies the contents of nars/* to that directory.
Copy the NAR files to the Apache NiFi autoload directory (the directory specified by the
nifi.nar.library.autoload.directory property value in the nifi.properties file). NiFi
auto-loads NAR files it finds in this directory, and does not require a restart.
Copy the NAR files to a location of your choosing and set this location in the
nifi.nar.library.directory.<custom> property. You must restart Apache NiFi.
If you are running an Apache NiFi cluster, be sure to register the Connector NAR files on each NiFi host.
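The autoload option above can be scripted. The sketch below reads the nifi.nar.library.autoload.directory property from $NIFI_HOME/conf/nifi.properties and copies every .nar file from the extracted nars/ directory into it. It assumes that a relative property value is resolved against the NiFi home directory and that the properties file uses plain key=value lines; both are assumptions of this example, and the function names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def read_properties(path: Path) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def copy_nars_to_autoload(nars_dir: Path, nifi_home: Path) -> List[str]:
    """Copy *.nar files into the configured autoload directory."""
    props = read_properties(nifi_home / "conf" / "nifi.properties")
    autoload = Path(props["nifi.nar.library.autoload.directory"])
    if not autoload.is_absolute():
        # Assumption: relative values are relative to the NiFi home directory.
        autoload = nifi_home / autoload
    autoload.mkdir(parents=True, exist_ok=True)
    copied: List[str] = []
    for nar in sorted(nars_dir.glob("*.nar")):
        shutil.copy2(nar, autoload / nar.name)
        copied.append(nar.name)
    return copied


# Example usage (not executed here):
#   copy_nars_to_autoload(Path("gpnifi_work/nars"), Path("/usr/local/nifi"))
```

The function returns the names of the files it copied, which makes it easy to log the result on each host in a cluster.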
Upgrading the Connector
If you are using the VMware Greenplum Connector for Apache NiFi in your current Apache NiFi installation,
you must perform some upgrade actions when you install a new version of the Connector:
1. Satisfy the prerequisites.
2. Download the new version of the Connector.
3. Register the new version of the Connector with Apache NiFi.
Prerequisites
You can run a single instance of Apache NiFi, or run NiFi in a clustered environment. Before you upgrade
the Connector, ensure that you meet the following prerequisites:
You can identify the base directory of the Apache NiFi installation on the system (possibly
$NIFI_HOME).
You can identify the mode in which you are running Apache NiFi (single instance or clustered).
You can identify the method in which you previously registered the Connector (ran install.sh or
copied NAR files to autoload or custom directory).
You have administrative access to the NiFi host(s).
Downloading the Connector Package
Download and unpack the new version of the Connector from the Broadcom Support Portal as described in
Downloading the Connector Package.
For more information about download prerequisites, troubleshooting, and instructions, see
Download Broadcom products and software.
Registering the Connector with Apache NiFi
You register a new version of the Connector with Apache NiFi by copying the new NAR files to the Apache
NiFi installation on the host(s).
Your registration tasks will differ depending on how you initially registered the Connector. (If you are running
an Apache NiFi cluster, be sure to register the Connector NAR files on each NiFi host.)
If you ran the install.sh script to register the Connector:
1. Ensure that you have set the $NIFI_HOME environment variable, that it identifies your Apache NiFi
installation directory, and that you have permission to write to this directory.
2. Run the install script:
$ ./install.sh
Removing old Greenplum NiFi Connector artifacts ...
Installing new Greenplum NiFi Connector (version 1.1.0) ...
...
Successfully installed Greenplum NiFi Connector (version 1.1.0) into /usr/local/nifi/lib
The script removes any previously-installed Connector artifacts in $NIFI_HOME/lib before it copies
the contents of nars/* to that directory.
3. Restart Apache NiFi.
If you copied the NAR files to the Apache NiFi autoload directory
(nifi.nar.library.autoload.directory):
1. Remove the previous version Connector NAR files from the directory.
2. Copy the new Connector NAR files to the autoload directory.
If you copied the NAR files to a location of your choosing (nifi.nar.library.directory.<custom>):
1. Remove the previous version Connector NAR files from the directory.
2. Copy the new Connector NAR files to the custom directory.
3. Restart Apache NiFi.
Using the Apache NiFi User Interface
You run the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum Connector
for Apache NiFi to load data into Greenplum Database.
The Apache NiFi user interface operates on the following components:
FlowFile - an object moving through the system; may be record-based
Processor - the interface through which NiFi provides access to a FlowFile; routes, transforms,
extracts information from a FlowFile
Relationship - one or more routes to which a FlowFile is transferred from a Processor
Connection - a link between Processors that represents one or more Relationships
Controller Service - an extension that provides information for use by other components; for
example, a service to: configure SSL, configure Greenplum Database connection properties, or
serialize CSV data into a record-oriented format
A dataflow that you create with the Apache NiFi user interface to load data into Greenplum will link multiple
processors, one of which will be the Connector PutGreenplumRecord processor.
This topic provides a basic introduction to using the Apache NiFi user interface to create a dataflow. Refer
to the Apache NiFi User Interface Documentation for detailed information about using this interface.
Launching the Apache NiFi User Interface
The Apache NiFi user interface is an interactive interface through which you create and manage automated
dataflows. You run the NiFi user interface in a browser window by specifying the following URL:
http://<nifi_hostname>:<nifi_port>/nifi.
The default NiFi port number is 8080. To determine the port number for your installation, examine the
nifi.web.http.port property setting in the $NIFI_HOME/conf/nifi.properties file and note the value.
If you are running NiFi on your local system using the default port, enter the following URL in a browser
window to launch the NiFi user interface:
http://localhost:8080/nifi
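If you prefer to derive the URL from the configuration rather than read it by eye, a small helper can do so. This sketch takes the text of nifi.properties, looks for nifi.web.http.port, and falls back to the default port 8080 mentioned above when the property is absent or empty. The function name and the example host name are hypothetical.

```python
def nifi_ui_url(properties_text: str, host: str = "localhost") -> str:
    """Build the NiFi user interface URL from nifi.properties content."""
    port = "8080"  # default port stated in this topic
    for line in properties_text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Commented-out lines start with '#', so their key never matches.
        if sep and key.strip() == "nifi.web.http.port" and value.strip():
            port = value.strip()
    return f"http://{host}:{port}/nifi"


print(nifi_ui_url("nifi.web.http.port=9090\n", "nifi.example.com"))
print(nifi_ui_url(""))
```

The first call prints a URL with port 9090 for the example host; the second falls back to http://localhost:8080/nifi.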
When you start the NiFi user interface, you are presented with a canvas on which you create your dataflow.
The Apache NiFi canvas includes:
A Components Toolbar that consists of the components that you can drag and drop onto the NiFi
canvas.
An Operate Palette that includes buttons that allow you to manage the flow; you can configure,
activate/deactivate, start/stop, or delete a component. You can also manage user access and
configure system properties from this palette.
A Status Bar that provides runtime information about the flow, including thread counts, data transfer
amounts, and a refresh timestamp.
A Navigate Palette that allows you to pan around the canvas.
The canvas also provides component search capabilities, and a global menu whose options allow you to
manipulate components on the canvas.
Creating a DataFlow
To create a dataflow, you drag and drop Processor components onto the NiFi canvas and then connect
them with a Connection component.
Adding a Processor to the Canvas
The Processor icon is located in the NiFi Components Toolbar.
To add a Processor to the canvas:
1. Drag a Processor component from the Components Toolbar to the canvas and drop it there.
The Add Processor dialog displays.
2. Choose the Processor you want to add by scrolling through the list, selecting a search term from
the left pane, or entering the processor name in the Filter field in the upper right corner of the
dialog.
3. Select the desired Processor from the table, and double-click or press ADD.
The Add Processor dialog closes, and the Processor component is added to the canvas.
A processor component that you add to the canvas is in the stopped state.
Refer to Adding Components to the Canvas in the Apache NiFi documentation for detailed information about
adding a processor.
About the Context Menu
You most often interact with a component on the canvas via its context menu, which you display by
right-clicking on the component. The menu options available vary based on the type of component and your
privileges, and include Configure, Start, and Enable/Disable items.
You can operate on one or more selected components on the canvas via the buttons on the Operate Palette.
After you add a processor, you must Configure it; configuration properties are processor-specific. For
example, Configuring the Connector describes the configuration properties for the PutGreenplumRecord
processor.
Connecting Processors in a DataFlow
You initiate a relationship by connecting processors in a dataflow. Connect processors by hovering over the
source processor, clicking the connection icon (the green highlighted arrow) in the source processor,
and dragging and dropping it on to the destination processor.
The Create Connection dialog displays. You use the DETAILS and SETTINGS tabs on this dialog to
configure the connection, including its name, thresholds, prioritization, and load balance strategy.
Connecting Components in the Apache NiFi documentation describes the available connection configuration
properties.
A connection is represented on the NiFi canvas as an object between the processors, and includes a line
with a directed arrow from source to destination:
Starting the DataFlow
A processor component that you add to the canvas is in the stopped state. A processor must be enabled
and started before it can be triggered. The processor scheduling strategy determines when and how a
processor is triggered.
Start a component by right-clicking on the component and selecting Start from the component context
menu. Or, start multiple components by selecting each component that you want to start, and pressing the
start button in the Operate Palette.
Additional Apache NiFi References
Check out these Apache NiFi documentation references for more detailed information about the framework:
Apache NiFi Overview
Getting Started with Apache NiFi
Apache NiFi User Guide
Loading Data with the Connector
You use the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum Connector
for Apache NiFi PutGreenplumRecord processor to load record-oriented data from any source into
Greenplum Database. You will perform the following tasks when you use the Connector to load data into a
Greenplum table:
1. Ensure that you meet the prerequisites.
2. Configure the Greenplum adapter controller service.
3. Identify the format and schema of the input data.
4. Configure the record reader controller service.
5. Create the target Greenplum Database table.
6. Build a dataflow that uses the PutGreenplumRecord processor.
7. Start the dataflow.
8. Check the load operation results.
Prerequisites
Before you set up a dataflow using the Connector, ensure that:
You have access to a running Greenplum Database cluster, and you can identify the host name or
IP address of the master host and the port number on which the master server is running if it is not
the default port (5432).
Note the Greenplum master host and port number.
You can identify the host name or IP address and port number of a running Greenplum Streaming
Server instance to which the Connector will direct load requests. Or, you configure and start a new
streaming server instance as described in Configuring and Managing the Streaming Server in the
Greenplum Streaming Server documentation.
Be sure to note the host and port number of the Greenplum Streaming Server instance.
You can identify:
The name of the Greenplum database.
The name of the Greenplum Database table you want to load data into, and the name of
the schema in which it resides.
The user/role name and password that you will use to access Greenplum Database. This
role must be assigned certain privileges to the Greenplum database, schema, and table as
described in Configuring Greenplum Database Role Privileges in the Greenplum Streaming
Server documentation.
Note: If you are running the Connector against Greenplum Streaming Server
version 1.4.x or older, you must specify "ReuseTables": false in the gpss.json
server configuration file that you use to start the Greenplum Streaming Server.
You have registered the Greenplum Streaming Server extension in the Greenplum database as
described in Registering the GPSS Extension in the Greenplum Streaming Server documentation.
You can identify the Apache NiFi server host and port number.
Network connectivity exists between the Apache NiFi host(s) and the Greenplum Streaming Server
host.
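The network connectivity prerequisite can be spot-checked from each NiFi host with a plain TCP connection attempt. The following sketch only verifies that a TCP connection can be opened; it says nothing about whether the streaming server accepts load requests. The host name and port in the usage comment are placeholders, not documented defaults.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (not executed here), with placeholder values:
#   if not can_connect("gpss.example.com", 5000):
#       print("NiFi host cannot reach the Greenplum Streaming Server")
```

Run the check on every NiFi host in a cluster, since each host connects to the streaming server independently.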
Configuring the Greenplum Adapter Controller Service
The PutGreenplumRecord processor uses a controller service to manage the connection to Greenplum
Database.
In the Prerequisites section above, you identified the Greenplum Database master host and port number,
Greenplum Streaming Server host and port number, the Greenplum database, and the Greenplum user/role
name and password. You create and configure an instance of the GreenplumGPSSAdapter controller service
for each unique combination of these property settings as described in About the Greenplum Adapter.
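To see how many controller service instances a set of dataflows needs, you can treat each combination of settings as a value and count the distinct ones. The sketch below does this with a frozen dataclass; the field names are illustrative labels for the settings listed above, not the actual property names of the GreenplumGPSSAdapter service, and all host names, ports other than 5432, and credentials are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class AdapterSettings:
    """One combination of connection settings (hypothetical field names)."""
    gp_master_host: str
    gp_master_port: int
    gpss_host: str
    gpss_port: int
    database: str
    user: str
    password: str


def distinct_adapters(targets: Iterable[AdapterSettings]) -> Set[AdapterSettings]:
    """Each distinct combination needs its own controller service instance."""
    return set(targets)


# Two flows share every setting; a third loads into another database.
sales = AdapterSettings("mdw", 5432, "gpss1", 5000, "sales", "loader", "secret")
sales_again = AdapterSettings("mdw", 5432, "gpss1", 5000, "sales", "loader", "secret")
hr = AdapterSettings("mdw", 5432, "gpss1", 5000, "hr", "loader", "secret")
print(len(distinct_adapters([sales, sales_again, hr])))
```

In this example the three flows require two controller service instances, because two of them use identical settings.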
Identifying the Input Data Format and Schema
The PutGreenplumRecord processor can accept record-oriented data in the Avro, CSV, JSON, Parquet, and
XML formats, among others. The processor uses a Record Reader controller service to parse and
deserialize data in the incoming FlowFiles. The format of the data inside the FlowFile informs both your
choice of Record Reader, and the definition of the Greenplum Database table that the Connector loads data
into.
The Record Reader requires the schema of the input data in order to parse and deserialize it.
About the Data Schema
NiFi data records are described by a schema. The schema defines the names and types of the fields in the
input data records.
A NiFi schema definition is specified in Avro format. The example Avro schema below specifies a record
with three fields, and identifies the names and data types of the fields:
{
"name": "datatypes_record",
"type": "record",
"fields": [
{ "name": "lastname", "type": ["string", "null"] },
{ "name": "age", "type": ["int", "null"] },
{ "name": "birthdate", "type": {"type":"int", "logicalType":"date"} }
]
}
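Because the schema is ordinary JSON, you can inspect it programmatically. The sketch below parses the example schema with the standard json module and reports each field's name, its type (preferring the logical type when one is given), and whether a null value is allowed. The function name is hypothetical, and for unions the sketch simply takes the first non-null member.

```python
import json
from typing import List, Tuple

SCHEMA = """
{
  "name": "datatypes_record",
  "type": "record",
  "fields": [
    { "name": "lastname", "type": ["string", "null"] },
    { "name": "age", "type": ["int", "null"] },
    { "name": "birthdate", "type": {"type":"int", "logicalType":"date"} }
  ]
}
"""


def describe_fields(schema_text: str) -> List[Tuple[str, str, bool]]:
    """Return (field name, type, nullable) for each field of a record schema."""
    described = []
    for field in json.loads(schema_text)["fields"]:
        ftype = field["type"]
        nullable = False
        if isinstance(ftype, list):        # union such as ["string", "null"]
            nullable = "null" in ftype
            non_null = [t for t in ftype if t != "null"]
            ftype = non_null[0] if non_null else "null"
        if isinstance(ftype, dict):        # annotated type, e.g. a logical type
            ftype = ftype.get("logicalType", ftype["type"])
        described.append((field["name"], ftype, nullable))
    return described


for name, ftype, nullable in describe_fields(SCHEMA):
    print(name, ftype, "nullable" if nullable else "required")
```

For the example schema this reports lastname and age as nullable string and int fields, and birthdate as a required date.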
When you configure a Record Reader, you must specify the origin of the schema. A schema may be:
Inferred from the input data (auto-discovered).
Embedded in the input data, such as with Parquet and Avro files.
Explicitly specified when you configure the reader.
Retrieved from a schema registry.
Once you identify and specify the schema of the incoming FlowFiles, you have the information that you
need to select and configure the Record Reader and create the Greenplum Database table.
Configuring the Record Reader Controller Service
The PutGreenplumRecord processor uses a controller service to parse and deserialize incoming data in
FlowFiles.
You choose and configure a Record Reader type and instance that corresponds to the format of the data in
the incoming FlowFiles as described in About the Record Reader. You must specify the origin of the
schema when you configure the reader controller service.
Creating the Target Greenplum Table
You specify the name of the target Greenplum table when you configure the PutGreenplumRecord
processor. You must create this table before you initiate a NiFi dataflow; the Connector does not create the
table for you.
The data types that you specify for the target Greenplum Database table columns must match the data
types of the input FlowFile record fields. You can reference the schema of the input records, or the data
itself, to identify its type and definition.
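As a rough sketch of how a record schema can guide the table definition, the Python snippet below derives a CREATE TABLE statement from the fields of the example schema shown earlier. The type mapping, the table name datatypes_table, and the helper names are assumptions made for illustration; the Connector does not generate DDL, and you should choose column types that suit your data.

```python
# Hypothetical mapping from Avro primitive types to Greenplum column types.
AVRO_TO_GREENPLUM = {
    "string": "text",
    "int": "integer",
    "long": "bigint",
    "float": "real",
    "double": "double precision",
    "boolean": "boolean",
}

def column_type(avro_type):
    """Return a Greenplum column type for an Avro field type."""
    if isinstance(avro_type, list):
        # A union such as ["string", "null"]: use the first non-null branch.
        return column_type(next(t for t in avro_type if t != "null"))
    if isinstance(avro_type, dict):
        logical = avro_type.get("logicalType")
        if logical == "date":
            return "date"
        if logical == "decimal":
            return "numeric({},{})".format(avro_type["precision"], avro_type["scale"])
        return column_type(avro_type["type"])
    return AVRO_TO_GREENPLUM[avro_type]

def create_table_sql(table_name, fields):
    """Build a CREATE TABLE statement with one column per schema field."""
    columns = ", ".join(
        "{} {}".format(f["name"], column_type(f["type"])) for f in fields
    )
    return "CREATE TABLE {} ({});".format(table_name, columns)

# Fields copied from the example schema in About the Data Schema.
FIELDS = [
    {"name": "lastname", "type": ["string", "null"]},
    {"name": "age", "type": ["int", "null"]},
    {"name": "birthdate", "type": {"type": "int", "logicalType": "date"}},
]

DDL = create_table_sql("datatypes_table", FIELDS)
print(DDL)
```

You would still run the resulting statement yourself, for example in psql, before starting the dataflow.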
The column names that you specify when you create the Greenplum table must match the input FlowFile
field names with a few caveats. When you configure the PutGreenplumRecord processor, you specify if and
how you want the Connector to translate the field names to Greenplum table column names. You can:
- Turn case-sensitivity of the translation on/off.
- Specify the behavior of the Connector when there is no matching table column for one or more
  FlowFile record fields.
- Specify the behavior of the Connector when there is no matching FlowFile record field for one or
  more table columns.
Specifying Field and Column Name Mappings describes the available field-to-column translation
configuration options for the Connector.
The Greenplum table may be the target of INSERT, UPDATE, or MERGE operations. To update or merge
data in a Greenplum table, you must be able to identify a set of table columns that uniquely identifies a row
in the table. The Connector uses these Match Columns to locate existing table rows. About the Insert,
Merge, and Update Properties further describes these operations and configuration properties.
Building the DataFlow
Note: If you configure the Record Reader to infer the schema, Apache NiFi must read the whole
FlowFile to infer the schema from the input data; this is often inefficient.
When you build a dataflow, you drag components from the NiFi toolbar to the canvas, configure the
components, and then connect them together.
When you build a dataflow using the PutGreenplumRecord processor, you configure it as described above,
and in Configuring the Connector.
Remember to define the success, failure, and retry relationships for the FlowFiles processed by the
PutGreenplumRecord processor, or auto-terminate them.
Starting the DataFlow
A component that you add to a dataflow is in the stopped state. You must start all linked components to
initiate the flow.
Starting a component may also trigger a dataflow. In other cases, some external event triggers a flow, such
as a new file added to a directory, or a new message emitted by some external data source.
Checking the Load Operation Results
When a load operation succeeds, the PutGreenplumRecord processor emits a SEND provenance event.
These events are displayed in the global or processor-specific Data Provenance dialog.
You can detect load operation errors in one or more of the following ways:
- View messages written to the $NIFI_HOME/logs/nifi-app.log file.
- When the upper right corner of the processor component displays a red square, the processor has
  encountered one or more warnings or errors. Hovering over the red square pops up a dialog that
  displays the recent warnings and errors emitted by the processor.
- If you have configured FlowFile routing to a failure relationship, examine the connection and
  downstream processor components.
You can also query the target Greenplum table to verify the load.
Configuring the Connector
You use the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum Connector
for Apache NiFi PutGreenplumRecord processor to load record-oriented data from any source into
Greenplum Database.
The PutGreenplumRecord processor accepts record-based FlowFiles, sending the data to the Greenplum
Streaming Server to write to Greenplum Database. When you configure the processor, you must identify the
type and instance of the RecordReader that corresponds to the format of the data contained in incoming
FlowFiles, the Greenplum connection specifics, and the Greenplum schema and table.
The default load mode for the Connector is to insert data into Greenplum. You can configure the processor
to merge or update data instead, and the configuration properties for field-to-column translation and
mapping allow you to further specify these operations.
You configure the PutGreenplumRecord processor via the Configure Processor dialog. This dialog
includes SETTINGS, SCHEDULING, PROPERTIES, and COMMENTS tabs.
PutGreenplumRecord Settings
The SETTINGS tab specifies FlowFile routing and timeouts for the processor. You can also use this tab to
change the name of the processor and activate/deactivate the processor.
The Settings Tab documentation in the Apache NiFi User Guide describes the configuration options on this
tab.
PutGreenplumRecord Schedule
The SCHEDULING tab specifies the scheduling strategy, run schedule, and concurrency options for the
processor.
When you set Concurrent Tasks to a value greater than one, the processor runs with the specified number
of threads. The single PutGreenplumRecord processor instance processes multiple FlowFiles concurrently,
each managed by its own session.
The Scheduling Tab documentation in the Apache NiFi User Guide describes the configuration properties on
this tab.
PutGreenplumRecord Properties
The PROPERTIES tab of the Configure Processor dialog identifies the PutGreenplumRecord processor
configuration properties.
The Connector utilizes default values for many of the PutGreenplumRecord properties. You are required to
set the Record Reader, Greenplum Adapter, and Greenplum Table Name property values.
The PutGreenplumRecord processor configuration properties are listed and further described in the table
and topics below:
Property Name | Description | Default Value
Record Reader | The controller service that deserializes the input FlowFile. Required. |
Greenplum Adapter | The controller service that identifies and manages the Greenplum Database and Greenplum Streaming Server connection parameters. Required. |
Schema Name | The name of the Greenplum Database schema in which the target table resides. Required. | public
Table Name | The name of the target Greenplum table in which to load the data. Required. |
Operation Type | The type of load operation: INSERT, UPDATE, or MERGE. Required. | INSERT
Match Columns | The Greenplum table columns to match with the FlowFile record data. Required for the UPDATE and MERGE operation types. |
Translate Field Names | Boolean value that specifies if the Connector translates input FlowFile field names to Greenplum table column names. When true, the Connector uses case-insensitive matching and ignores underscores. When false, the Connector does not translate, and field and column names must match exactly. | true
Unmatched Field Behavior | Specifies the Connector's behavior when an incoming FlowFile record has a field that does not map to a column in the Greenplum table. | Ignore Unmatched Fields
Unmatched Column Behavior | Specifies the Connector's behavior when an incoming FlowFile record does not have a field mapping for every one of the Greenplum table columns. | Fail on Unmatched Columns
Rollback On Failure | Boolean value that specifies whether or not the Connector should roll back when it encounters an error processing a FlowFile. | false
Maximum Record Batch Size | Specifies the maximum number of records in each batch of data that the Connector will write to Greenplum. The Connector stores the batch in memory until it reaches this size. | 0 (write all records in a single transaction)
About the Insert, Merge, and Update Properties
The Connector supports inserting, merging, and updating records from a FlowFile into a Greenplum
Database table. You use the Operation Type property to specify the load mode:
Mode | Description
INSERT | Insert records as new rows into the Greenplum table (the default mode).
MERGE | Use Match Columns to match records to existing table rows, and update these rows with the data from the records. A record with no matching database row is inserted into the Greenplum table as a new row.
UPDATE | Use Match Columns to match records to existing table rows, and update these Greenplum table rows with data from the records.
Use operation.type Attribute | Obtain the load mode from an operation.type attribute in the FlowFile.
When Operation Type is UPDATE or MERGE, you must specify one or more Match Columns, a
comma-separated list of column names that uniquely identifies a row in the Greenplum table. The Connector
ignores the Match Columns property when the Operation Type is INSERT.
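The difference between the three load modes can be sketched with an in-memory table. The Python snippet below is only a simplified model of the semantics described above, not the Connector's implementation; the function name apply_records and the sample rows are assumptions made for illustration.

```python
def apply_records(table, records, operation, match_columns=()):
    """Apply records to a table (a list of dicts) and return the resulting rows."""
    if operation in ("UPDATE", "MERGE") and not match_columns:
        raise ValueError("Match Columns are required for UPDATE and MERGE")
    rows = [dict(row) for row in table]          # work on a copy
    for record in records:
        if operation == "INSERT":
            rows.append(dict(record))            # Match Columns are ignored
            continue
        key = tuple(record[c] for c in match_columns)
        matches = [r for r in rows if tuple(r[c] for c in match_columns) == key]
        for row in matches:
            row.update(record)                   # update the matching row
        if not matches and operation == "MERGE":
            rows.append(dict(record))            # MERGE inserts unmatched records
    return rows

existing = [{"dept_id": 1, "month": 10, "expenses": 5}]
incoming = [
    {"dept_id": 1, "month": 10, "expenses": 7},  # matches the existing row
    {"dept_id": 2, "month": 10, "expenses": 9},  # no matching row
]

merged = apply_records(existing, incoming, "MERGE", ("dept_id", "month"))
updated = apply_records(existing, incoming, "UPDATE", ("dept_id", "month"))
inserted = apply_records(existing, incoming, "INSERT")
print(len(merged), len(updated), len(inserted))
```

In this model, MERGE ends with two rows, UPDATE with one (the unmatched record is dropped), and INSERT with three.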
Specifying Field and Column Mapping Properties
The Connector exposes properties that allow you to choose how you want the Connector to map FlowFile
record fields to Greenplum Database table columns.
The Translate Field Names property is a boolean value that specifies if the Connector translates field
names in the FlowFile record into column names in the Greenplum table. The default value is true; the
processor uses case-insensitive matching and ignores underscores when it translates field names into
column names. When the value is false, the FlowFile field names must match the Greenplum table column
names exactly, or the column value will not be updated.
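To make the translation rule concrete, the Python sketch below treats two names as equal when they match after folding case and dropping underscores. This is one plausible reading of "case-insensitive matching and ignores underscores"; the helper names and sample field names are hypothetical, and the Connector's actual matching code may differ.

```python
def normalize(name):
    """Fold case and drop underscores so that Dept_ID and deptid compare equal."""
    return name.replace("_", "").lower()

def map_fields_to_columns(field_names, column_names, translate=True):
    """Return {field name: column name} for every field that matches a column."""
    if translate:
        lookup = {normalize(c): c for c in column_names}
        return {
            f: lookup[normalize(f)] for f in field_names if normalize(f) in lookup
        }
    # Without translation, names must match exactly.
    columns = set(column_names)
    return {f: f for f in field_names if f in columns}

fields = ["Dept_ID", "month", "EXPENSES"]
columns = ["dept_id", "month", "expenses"]

translated = map_fields_to_columns(fields, columns, translate=True)
exact = map_fields_to_columns(fields, columns, translate=False)
print(translated)
print(exact)
```

With translation on, all three fields find a column; with translation off, only month matches.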
When an incoming FlowFile record has a field that does not map to any of the columns in the Greenplum
table, set the Unmatched Field Behavior property to specify how the Connector should handle the
situation:
- Ignore Unmatched Fields - (the default) The Connector ignores any field in the FlowFile record
  that cannot be mapped to a column in the Greenplum table.
- Fail on Unmatched Fields - The Connector routes the FlowFile to the failure relationship when
  the record has any field that cannot be mapped to a column in the table.
If an incoming FlowFile record does not have a field mapping for every one of the columns in the Greenplum
table, set the Unmatched Column Behavior property to specify how the Connector should handle the
situation:
- Ignore Unmatched Columns - The Connector assumes that a column in the table that does not
  have a matching field in the record is not required.
- Warn on Unmatched Columns - The Connector assumes that a column in the table that does not
  have a matching field in the record is not required, and the Connector logs a warning.
- Fail on Unmatched Columns - (the default) A flow fails when a column exists in the table and
  there is no matching field in the record. The Connector also logs an error.
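The two properties can be read as a single decision about whether a record may be loaded. The Python sketch below models that decision under the assumption that field names have already been translated to column names; the function check_mapping and the logger name are hypothetical and do not correspond to Connector code.

```python
import logging

log = logging.getLogger("unmatched_sketch")  # hypothetical logger name

def check_mapping(field_names, column_names,
                  unmatched_field="Ignore Unmatched Fields",
                  unmatched_column="Fail on Unmatched Columns"):
    """Return True if the record may be loaded, False if the FlowFile should fail."""
    fields, columns = set(field_names), set(column_names)
    extra_fields = sorted(fields - columns)      # fields with no table column
    missing_columns = sorted(columns - fields)   # columns with no record field

    if extra_fields and unmatched_field == "Fail on Unmatched Fields":
        return False
    if missing_columns:
        if unmatched_column == "Fail on Unmatched Columns":
            log.error("no field for columns: %s", missing_columns)
            return False
        if unmatched_column == "Warn on Unmatched Columns":
            log.warning("no field for columns: %s", missing_columns)
    return True

# Defaults: extra fields are ignored, missing columns fail the flow.
print(check_mapping(["dept_id", "note"], ["dept_id"]))
print(check_mapping(["dept_id"], ["dept_id", "month"]))
```

The defaults are deliberately asymmetric: an extra field is dropped silently, while a table column with no matching field stops the load.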
Specifying Failure Rollback Behavior
The Connector distinguishes between the transient and the non-recoverable errors that it encounters.
A transient error is one where the operation may succeed on a later retry, such as a failed connection
attempt to Greenplum Database. Conversely, a FlowFile that contains bad input data continues to fail
when retried.
The Connector applies success or failure at the FlowFile level. That is, the Connector considers a write
operation successful if all records in a single FlowFile are written to the Greenplum Database table with no
errors. If a single record in the FlowFile fails to write for some reason (say the data is malformed), none of
the records in the FlowFile are written to Greenplum, and the Connector considers the operation failed.
Rollback On Failure is a boolean property that specifies whether or not the Connector rolls back the NiFi
session when it encounters a failure processing a FlowFile.
The default Rollback On Failure setting is false. When the Connector encounters an error while
processing a FlowFile, the FlowFile is routed to the failure or retry relationship based on the error type,
and the processor continues processing the next FlowFile.
When Rollback On Failure is true, the Connector:
- Stops further processing of the FlowFile when it encounters an error,
- Rolls back the NiFi session; this penalizes the FlowFile and returns it to the incoming queue, and
- Continues processing the next FlowFile.
The rolled back FlowFile may be processed repeatedly by the Connector until it is processed successfully
or removed by other means.
Be sure to set an adequate SETTINGS Yield Duration for the processor to avoid retrying too frequently.
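The routing rules in this section can be summarized as a small decision function. The Python sketch below is an illustration only: the exception classes are hypothetical stand-ins for the two error categories, and it assumes that transient errors go to the retry relationship and non-recoverable errors go to failure, which the text implies but does not state outright.

```python
class TransientError(Exception):
    """Hypothetical marker for an error that may clear on a later retry."""

class NonRecoverableError(Exception):
    """Hypothetical marker for an error that retrying cannot fix."""

def route_flowfile(error=None, rollback_on_failure=False):
    """Return where a FlowFile ends up after the processor handles it."""
    if error is None:
        return "success"
    if rollback_on_failure:
        # The session is rolled back: the FlowFile is penalized and
        # returned to the incoming queue, to be picked up again later.
        return "rollback"
    if isinstance(error, TransientError):
        return "retry"       # assumption: transient errors are retried
    return "failure"         # assumption: everything else fails

print(route_flowfile())
print(route_flowfile(TransientError("connection refused")))
print(route_flowfile(NonRecoverableError("bad input"), rollback_on_failure=True))
```

Note that with Rollback On Failure set to true, a FlowFile with bad input data keeps returning to the queue, which is why the retry frequency matters.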
Choosing a Maximum Record Batch Size
For each FlowFile it receives, the Connector:
- Opens and prepares the table for writing,
- Performs one or more writes, and
- Closes/commits the write.
The maximum number of records in a write call that the Connector makes to the Greenplum Streaming
Server is determined by the Maximum Record Batch Size that you specify for the processor.
The default value is zero (0); there is no limit on the batch size, and the Connector accumulates all FlowFile
content in memory before it writes to Greenplum in a single transaction.
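The effect of the batch size on the number of write calls can be sketched as follows. This Python generator only models how records might be grouped; the function name batches is hypothetical, and the sketch says nothing about how the Connector actually buffers data or talks to the Greenplum Streaming Server.

```python
def batches(records, max_batch_size=0):
    """Yield lists of records, each holding at most max_batch_size items.

    A max_batch_size of 0 means no limit: every record goes into one batch.
    """
    if max_batch_size < 0:
        raise ValueError("max_batch_size must be zero or positive")
    if max_batch_size == 0:
        yield list(records)
        return
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:                 # flush the final, partially filled batch
        yield batch

print(list(batches(range(5), 2)))
print(list(batches(range(5))))
```

A smaller batch size bounds the memory held per write at the cost of more write calls for each FlowFile.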
About the Record Reader
The PutGreenplumRecord Record Reader property identifies the controller service that the processor uses
to deserialize incoming data into NiFi records. You select an appropriate reader based on the format of the
input data.
(This older Record-Oriented Data with NiFi blog describes the Apache NiFi processors and controller
services available for working with record-oriented data.)
You configure a PutGreenplumRecord Record Reader via the processor configuration dialog,
PROPERTIES tab. You can also add a new reader instance via the Operate Palette configuration dialog,
CONTROLLER SERVICES tab.
Note: The default Maximum Record Batch Size behavior (no limit on the batch size) may be inefficient
for FlowFiles that consist of large and/or many records.
The Connector supports all compatible Record Reader controller services, and has been specifically tested
with certain data formats. These readers, data formats, and their schema origins are identified in the table
below.
Reader Name | Data Format | Schema | Description
AvroReader | Avro | Embedded in the Avro data, obtained from a schema registry, or explicitly specified. | Each Avro record is deserialized to a NiFi record.
CSVReader | CSV | Inferred from the data, obtained from a schema registry, or explicitly specified. | Each row is deserialized to a NiFi record.
JsonTreeReader | JSON | Inferred from the data, obtained from a schema registry, or explicitly specified. | Each JSON record is deserialized to a NiFi record.
ParquetReader | Parquet | Embedded in the Parquet data. | Each Parquet record is deserialized to a NiFi record.
XMLReader | XML | Inferred from the data, obtained from a schema registry, or explicitly specified. | The second level (within enclosing root tag) of XML data is deserialized into a NiFi record.
About the Greenplum Adapter
The PutGreenplumRecord Greenplum Adapter property identifies the controller service that specifies and
manages the connection to Greenplum Database. The only compatible controller service for this function is
named GreenplumGPSSAdapter.
You configure a PutGreenplumRecord Greenplum Adapter via the processor configuration dialog,
PROPERTIES tab. You can also add a new adapter instance via the Operate Palette configuration dialog,
CONTROLLER SERVICES tab.
When you configure the GreenplumGPSSAdapter controller service, you identify a Greenplum Streaming
Server instance host and port. You also specify Greenplum Database connection properties including
master host and port, database and user names, and user password.
While all connection properties are required, the Connector utilizes default values for many.
Property Name | Description | Default Value
Greenplum Streaming Server Host | The name or IP address of the host on which the GPSS instance is running. Required. |
Greenplum Streaming Server Port | The GPSS port number. Required. | 5000
Greenplum Database Master Host | The name or IP address of the Greenplum Database master host. Required. |
Greenplum Database Master Port | The Greenplum master server port number. Required. | 5432
Greenplum Database Name | The name of the Greenplum database. Required. | postgres
Greenplum User Name | The Greenplum user/role name to use to connect to the database. | gpadmin
Greenplum User Password | The password for the Greenplum user/role. Required. |
You can re-use a Greenplum Adapter controller service for any dataflows that use the specified Greenplum
Streaming Server instance to load data to the same Greenplum master host and database as the specified
user.
Example - Loading CSV Data from the File System
In this example, you use the VMware Greenplum Connector for Apache NiFi to load CSV-format data into
Greenplum Database.
The CSV data represents department expense records, and includes department identifier (integer), month
(integer), and expenses (decimal) fields. For example, a record for a department with identifier 123 that
spent $456.78 in the month of September follows:
"123","09","456.78"
A record with the same department identifier and month identifies a new expense total for the month,
replacing the previous amount.
You will use the Apache NiFi user interface to create a dataflow between the GetFile and
PutGreenplumRecord processors.
In this flow:
- The GetFile processor reads CSV files from the /tmp/gcan_data directory on the NiFi system
  and generates record-based FlowFiles.
- The PutGreenplumRecord processor writes the data that it receives to a Greenplum Database table
  named gcan_dept_expense located in the public schema of a database named testdb.
- The department identifier and month fields together uniquely identify a table row.
- The write to Greenplum should specify the MERGE operation type; if an entry for the
  department/month does not exist, insert a new row into the table. If an entry for the department for
  the month already exists, replace the expenses amount with the new value.
- You will explicitly specify the input FlowFile schema.
Prerequisites
Before you start this procedure, ensure that you:
- Have access to a running Greenplum Database cluster.
- Have access to a running Greenplum Streaming Server instance, or the privileges required to start
  an instance.
- Have met the Prerequisites identified in the Loading topic.
For simplicity, this example assumes that Apache NiFi, Greenplum Database, and the Greenplum
Streaming Server are running on the same host.
Process
Step 1: Prepare the Example Environment
Step 2: Add and Configure the GetFile Processor
Step 3: Configure a GreenplumGPSSAdapter Controller Service
Step 4: Identify the Input Data Source, Format, and Schema
Step 5: Configure a Record Reader Controller Service
Step 6: Add and Configure the PutGreenplumRecord Processor
Step 7: Connect and Start the Processors
Step 8: Create the Greenplum Database and Table
Step 9: Trigger the Flow and Check Results
Step 1: Prepare the Example Environment
In this step, you create sample data files.
1. Log in to your Apache NiFi client system.
$ ssh user@nifihost
user@nifihost$
2. Create a working directory. For example:
user@nifihost$ mkdir gcan_work
user@nifihost$ cd gcan_work
3. Prepare some sample data:
1. Write some data into a CSV file named sample1.csv:
user@nifihost$ echo '"dept_id","month","expenses"
"1313131","12","1313.13"
"3535353","11","761.35"
"7979797","10","4489.00"
"7979797","11","18.72"
"3535353","10","6001.94"
"7979797","12","173.18"
"1313131","10","492.83"
"3535353","12","81.12"
"1313131","11","368.27"' > sample1.csv
2. Write some data into a CSV file named sample2.csv:
user@nifihost$ echo '"dept_id","month","expenses"
"1313131","11","555.55"
"7979797","10","5555.55"
"2222222","12","22.22"' > sample2.csv
The data added to this file represents an expense for a new department (2222222), and
new/updated expense values for two existing departments/months.
4. Create an input directory and set the appropriate permissions:
user@nifihost$ mkdir /tmp/gcan_data
user@nifihost$ chmod a+rwx /tmp/gcan_data
You will copy the sample data files to the input directory later in this procedure.
5. Start the Apache NiFi user interface. For example, if your NiFi server is running on the local host on
port number 9050, enter the following in a web browser window:
http://localhost:9050
Step 2: Add and Configure the GetFile Processor
Perform the following steps to add and configure a GetFile processor instance:
1. Click the Processors icon in the Apache NiFi components toolbar and drag it to the canvas.
This action opens the Add Processor dialog.
2. Search for the GetFile Processor by typing in the Filter field.
3. Click Add.
This action adds a GetFile processor component to the canvas.
4. Right-click on the component and select Configure from the context menu.
This action displays the Configure Processor dialog.
5. Select the PROPERTIES tab.
1. Locate the Input Directory property and set the Value to the directory that you created in
Step 1, /tmp/gcan_data.
2. Click OK.
3. APPLY the Configure Processor changes.
Step 3: Configure a GreenplumGPSSAdapter Controller Service
Perform the steps below to configure an instance of the GreenplumGPSSAdapter controller service named
GreenplumGPSSAdapter-testdb:
1. Click on an empty area in the Apache NiFi canvas.
2. Click on the configure icon in the Operate Palette.
This action opens the NiFi Flow Configuration dialog.
3. Select the CONTROLLER SERVICES tab.
4. Click the + icon to add a new controller service.
This action opens the Add Controller Service dialog.
5. Type Greenplum in the Filter field, select the GreenplumGPSSAdapter entry, and click ADD.
This action adds a GreenplumGPSSAdapter row to the table of currently defined controller services,
and selects this row.
6. Click on the configure icon in the last column of the table to configure the service.
This action opens the Configure Controller Service dialog.
7. Select the SETTINGS tab, locate the Name field, and set the name to GreenplumGPSSAdapter-testdb.
8. Select the PROPERTIES tab, locate the properties identified in the table below, and set each Value
as specified:
Property Name | Value | Comments
Greenplum Streaming Server Host | localhost | Enter your host
Greenplum Streaming Server Port | 5000 | Retain the default
Greenplum Database Master Host | localhost | Enter your Greenplum master host
Greenplum Database Master Port | 5432 | Retain the default
Greenplum Database Name | testdb |
Greenplum Database User Name | gpadmin | You can choose a different Greenplum user
Greenplum Database User Password | changeme | Enter the password
9. APPLY the Configure Controller Service changes.
10. Click the thunderbolt icon in the GreenplumGPSSAdapter-testdb row to enable the controller
service.
The Enable Controller Service dialog displays.
1. Click the ENABLE button.
2. Click the CLOSE button.
11. Click X in the upper right-hand corner of the dialog to close the NiFi Flow Configuration window.
Step 4: Identify the Input Data Source, Format, and Schema
The source of the data is the GetFile processor, and the data format is CSV.
Because the CSV file includes a header row, you could choose to have Apache NiFi infer the schema. For
this exercise, you will explicitly define and specify the schema.
As described above, the CSV data represents department expense records, and includes department
identifier (integer), month (integer), and expenses (decimal) fields:
"123","09","456.78"
The schema that corresponds to records of this format follows:
{
"name": "dept_expense_record",
"namespace": "nifi_csv_example",
"type": "record",
"fields": [
{ "name": "dept_id", "type": ["int", "null"] },
{ "name": "month", "type": ["int", "null"] },
{ "name": "expenses", "type": {"type": "bytes", "logicalType": "decimal", "precision": 11, "scale": 2 } }
]
}
You will specify this schema when you configure a record reader controller service for a
PutGreenplumRecord processor instance.
Step 5: Configure a Record Reader Controller Service
Perform the steps below to configure an instance of a CSV record reader controller service named
CSVReader-dept-expenses:
1. Click on an empty area in the Apache NiFi canvas.
2. Click on the configure icon in the Operate Palette.
This action opens the NiFi Flow Configuration dialog.
3. Select the CONTROLLER SERVICES tab.
4. Click the + icon to add a new controller service.
This action opens the Add Controller Service dialog.
5. Type CSV in the Filter field, select the CSVReader entry, and click ADD.
This action adds a CSVReader row to the table of currently defined controller services, and selects
this row.
6. Click on the configure icon in the last column of the table to configure the service.
This action opens the Configure Controller Service dialog.
7. Select the SETTINGS tab, locate the Name field, and set the name to CSVReader-dept-expenses.
8. Select the PROPERTIES tab, locate the properties identified in the table below, and set each Value
as specified:
Property Name | Value | Comments
Schema Access Strategy | Use ‘Schema Text’ Property | The Schema Text property value will specify the schema definition
Treat First Line as Header | true | The first line of the file is the header
9. Locate the Schema Text property, and copy/paste the schema definition below into the Value field:
{
"name": "dept_expense_record",
"namespace": "nifi_csv_example",
"type": "record",
"fields": [
{ "name": "dept_id", "type": ["int", "null"] },
{ "name": "month", "type": ["int", "null"] },
{ "name": "expenses", "type": {"type": "bytes", "logicalType": "decimal",
"precision": 11, "scale": 2 } }
]
}
10. Retain the default values for the other properties.
11. APPLY the Configure Controller Service changes.
12. Click the thunderbolt icon in the CSVReader-dept-expenses row to enable the controller service.
The Enable Controller Service dialog displays.
1. Click the ENABLE button.
2. Click the CLOSE button.
13. Click X in the upper right-hand corner of the dialog to close the NiFi Flow Configuration window.
Step 6: Add and Configure the PutGreenplumRecord Processor
Perform the following steps to add and configure a PutGreenplumRecord processor instance:
1. Click the Processors icon in the Apache NiFi components toolbar and drag it to the canvas.
This action opens the Add Processor dialog.
2. Search for the PutGreenplumRecord Processor by typing in the Filter field.
3. Click Add.
This action adds a PutGreenplumRecord Processor component to the canvas.
4. Right-click on the component and select Configure from the context menu.
This action displays the Configure Processor dialog.
5. Select the SETTINGS tab.
6. Automatically terminate all relationships by checking the failure, retry, and success checkboxes.
7. Select the PROPERTIES tab.
8. Locate the Record Reader property. Click in the Value field, select CSVReader-dept-expenses
from the drop-down menu, and click OK.
9. Locate the Greenplum Adapter property. Click in the Value field, select GreenplumGPSSAdapter-testdb
from the drop-down menu, and click OK.
10. Locate the properties identified in the table below and set each Value as specified:
Property Name | Value | Comments
Schema Name | public | Retain the default
Table Name | gcan_dept_expense | You will create this table in Step 8
Operation Type | MERGE | Merge can both insert and update a table row
Match Columns | dept_id, month | A table row is uniquely identified by these column values
Translate Field Names | true | Retain the default
Unmatched Field Behavior | Ignore Unmatched Fields | Retain the default
Unmatched Column Behavior | Warn on Unmatched Columns | Log a warning message
Rollback On Failure | false | Retain the default
Maximum Record Batch Size | 100 |
11. APPLY the Configure Processor changes.
Step 7: Connect and Start the Processors
In this step, you create a connection between the GetFile and PutGreenplumRecord processors on the
canvas, and then start the processors.
1. Hover over the GetFile component on the canvas.
2. Click the arrow icon and drag over to the PutGreenplumRecord component.
This action displays the Create Connection dialog.
3. No configuration is required; click ADD to create the connection.
A line/box that represents the connection is displayed on the NiFi canvas.
4. Right-click on the GetFile component and select Start from the context menu to start the
processor.
The icon next to the processor name changes to a green sideways triangle.
5. Right-click on the PutGreenplumRecord component and select Start from the context menu to start
the processor.
The icon next to the processor name changes to a green sideways triangle.
Step 8: Create the Greenplum Database and Table
In this step, you create the Greenplum database testdb if it does not yet exist, and create the target
Greenplum table.
1. Open a new terminal window, log in to the Greenplum Database master host as the gpadmin
administrative user, and set up your Greenplum environment. For example:
$ ssh gpadmin@gpmaster
gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
2. Create a database named testdb if one does not already exist:
gpadmin@gpmaster$ createdb testdb
3. Start the psql subsystem:
gpadmin@gpmaster$ psql -d testdb
4. The Greenplum Streaming Server must be registered in the database to use the Connector. You
can register the Greenplum Streaming Server as follows:
testdb=# CREATE EXTENSION IF NOT EXISTS gpss;
This command registers the extension only if it has not been previously registered.
5. Create the target Greenplum Database table named gcan_dept_expense:
testdb=# CREATE TABLE gcan_dept_expense( dept_id int8, month int8, expenses decimal(11,2) );
This table definition matches the input data schema that you specified for the record reader in Step
5.
6. Remain in the psql session; you will return to it in the next step.
Step 9: Trigger the Flow and Check Results
You will individually copy the sample data files to /tmp/gcan_data on the Apache NiFi system to trigger the
flow. You will check the results by observing the Apache NiFi user interface and querying the Greenplum
table.
You will also generate a sample file with bad data, trigger the flow, and check the results.
1. Copy the sample1.csv data file to the input directory:
user@nifihost$ cp gcan_work/sample1.csv /tmp/gcan_data/
2. Examine the GetFile and PutGreenplumRecord processor components on the NiFi canvas, and
notice when their statistics update.
3. Examine the contents of the Greenplum Database table. Enter the following command in the psql
terminal session that you used earlier:
testdb=# SELECT * FROM gcan_dept_expense ORDER BY dept_id, month;
dept_id | month | expenses
---------+-------+----------
1313131 | 10 | 492.83
1313131 | 11 | 368.27
1313131 | 12 | 1313.13
3535353 | 10 | 6001.94
3535353 | 11 | 761.35
3535353 | 12 | 81.12
7979797 | 10 | 4489.00
7979797 | 11 | 18.72
7979797 | 12 | 173.18
(9 rows)
4. Copy the sample2.csv data file to the input directory:
user@nifihost$ cp gcan_work/sample2.csv /tmp/gcan_data/
5. Wait until the flow between the GetFile and PutGreenplumRecord processor components is triggered.
6. Query the table again:
testdb=# SELECT * FROM gcan_dept_expense ORDER BY dept_id, month;
dept_id | month | expenses
---------+-------+----------
1313131 | 10 | 492.83
1313131 | 11 | 555.55
1313131 | 12 | 1313.13
2222222 | 12 | 22.22
3535353 | 10 | 6001.94
3535353 | 11 | 761.35
3535353 | 12 | 81.12
7979797 | 10 | 5555.55
7979797 | 11 | 18.72
7979797 | 12 | 173.18
(10 rows)
Notice the new row for department 2222222, and the updated expenses values for department
1313131, month 11 and department 7979797, month 10.
7. Write a sample file with bad input data directly to the input directory:
user@nifihost$ echo '"dept_id","month","expenses"
"1313131","12","12222.22"
"7979797","zz","5555.55"' > /tmp/gcan_data/sample3.csv
This data includes the value zz in what should be an int field.
8. Observe the NiFi canvas and wait for the flow to trigger. Notice that the PutGreenplumRecord
processor canvas component eventually displays a red box in the right-hand corner. Hover over the
red box and view the warning message. The processor generates a NumberFormatException when
attempting to write the second record to the Greenplum table.
9. Query the table again. In this query, filter on the department identifier in the first record of the
sample3.csv data file to display only the table rows associated with that department:
testdb=# SELECT * FROM gcan_dept_expense WHERE dept_id=1313131 ORDER BY month;
dept_id | month | expenses
---------+-------+----------
1313131 | 10 | 492.83
1313131 | 11 | 555.55
1313131 | 12 | 1313.13
(3 rows)
Notice that the first record in sample3.csv, even though correctly formatted, was not written to the
table. The Connector must process all records in the FlowFile successfully before it will write the
batch to Greenplum Database.
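The all-or-nothing behavior seen with sample3.csv can be modeled in a few lines of Python. In this sketch, every record is converted and staged before anything is added to the table, so a single bad value leaves the table untouched. The function load_flowfile and the in-memory table are hypothetical; Python raises ValueError where the processor reports a NumberFormatException.

```python
import csv
import io
from decimal import Decimal

SAMPLE3 = '''"dept_id","month","expenses"
"1313131","12","12222.22"
"7979797","zz","5555.55"
'''

def load_flowfile(text, table):
    """Append all records to table, or none of them if any record is invalid."""
    staged = []
    for row in csv.DictReader(io.StringIO(text)):
        staged.append({
            "dept_id": int(row["dept_id"]),     # raises ValueError on "zz"
            "month": int(row["month"]),
            "expenses": Decimal(row["expenses"]),
        })
    table.extend(staged)                        # reached only if every row converted
    return len(staged)

table = []
try:
    load_flowfile(SAMPLE3, table)
except ValueError as err:
    print("load failed, nothing written:", err)

print(len(table))
```

The first record converts cleanly but is never written, which mirrors the query result above.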