Name: Tanzu Greenplum Connector for Apache NiFi PDF
Author: j.taylor

Tanzu Greenplum Connector for Apache NiFi

Tanzu Greenplum Connector for Apache NiFi 1.1

You can find the most up-to-date technical documentation on the VMware by Broadcom website at:

https://techdocs.broadcom.com/

VMware by Broadcom

3401 Hillview Ave.

Palo Alto, CA 94304

www.vmware.com

subsidiaries. For more information, go to https://www.broadcom.com. All trademarks, trade names, service marks,

and logos referenced herein belong to their respective companies.

Contents

VMware Greenplum Connector for Apache NiFi 1.1 Documentation
VMware Greenplum Connector for Apache NiFi 1.x Release Notes
    Supported Platforms
    Upgrading to Version 1.x
    Release 1.1.1
    Resolved Issues
    Release 1.1.0
    Changes
    Resolved Issues
    Release 1.0.1
    Changes
    Release 1.0.0
    Features
    Known Issues and Limitations
Real-Time Data Ingestion with Greenplum Database and Apache NiFi
    Next Steps
Installing the Connector
    Prerequisites
    Downloading the Connector Package
    Registering the Connector with Apache NiFi
Upgrading the Connector
    Prerequisites
    Downloading the Connector Package
    Registering the Connector with Apache NiFi
Using the Apache NiFi User Interface
    Launching the Apache NiFi User Interface
    Creating a DataFlow
    Adding a Processor to the Canvas
    About the Context Menu
    Connecting Processors in a DataFlow
    Starting the DataFlow
    Additional Apache NiFi References
Loading Data with the Connector
    Prerequisites
    Configuring the Greenplum Adapter Controller Service
    Identifying the Input Data Format and Schema
    About the Data Schema
    Configuring the Record Reader Controller Service
    Creating the Target Greenplum Table
    Building the DataFlow
    Starting the DataFlow
    Checking the Load Operation Results
Configuring the Connector
    PutGreenplumRecord Settings
    PutGreenplumRecord Schedule
    PutGreenplumRecord Properties
    About the Insert, Merge, and Update Properties
    Specifying Field and Column Mapping Properties
    Specifying Failure Rollback Behavior
    Choosing a Maximum Record Batch Size
About the Record Reader
About the Greenplum Adapter
Example - Loading CSV Data from the File System
    Prerequisites
    Process
    Step 1: Prepare the Example Environment
    Step 2: Add and Configure the GetFile Processor
    Step 3: Configure a GreenplumGPSSAdapter Controller Service
    Step 4: Identify the Input Data Source, Format, and Schema
    Step 5: Configure a Record Reader Controller Service
    Step 6: Add and Configure the PutGreenplumRecord Processor
    Step 7: Connect and Start the Processors
    Step 8: Create the Greenplum Database and Table
    Step 9: Trigger the Flow and Check Results

VMware Greenplum Connector for Apache NiFi 1.1 Documentation

This documentation describes how to install, configure, and use the VMware Greenplum Connector for
Apache NiFi.

Key topics in the VMware Greenplum Connector for Apache NiFi Documentation include:

Release Notes

Overview of the Connector

Installing the Connector

Using the Apache NiFi User Interface

Loading Data with the Connector

Configuring the Connector

Example: Loading CSV Data from the File System

VMware Greenplum Connector for Apache NiFi 1.x Release Notes

You can use the VMware Greenplum Connector for Apache NiFi in an Apache NiFi dataflow to load
record-oriented data from any source into VMware Greenplum. The Connector uses the VMware Greenplum
Streaming Server to load the data in parallel.

Supported Platforms

The following table identifies the supported component versions for the VMware Greenplum Connector for
Apache NiFi version 1.x:

Connector Version   Apache NiFi Version      Greenplum Version   Greenplum Streaming Server Version
1.1.0, 1.0.1        1.10.x, 1.11.x, 1.12.x   6.x                 1.4.1+ [1]
1.0.0               1.10.x, 1.11.x, 1.12.x   6.x                 1.4.1 - 1.6.0

[1] Refer to Known Issues and Limitations for specific version caveats.

Component documentation references:

Apache NiFi

VMware Greenplum

VMware Greenplum Streaming Server

Upgrading to Version 1.x

If you are currently using the VMware Greenplum Connector for Apache NiFi, you may be required to
perform upgrade actions for this release. Review Upgrading the Connector to plan your upgrade.

Release 1.1.1

Release Date: December 20, 2023

Resolved Issues

VMware Greenplum Connector for Apache NiFi 1.1.1 resolves these issues:

N/A     Resolves CVE-2022-42889 by ceasing shipment of the Apache commons-text library.
N/A     Resolves CVE-2023-1428 and CVE-2023-32731 by updating the gRPC libraries (and their
        transitive dependencies) to version 1.53.0.

Release 1.1.0

Release Date: September 9, 2022

Changes

When a column in the target Greenplum Database table has no matching field in an incoming NiFi record
and the PutGreenplumRecord processor Unmatched Column Behavior configuration setting specifies
either Ignore Unmatched Columns or Warn on Unmatched Columns, the VMware Greenplum Connector for
Apache NiFi sends no column value for the record to GPSS. If the Greenplum table definition includes a
default value for the column, Greenplum Database writes the tuple with the default column value; otherwise
Greenplum writes the tuple with a NULL column value.

Resolved Issues

VMware Greenplum Connector for Apache NiFi 1.1.0 resolves these issues:

32345   Resolves an issue where a default column value was not applied when a NiFi record had no
        matching field; see Changes above.

Release 1.0.1

Release Date: June 14, 2022

Changes

VMware Greenplum Connector for Apache NiFi version 1.0.1 adds support for Greenplum Streaming Server
versions 1.6.1+.

Release 1.0.0

Release Date: December 16, 2020

Features

The VMware Greenplum Connector for Apache NiFi 1.0.0 release supports inserting, merging, and updating
record-oriented data in Greenplum Database. You can set up an Apache NiFi dataflow that sends Avro,
CSV, Parquet, JSON, or XML data to the Connector PutGreenplumRecord processor to write to a
Greenplum table.

Known Issues and Limitations

The VMware Greenplum Connector for Apache NiFi 1.0.x has these known issues and limitations:

- Using the Connector with Greenplum Streaming Server versions prior to 1.7.1 to load
  timestamp without time zone data into Greenplum Database may expose a data type conversion
  issue. This issue is resolved in Greenplum Streaming Server version 1.7.1; be sure to use the
  VMware Greenplum Connector for Apache NiFi version 1.0.1+ should you encounter this scenario.

- If you run the Connector against Greenplum Streaming Server version 1.4.x or older, you must
  specify "ReuseTables": false in the gpss.json server configuration file that you use to start
  the Greenplum Streaming Server.

- The Connector does not support SSL-encrypted connections to the Greenplum Streaming Server.

- A dataflow that routes Avro data from the QueryDatabaseTableRecord processor to the
  PutGreenplumRecord processor must set the Use Avro Logical Types property to true to send date
  and time data in the format supported by the Connector.

- Apache NiFi bug NIFI-7817 causes the built-in ParquetReader record reader to fail when you use
  it with Apache NiFi version 1.12.

- Decimal types may lose precision when you use the Connector with Apache NiFi versions 1.10 and
  1.11.
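
As an illustration of the ReuseTables limitation above, the following Python sketch sets the property in a gpss.json configuration that has been loaded as a dictionary. It assumes that ReuseTables belongs in the Gpfdist block of the file, which you should confirm in the Greenplum Streaming Server documentation for your version. The host and port values are placeholders, not recommended settings.

```python
import json


def disable_table_reuse(config: dict) -> dict:
    """Return a copy of a gpss.json configuration with ReuseTables set to false."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    # Assumption: ReuseTables is a property of the "Gpfdist" block in gpss.json.
    updated.setdefault("Gpfdist", {})["ReuseTables"] = False
    return updated


# Hypothetical starting configuration; the host and port values are placeholders.
example = {
    "ListenAddress": {"Host": "", "Port": 5000},
    "Gpfdist": {"Host": "", "Port": 8319},
}

print(json.dumps(disable_table_reuse(example), indent=4))
```

The function leaves every other setting untouched, so it can be applied to an existing configuration before you restart the Greenplum Streaming Server with the updated file.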

Real-Time Data Ingestion with Greenplum Database and Apache NiFi

VMware Greenplum is a massively parallel processing database server specially designed to manage
large-scale analytic data warehouses and business intelligence workloads. Apache NiFi is a framework
that provides an interactive user interface through which you create and manage automated dataflows
between systems. The VMware Greenplum Connector for Apache NiFi provides organizations a fast and
simple way to build data ingestion pipelines for Greenplum Database, code-free.

You can use the web-based Apache NiFi user interface and built-in NiFi processors to set up a data
pipeline that employs the Connector’s PutGreenplumRecord processor to load record-oriented data into
Greenplum Database for subsequent analytics.

The Connector:

- Utilizes the drag-and-drop-based Apache NiFi user interface for component and data pipeline
  configuration.
- Supports CSV, Avro, Parquet, JSON, and XML input data formats using built-in NiFi Record
  Readers.
- Converts NiFi records into Greenplum tuples.
- Loads the tuples into Greenplum Database.

The VMware Greenplum Connector for Apache NiFi uses the Greenplum Streaming Server to load data in
parallel into Greenplum Database. This facilitates higher concurrency and throughput during data
ingestion compared to a JDBC-based NiFi processor, with less load on the Greenplum Database master
host.

Next Steps

Install the Connector and register it with Apache NiFi.

Review an introduction to the Apache NiFi user interface in Using the Apache NiFi User Interface.

Examine the data load procedure described in Loading Data with the Connector.

Try out the Loading CSV Data from the File System example.

Installing the Connector

The VMware Greenplum Connector for Apache NiFi is available as a separate download for VMware
Greenplum 6.x from the Broadcom Support Portal.

Prerequisites

You can run a single instance of Apache NiFi, or run NiFi in a clustered environment. Before installing
the Connector, ensure that you meet the following prerequisites:

- Apache NiFi requires Java version 8 or 11. Install Java 8 or 11 on your NiFi host(s).
- You have installed Apache NiFi on your single or cluster host(s). Refer to Downloading and
  Installing NiFi in the Apache NiFi documentation for instructions.
- You have administrative access to the NiFi host(s).

Downloading the Connector Package

The Connector is available as a separate download for Greenplum Database 6.x from the Broadcom
Support Portal. The Connector download package is a .tar.gz file; it includes the Apache NiFi NAR
files for the Connector and an installation script.

Perform these steps to download the Connector package:

1. Navigate to the Greenplum Database product on the Broadcom Support Portal, and select
   Greenplum Connector for Apache NiFi 1.1.0 under the desired Greenplum release.

   The format of the Connector download file name is
   greenplum-connector-apache-nifi-<version>.tar.gz. For example:

   greenplum-connector-apache-nifi-1.1.0.tar.gz

   For more information about download prerequisites, troubleshooting, and instructions, see
   Download Broadcom products and software.

2. Make note of the directory to which the file was downloaded.

   Note: This documentation uses the environment variable $NIFI_HOME to identify the base
   directory of the Apache NiFi installation on the system. You may choose to set this
   environment variable in your shell login startup script on the host(s). For example:
   export NIFI_HOME=/usr/local/nifi.

3. Extract the Connector download package. For example:

   $ mkdir gpnifi_work
   $ cd gpnifi_work
   $ tar xzf downloadir/greenplum-connector-apache-nifi-1.1.0.tar.gz

   This command extracts the following files and directories to the current working directory:

   File/Directory   Description
   commit.sha       The commit identifier for this Connector release.
   install.sh       A Connector install script that installs to $NIFI_HOME/lib.
   nars/            The directory containing the Connector NAR files.
   version          The version of this Connector release.

4. If you are running an Apache NiFi cluster, copy the Connector download package to all NiFi
   hosts, and extract as described.
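
The extraction step can also be scripted, which may be convenient when you repeat it on every host in a NiFi cluster. The following Python sketch is one possible approach, not part of the Connector package: it extracts the archive into a working directory and reports which of the entries listed in the table above are missing. The function name and the paths in the usage note are illustrative assumptions.

```python
import tarfile
from pathlib import Path

# Entries that the documentation says the package contains.
EXPECTED_ENTRIES = {"commit.sha", "install.sh", "nars", "version"}


def extract_connector_package(archive: Path, work_dir: Path) -> set:
    """Extract the package into work_dir and return the expected entries that are missing."""
    work_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(work_dir)
    present = {entry.name for entry in work_dir.iterdir()}
    return EXPECTED_ENTRIES - present
```

For example, calling extract_connector_package(Path("greenplum-connector-apache-nifi-1.1.0.tar.gz"), Path("gpnifi_work")) returns an empty set when all four entries are present after extraction.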

Registering the Connector with Apache NiFi

You register the Connector with Apache NiFi by copying the NAR files in nars/ to the Apache NiFi
installation on the host(s). Choose one of these options for registration:

- Run the install.sh script to copy the NAR files to $NIFI_HOME/lib. If you choose this option,
  you must restart Apache NiFi after you run the script, and you may be required to re-register
  the Connector NAR files after you upgrade Apache NiFi.

  1. Ensure that you have set the $NIFI_HOME environment variable, that it identifies your
     Apache NiFi installation directory, and that you have permission to write to this directory.

  2. Run the install script:

     $ ./install.sh
     Removing old Greenplum NiFi Connector artifacts ...
     Installing new Greenplum NiFi Connector (version 1.1.0) ...
     ...
     Successfully installed Greenplum NiFi Connector (version 1.1.0) into /usr/local/nifi/lib

     The script removes any previously-installed Connector artifacts in $NIFI_HOME/lib before
     it copies the contents of nars/* to that directory.

- Copy the NAR files to the Apache NiFi autoload directory (the directory specified by the
  nifi.nar.library.autoload.directory property value in the nifi.properties file). NiFi
  auto-loads NAR files it finds in this directory, and does not require a restart.

- Copy the NAR files to a location of your choosing and set this location in the
  nifi.nar.library.directory.<custom> property. You must restart Apache NiFi.

If you are running an Apache NiFi cluster, be sure to register the Connector NAR files on each NiFi host.
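
The autoload option can be sketched in Python as follows. The example reads the nifi.nar.library.autoload.directory property from nifi.properties and copies every .nar file from the extracted nars/ directory to that location. The function names are illustrative, and the handling of a relative directory value (resolved against the NiFi home directory) is an assumption that you should check against your installation.

```python
import shutil
from pathlib import Path


def read_properties(path: Path) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def copy_nars_to_autoload(nars_dir: Path, nifi_home: Path) -> list:
    """Copy the Connector NAR files to the NiFi autoload directory; return the copied names."""
    props = read_properties(nifi_home / "conf" / "nifi.properties")
    autoload = Path(props["nifi.nar.library.autoload.directory"])
    if not autoload.is_absolute():
        # Assumption: a relative value is relative to the NiFi installation directory.
        autoload = nifi_home / autoload
    autoload.mkdir(parents=True, exist_ok=True)
    copied = []
    for nar in sorted(nars_dir.glob("*.nar")):
        shutil.copy2(nar, autoload / nar.name)
        copied.append(nar.name)
    return copied
```

On a cluster, you would run the same copy on each NiFi host.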

Upgrading the Connector

If you are using the VMware Greenplum Connector for Apache NiFi in your current Apache NiFi
installation, you must perform some upgrade actions when you install a new version of the Connector:

1. Satisfy the prerequisites.
2. Download the new version of the Connector.
3. Register the new version of the Connector with Apache NiFi.

Prerequisites

You can run a single instance of Apache NiFi, or run NiFi in a clustered environment. Before you
upgrade the Connector, ensure that you meet the following prerequisites:

- You can identify the base directory of the Apache NiFi installation on the system (possibly
  $NIFI_HOME).
- You can identify the mode in which you are running Apache NiFi (single instance or clustered).
- You can identify the method by which you previously registered the Connector (ran install.sh,
  or copied the NAR files to the autoload directory or a custom directory).
- You have administrative access to the NiFi host(s).

Downloading the Connector Package

Download and unpack the new version of the Connector from the Broadcom Support Portal as described
in Downloading the Connector Package.

For more information about download prerequisites, troubleshooting, and instructions, see
Download Broadcom products and software.

Registering the Connector with Apache NiFi

You register a new version of the Connector with Apache NiFi by copying the new NAR files to the
Apache NiFi installation on the host(s).

Your registration tasks differ depending on how you initially registered the Connector. If you are
running an Apache NiFi cluster, be sure to register the Connector NAR files on each NiFi host.

If you ran the install.sh script to register the Connector:

1. Ensure that you have set the $NIFI_HOME environment variable, that it identifies your Apache NiFi
   installation directory, and that you have permission to write to this directory.

2. Run the install script:

   $ ./install.sh
   Removing old Greenplum NiFi Connector artifacts ...
   Installing new Greenplum NiFi Connector (version 1.1.0) ...
   ...
   Successfully installed Greenplum NiFi Connector (version 1.1.0) into /usr/local/nifi/lib

   The script removes any previously-installed Connector artifacts in $NIFI_HOME/lib before it
   copies the contents of nars/* to that directory.

3. Restart Apache NiFi.

If you copied the NAR files to the Apache NiFi autoload directory
(nifi.nar.library.autoload.directory):

1. Remove the previous version of the Connector NAR files from the directory.
2. Copy the new Connector NAR files to the autoload directory.

If you copied the NAR files to a location of your choosing (nifi.nar.library.directory.<custom>):

1. Remove the previous version of the Connector NAR files from the directory.
2. Copy the new Connector NAR files to the custom directory.
3. Restart Apache NiFi.
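
The remove-then-copy procedure for the autoload and custom directories can be sketched in Python as shown below. This is an illustration rather than a supplied tool. The Connector NAR file names are not listed in this documentation, so the caller passes a glob pattern for the old files; the pattern in the usage note is hypothetical and must be adjusted to the file names in your installation.

```python
import shutil
from pathlib import Path


def replace_connector_nars(new_nars_dir: Path, target_dir: Path, old_pattern: str) -> tuple:
    """Remove old Connector NAR files matching old_pattern, then copy in the new ones.

    Returns a (removed, copied) tuple of file name lists.
    """
    removed = []
    for old in sorted(target_dir.glob(old_pattern)):
        old.unlink()
        removed.append(old.name)

    copied = []
    for nar in sorted(new_nars_dir.glob("*.nar")):
        shutil.copy2(nar, target_dir / nar.name)
        copied.append(nar.name)
    return removed, copied
```

A call such as replace_connector_nars(Path("gpnifi_work/nars"), Path("/usr/local/nifi/extensions"), "*greenplum*.nar") uses a hypothetical pattern and target path. If the target is a custom directory rather than the autoload directory, restart Apache NiFi afterward as described above.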

Using the Apache NiFi User Interface

You run the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum
Connector for Apache NiFi to load data into Greenplum Database.

The Apache NiFi user interface operates on the following components:

- FlowFile: an object moving through the system; may be record-based.
- Processor: the interface through which NiFi provides access to a FlowFile; routes, transforms, or
  extracts information from a FlowFile.
- Relationship: one or more routes to which a FlowFile is transferred from a Processor.
- Connection: a link between Processors that represents one or more Relationships.
- Controller Service: an extension that provides information for use by other components; for
  example, a service to configure SSL, configure Greenplum Database connection properties, or
  serialize CSV data into a record-oriented format.

A dataflow that you create with the Apache NiFi user interface to load data into Greenplum will link
multiple processors, one of which will be the Connector PutGreenplumRecord processor.

This topic provides a basic introduction to using the Apache NiFi user interface to create a
dataflow. Refer to the Apache NiFi User Interface Documentation for detailed information about using
this interface.

Launching the Apache NiFi User Interface

The Apache NiFi user interface is an interactive interface through which you create and manage
automated dataflows. You run the NiFi user interface in a browser window by specifying the following
URL:

http://<nifi_hostname>:<nifi_port>/nifi

The default NiFi port number is 8080. To determine the port number for your installation, examine
the nifi.web.http.port property setting in the $NIFI_HOME/conf/nifi.properties file.

If you are running NiFi on your local system using the default port, enter the following URL in a
browser window to launch the NiFi user interface:

http://localhost:8080/nifi
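
If you prefer to derive the URL from the configuration rather than read the file by hand, a short Python sketch along these lines can help. It looks up nifi.web.http.port in a nifi.properties file and falls back to the default of 8080 stated above when the property is absent or empty. The function name is illustrative.

```python
from pathlib import Path


def nifi_ui_url(properties_file: Path, hostname: str = "localhost") -> str:
    """Build the NiFi user interface URL from the nifi.web.http.port property."""
    port = "8080"  # default port stated in this documentation
    for line in properties_file.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "nifi.web.http.port" and value.strip():
            port = value.strip()
    return f"http://{hostname}:{port}/nifi"
```

For example, nifi_ui_url(Path("/usr/local/nifi/conf/nifi.properties")) returns the URL for a NiFi instance on the local host, where the path assumes the example $NIFI_HOME value used in this documentation.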

When you start the NiFi user interface, you are presented with a canvas on which you create your
dataflow.

The Apache NiFi canvas includes:

- A Components Toolbar that consists of the components that you can drag and drop onto the NiFi
  canvas.
- An Operate Palette that includes buttons that allow you to manage the flow; you can configure,
  activate/deactivate, start/stop, or delete a component. You can also manage user access and
  configure system properties from this palette.
- A Status Bar that provides runtime information about the flow, including thread counts, data
  transfer amounts, and a refresh timestamp.
- A Navigate Palette that allows you to pan around the canvas.

The canvas also provides component search capabilities, and a global menu whose options allow you to
manipulate components on the canvas.

Creating a DataFlow

To create a dataflow, you drag and drop Processor components onto the NiFi canvas and then connect
them with a Connection component.

Adding a Processor to the Canvas

The Processor icon is located in the NiFi Components Toolbar.

To add a Processor to the canvas:

1. Drag a Processor component from the Components Toolbar to the canvas and drop it there. The
   Add Processor dialog displays.

2. Choose the Processor you want to add by scrolling through the list, selecting a search term
   from the left pane, or entering the processor name in the Filter field in the upper right
   corner of the dialog.

3. Select the desired Processor from the table, and double-click or press ADD.

The Add Processor dialog closes, and the Processor component is added to the canvas.

A processor component that you add to the canvas is in the stopped state.

Refer to Adding Components to the Canvas in the Apache NiFi documentation for detailed information
about adding a processor.

About the Context Menu

You most often interact with a component on the canvas via its context menu, which you display by
right-clicking the component. The menu options available vary based on the type of component and
your privileges, and include Configure, Start, and Enable/Disable items.

You can operate on one or more selected components on the canvas via the buttons on the Operate
Palette.

After you add a processor, you must Configure it; configuration properties are processor-specific.
For example, Configuring the Connector describes the configuration properties for the
PutGreenplumRecord processor.

Connecting Processors in a DataFlow

You initiate a relationship by connecting processors in a dataflow. Connect processors by hovering
over the source processor, clicking the connection icon (the green highlighted arrow) in the source
processor, and dragging and dropping it onto the destination processor.

The Create Connection dialog displays. You use the DETAILS and SETTINGS tabs on this dialog to
configure the connection, including its name, thresholds, prioritization, and load balance strategy.
Connecting Components in the Apache NiFi documentation describes the available connection
configuration properties.

A connection is represented on the NiFi canvas as an object between the processors, and includes a
line with a directed arrow from source to destination.

Starting the DataFlow

A processor component that you add to the canvas is in the stopped state. A processor must be
enabled and started before it can be triggered. The processor scheduling strategy determines when
and how a processor is triggered.

Start a component by right-clicking the component and selecting Start from the component context
menu. Or, start multiple components by selecting each component that you want to start, and
pressing the start button in the Operate Palette.

Additional Apache NiFi References

Check out these Apache NiFi documentation references for more detailed information about the framework:

Apache NiFi Overview

Getting Started with Apache NiFi

Apache NiFi User Guide

Loading Data with the Connector

You use the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum
Connector for Apache NiFi PutGreenplumRecord processor to load record-oriented data from any source
into Greenplum Database. You will perform the following tasks when you use the Connector to load
data into a Greenplum table:

1. Ensure that you meet the prerequisites.

2. Configure the Greenplum adapter controller service.

3. Identify the format and schema of the input data.

4. Configure the record reader controller service.

5. Create the target Greenplum Database table.

6. Build a dataflow that uses the PutGreenplumRecord processor.

7. Start the dataflow.

8. Check the load operation results.

Prerequisites

Before you set up a dataflow using the Connector, ensure that:

- You have access to a running Greenplum Database cluster, and you can identify the host name or
  IP address of the master host and the port number on which the master server is running if it
  is not the default port (5432).

  Note the Greenplum master host and port number.

- You can identify the host name or IP address and port number of a running Greenplum Streaming
  Server instance to which the Connector will direct load requests. Or, you configure and start a
  new streaming server instance as described in Configuring and Managing the Streaming Server in
  the Greenplum Streaming Server documentation.

  Be sure to note the host and port number of the Greenplum Streaming Server instance.

  Note: If you are running the Connector against Greenplum Streaming Server version 1.4.x or
  older, you must specify "ReuseTables": false in the gpss.json server configuration file that
  you use to start the Greenplum Streaming Server.

- You can identify:

  - The name of the Greenplum database.
  - The name of the Greenplum Database table you want to load data into, and the name of the
    schema in which it resides.
  - The user/role name and password that you will use to access Greenplum Database. This role
    must be assigned certain privileges to the Greenplum database, schema, and table as
    described in Configuring Greenplum Database Role Privileges in the Greenplum Streaming
    Server documentation.

- You have registered the Greenplum Streaming Server extension in the Greenplum database as
  described in Registering the GPSS Extension in the Greenplum Streaming Server documentation.

- You can identify the Apache NiFi server host and port number.

- Network connectivity exists between the Apache NiFi host(s) and the Greenplum Streaming Server
  host.
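
One way to check the network connectivity prerequisite from a NiFi host is to attempt a TCP connection to the Greenplum Streaming Server host and port that you noted. The Python sketch below does only that. A successful connection shows that the port is reachable, not that the listener is a correctly configured Greenplum Streaming Server. The host name and port in the usage note are placeholders.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("gpss-host.example.com", 5000) checks a hypothetical server address; substitute the values for your environment.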

Configuring the Greenplum Adapter Controller Service

The PutGreenplumRecord processor uses a controller service to manage the connection to Greenplum
Database.

In the Prerequisites section above, you identified the Greenplum Database master host and port
number, Greenplum Streaming Server host and port number, the Greenplum database, and the Greenplum
user/role name and password. You create and configure an instance of the GreenplumGPSSAdapter
controller service for each unique combination of these property settings as described in About the
Greenplum Adapter.

Identifying the Input Data Format and Schema

The PutGreenplumRecord processor can accept record-oriented data in the Avro, CSV, JSON, Parquet,
and XML formats, among others. The processor uses a Record Reader controller service to parse and
deserialize data in the incoming FlowFiles. The format of the data inside the FlowFile informs both
your choice of Record Reader, and the definition of the Greenplum Database table that the Connector
loads data into.

The Record Reader requires the schema of the input data in order to parse and deserialize it.

About the Data Schema

NiFi data records are described by a schema. The schema defines the names and types of the fields in
the input data records.

A NiFi schema definition is specified in Avro format. The example Avro schema below specifies a
record with three fields, and identifies the names and data types of the fields:

{
  "name": "datatypes_record",
  "type": "record",
  "fields": [
    { "name": "lastname", "type": ["string", "null"] },
    { "name": "age", "type": ["int", "null"] },
    { "name": "birthdate", "type": {"type":"int", "logicalType":"date"} }
  ]
}
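
To make the structure of the example schema concrete, the following Python sketch parses it with the standard json module and lists each field with its type and nullability. It also shows how a calendar date maps to the Avro date logical type, which the Avro specification defines as the number of days since 1970-01-01. The helper names are illustrative and are not part of NiFi or the Connector.

```python
import json
from datetime import date

SCHEMA = """
{
  "name": "datatypes_record",
  "type": "record",
  "fields": [
    { "name": "lastname", "type": ["string", "null"] },
    { "name": "age", "type": ["int", "null"] },
    { "name": "birthdate", "type": {"type":"int", "logicalType":"date"} }
  ]
}
"""


def describe_fields(schema_text: str) -> list:
    """Return (name, type, nullable) for each field in an Avro record schema."""
    described = []
    for field in json.loads(schema_text)["fields"]:
        field_type = field["type"]
        if isinstance(field_type, list):  # union such as ["string", "null"]
            nullable = "null" in field_type
            base = next(t for t in field_type if t != "null")
        else:
            nullable = False
            base = field_type
        if isinstance(base, dict):  # annotated type such as a logical date
            base = base.get("logicalType", base["type"])
        described.append((field["name"], base, nullable))
    return described


def avro_date(value: date) -> int:
    """Encode a date as the Avro date logical type: days since 1970-01-01."""
    return (value - date(1970, 1, 1)).days


for name, type_name, nullable in describe_fields(SCHEMA):
    print(name, type_name, "nullable" if nullable else "required")
```

In this schema, lastname and age are unions that include null, so those fields may be empty, while birthdate is a required date.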

When you configure a Record Reader, you must specify the origin of the schema. A schema may be:

- Inferred from the input data (auto-discovered).
- Embedded in the input data, such as with Parquet and Avro files.
- Explicitly specified when you configure the reader.
- Retrieved from a schema registry.

Once you identify and specify the schema of the incoming FlowFiles, you have the information that
you need to select and configure the Record Reader and create the Greenplum Database table.

Configuring the Record Reader Controller Service

The PutGreenplumRecord processor uses a controller service to parse and deserialize incoming data in

FlowFiles.

You choose and configure a Record Reader type and instance that corresponds to the format of the data in

the incoming FlowFiles as described in About the Record Reader. You must specify the origin of the

schema when you configure the reader controller service.

Creating the Target Greenplum Table

You specify the name of the target Greenplum table when you configure the PutGreenplumRecord

processor. You must create this table before you initiate a NiFi dataflow; the Connector does not create the

table for you.

The data types that you specify for the target Greenplum Database table columns must match the data

types of the input FlowFile record fields. You can reference the schema of the input records, or the data

itself, to identify its type and definition.
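
One way to keep the table definition aligned with the input schema is to derive the column list from the schema itself. The Python sketch below is a rough illustration only: the type mapping in PRIMITIVE_TYPES is an assumption chosen for this example rather than a mapping documented for the Connector, and the table name datatypes_table is hypothetical.

```python
import json

# Hypothetical Avro-to-Greenplum type mapping, chosen for illustration only;
# it is not taken from the Connector documentation.
PRIMITIVE_TYPES = {
    "int": "integer",
    "long": "bigint",
    "float": "real",
    "double": "double precision",
    "boolean": "boolean",
    "string": "text",
}

def column_type(avro_type):
    """Map an Avro field type to a column type name using the table above."""
    if isinstance(avro_type, list):          # union: use the first non-null branch
        non_null = [t for t in avro_type if t != "null"]
        return column_type(non_null[0])
    if isinstance(avro_type, dict):          # logical types
        logical = avro_type.get("logicalType")
        if logical == "date":
            return "date"
        if logical == "decimal":
            return "numeric({},{})".format(avro_type["precision"], avro_type.get("scale", 0))
        return column_type(avro_type["type"])
    return PRIMITIVE_TYPES[avro_type]

def create_table_sql(table_name, schema_text):
    """Build a CREATE TABLE statement whose columns mirror the schema fields."""
    schema = json.loads(schema_text)
    columns = ["{} {}".format(f["name"], column_type(f["type"])) for f in schema["fields"]]
    return "CREATE TABLE {} ( {} );".format(table_name, ", ".join(columns))

SCHEMA_TEXT = """
{ "name": "datatypes_record", "type": "record", "fields": [
  { "name": "lastname", "type": ["string", "null"] },
  { "name": "age", "type": ["int", "null"] },
  { "name": "birthdate", "type": {"type": "int", "logicalType": "date"} } ] }
"""

sql = create_table_sql("datatypes_table", SCHEMA_TEXT)
print(sql)
```

Review the generated statement before you run it; you may prefer different column types or want to add a distribution key.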

The column names that you specify when you create the Greenplum table must match the input FlowFile

field names with a few caveats. When you configure the PutGreenplumRecord processor, you specify if and

how you want the Connector to translate the field names to Greenplum table column names. You can:

Turn case-sensitivity of the translation on/off.

Specify the behavior of the Connector when there is no matching table column for one or more

FlowFile record fields.

Specify the behavior of the Connector when there is no matching FlowFile record field for one or

more table columns.

Specifying Field and Column Mapping Properties describes the available field-to-column translation

configuration options for the Connector.

The Greenplum table may be the target of INSERT, UPDATE, or MERGE operations. To update or merge

data in a Greenplum table, you must be able to identify a set of table columns that uniquely identifies a row

in the table. The Connector uses these Match Columns to locate existing table rows. About the Insert,

Merge, and Update Properties further describes these operations and configuration properties.

Building the DataFlow

Note: Apache NiFi must read the whole FlowFile to infer the schema from the input data;

this is often inefficient.

When you build a dataflow, you drag components from the NiFi toolbar to the canvas, configure the

components, and then connect them together.

When you build a dataflow using the PutGreenplumRecord processor, you configure it as described above,

and in Configuring the Connector.

Remember to define the success, failure, and retry relationships for the FlowFiles processed by the

PutGreenplumRecord, or auto-terminate them.

Starting the DataFlow

A component that you add to a dataflow is in the stopped state. You must start all linked components to

initiate the flow.

Starting a component may also trigger a dataflow. In other cases, some external event triggers a flow, such

as a new file added to a directory, or a new message emitted by some external data source.

Checking the Load Operation Results

When a load operation succeeds, the PutGreenplumRecord processor emits a SEND provenance event.

These events are displayed in the global or processor-specific Data Provenance dialog.

You can detect load operation errors in one or more of the following ways:

View messages written to the $NIFI_HOME/logs/nifi-app.log file.

When the upper right corner of the processor component displays a red square, the processor has

encountered one or more warnings or errors. Hovering over the red square pops up a dialog that

displays the recent warnings and errors emitted by the processor.

If you have configured FlowFile routing to a failure relationship, examine the connection and

downstream processor components.

You can also query the target Greenplum table to verify the load.

Configuring the Connector

You use the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum Connector

for Apache NiFi PutGreenplumRecord processor to load record-oriented data from any source into

Greenplum Database.

The PutGreenplumRecord processor accepts record-based FlowFiles, sending the data to the Greenplum

Streaming Server to write to Greenplum Database. When you configure the processor, you must identify the

type and instance of the RecordReader that corresponds to the format of the data contained in incoming

FlowFiles, the Greenplum connection specifics, and the Greenplum schema and table.

The default load mode for the Connector is to insert data into Greenplum. You can configure the processor

to merge or update data instead, and configuration properties for field to column translation and mappings

allow you to further specify these operations.

You configure the PutGreenplumRecord processor via the Configure Processor dialog. This dialog

includes SETTINGS, SCHEDULING, PROPERTIES, and COMMENTS tabs.

PutGreenplumRecord Settings

The SETTINGS tab specifies FlowFile routing and timeouts for the processor. You can also use this tab to

change the name of the processor and activate/deactivate the processor.

The Settings Tab documentation in the Apache NiFi User Guide describes the configuration options on this

tab.

PutGreenplumRecord Schedule

The SCHEDULING tab specifies the scheduling strategy, run schedule, and concurrency options for the

processor.

When you set Concurrent Tasks to a value greater than one, the processor runs with the specified number

of threads. The single PutGreenplumRecord processor instance will process multiple FlowFiles concurrently,

each managed by its own session.

The Scheduling Tab documentation in the Apache NiFi User Guide describes the configuration properties on

this tab.

PutGreenplumRecord Properties

The PROPERTIES tab of the Configure Processor dialog identifies the PutGreenplumRecord processor

configuration properties.

The Connector utilizes default values for many of the PutGreenplumRecord properties. You are required to

set the Record Reader, Greenplum Adapter, and Greenplum Table Name property values.

The PutGreenplumRecord processor configuration properties are listed and further described in the table

and topics below:

Property Name | Description | Default Value
Record Reader | The controller service that deserializes the input FlowFile. Required. |
Greenplum Adapter | The controller service that identifies and manages the Greenplum Database and Greenplum Streaming Server connection parameters. Required. |
Schema Name | The name of the Greenplum Database schema in which the target table resides. Required. | public
Table Name | The name of the target Greenplum table in which to load the data. Required. |
Operation Type | The type of load operation: INSERT, UPDATE, or MERGE. Required. | INSERT
Match Columns | The Greenplum table columns to match with the FlowFile record data. Required for the UPDATE and MERGE operation types. |
Translate Field Names | Boolean value that specifies if the Connector translates input FlowFile field names to Greenplum table column names. When true, the Connector uses case-insensitive matching and ignores underscores. When false, the Connector does not translate, and field and column names must match exactly. | true
Unmatched Field Behavior | Specifies the Connector’s behavior when an incoming FlowFile record has a field that does not map to a column in the Greenplum table. | Ignore Unmatched Fields
Unmatched Column Behavior | Specifies the Connector’s behavior when an incoming FlowFile record does not have a field mapping for every one of the Greenplum table columns. | Fail on Unmatched Columns
Rollback On Failure | Boolean value that specifies whether or not the Connector should roll back when it encounters an error processing a FlowFile. | false
Maximum Record Batch Size | Specifies the maximum number of records in each batch of data that the Connector will write to Greenplum. The Connector stores the batch in memory until it reaches this size. | 0 (write all records in a single transaction)

About the Insert, Merge, and Update Properties

The Connector supports inserting, merging, and updating records from a FlowFile into a Greenplum

Database table. You use the Operation Type property to specify the load mode:

Mode | Description
INSERT | Insert records as new rows into the Greenplum table (the default mode).
MERGE | Use Match Columns to match records to existing table rows, and update these rows with the data from the records. A record with no matching database row is inserted into the Greenplum table as a new row.
UPDATE | Use Match Columns to match records to existing table rows, and update these Greenplum table rows with data from the records.
Use operation.type Attribute | Obtain the load mode from an operation.type attribute in the FlowFile.

When Operation Type is UPDATE or MERGE, you must specify one or more Match Columns, a comma-

separated list of column names that uniquely identifies a row in the Greenplum table. The Connector ignores

the Match Columns property when the Operation Type is INSERT.
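
The difference between the three load modes can be sketched with an in-memory stand-in for the table. The Python below is a simplified model written for this explanation, not the Connector's implementation; the function apply_records and the sample rows are hypothetical.

```python
def apply_records(table, records, operation, match_columns=()):
    """Apply records to an in-memory 'table' (a list of dict rows) and return it."""
    if operation == "INSERT":
        # Match Columns are ignored; every record becomes a new row.
        table.extend(dict(r) for r in records)
        return table
    if operation not in ("UPDATE", "MERGE"):
        raise ValueError("unsupported operation: {}".format(operation))
    if not match_columns:
        raise ValueError("Match Columns are required for UPDATE and MERGE")
    for record in records:
        key = tuple(record[c] for c in match_columns)
        matches = [row for row in table
                   if tuple(row[c] for c in match_columns) == key]
        if matches:
            for row in matches:          # update the existing row(s)
                row.update(record)
        elif operation == "MERGE":       # only MERGE inserts unmatched records
            table.append(dict(record))
    return table

def copy_rows(rows):
    return [dict(r) for r in rows]

existing = [{"dept_id": 1, "month": 10, "expenses": "5.00"}]
incoming = [
    {"dept_id": 1, "month": 10, "expenses": "7.50"},   # matches the existing row
    {"dept_id": 2, "month": 10, "expenses": "1.25"},   # no matching row
]
keys = ("dept_id", "month")

inserted = apply_records(copy_rows(existing), incoming, "INSERT")
updated = apply_records(copy_rows(existing), incoming, "UPDATE", keys)
merged = apply_records(copy_rows(existing), incoming, "MERGE", keys)
print(len(inserted), len(updated), len(merged))
```

In this model, INSERT leaves three rows, UPDATE leaves one row with the new expenses value, and MERGE leaves two rows.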

Specifying Field and Column Mapping Properties

The Connector exposes properties that allow you to choose how you want the Connector to map FlowFile

record fields to Greenplum Database table columns.

The Translate Field Names property is a boolean value that specifies if the Connector translates field

names in the FlowFile record into column names in the Greenplum table. The default value is true; the

processor uses case-insensitive matching and ignores underscores when it translates field names into

column names. When the value is false, the FlowFile field names must match the Greenplum table column

names exactly, or the column value will not be updated.
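
The matching rule can be approximated in a few lines of Python. This is a sketch of the behavior described above (case-insensitive comparison that ignores underscores), not the Connector's code; normalize and map_fields_to_columns are hypothetical helper names.

```python
def normalize(name):
    """Lower-case a name and drop underscores, as the translation rule describes."""
    return name.replace("_", "").lower()

def map_fields_to_columns(field_names, column_names, translate=True):
    """Return {field name: column name} for every field that has a matching column."""
    if translate:
        by_key = {normalize(c): c for c in column_names}
        return {f: by_key[normalize(f)] for f in field_names if normalize(f) in by_key}
    columns = set(column_names)
    return {f: f for f in field_names if f in columns}   # exact match only

fields = ["deptId", "Month", "EXPENSES"]
columns = ["dept_id", "month", "expenses"]

translated = map_fields_to_columns(fields, columns, translate=True)
exact = map_fields_to_columns(fields, columns, translate=False)
print(translated)
print(exact)
```

With translation on, all three fields map to a column; with translation off, none of these differently cased names match.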

When an incoming FlowFile record has a field that does not map to any of the columns in the Greenplum

table, set the Unmatched Field Behavior property to specify how the Connector should handle the

situation:

Ignore Unmatched Fields - (the default) The Connector ignores any field in the FlowFile record

that cannot be mapped to a column in the Greenplum table.

Fail on Unmatched Fields - The Connector routes the FlowFile to the failure relationship when

the record has any field that cannot be mapped to a column in the table.

If an incoming FlowFile record does not have a field mapping for every one of the columns in the Greenplum

table, set the Unmatched Column Behavior property to specify how the Connector should handle the

situation:

Ignore Unmatched Columns - The Connector assumes that a column in the table that does not

have a matching field in the record is not required.

Warn on Unmatched Columns - The Connector assumes that a column in the table that does not

have a matching field in the record is not required, and the Connector logs a warning.

Fail on Unmatched Columns - (the default) A flow fails when a column exists in the table and

there is no matching field in the record. The Connector also logs an error.
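
The interaction of the two properties can be modeled as two set differences followed by a policy check. The Python sketch below assumes that field names have already been translated to column names; check_mapping and UnmatchedError are hypothetical names, and raising an exception stands in for routing the FlowFile to the failure relationship.

```python
import logging

logger = logging.getLogger("unmatched_sketch")

class UnmatchedError(Exception):
    """Stands in for routing the FlowFile to the failure relationship."""

def check_mapping(field_names, column_names,
                  field_behavior="Ignore Unmatched Fields",
                  column_behavior="Fail on Unmatched Columns"):
    """Return the matched names, applying the two unmatched-name policies."""
    fields, columns = set(field_names), set(column_names)
    unmatched_fields = sorted(fields - columns)
    unmatched_columns = sorted(columns - fields)

    if unmatched_fields and field_behavior == "Fail on Unmatched Fields":
        raise UnmatchedError("no column for fields: {}".format(unmatched_fields))

    if unmatched_columns:
        if column_behavior == "Fail on Unmatched Columns":
            logger.error("no field for columns: %s", unmatched_columns)
            raise UnmatchedError("no field for columns: {}".format(unmatched_columns))
        if column_behavior == "Warn on Unmatched Columns":
            logger.warning("no field for columns: %s", unmatched_columns)

    return sorted(fields & columns)

record_fields = ["dept_id", "month", "note"]        # "note" has no column
table_columns = ["dept_id", "month", "expenses"]    # "expenses" has no field

matched = check_mapping(record_fields, table_columns,
                        column_behavior="Warn on Unmatched Columns")
print(matched)
```

With the default property values, the same call would fail because the expenses column has no matching field.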

Specifying Failure Rollback Behavior

The Connector distinguishes between the transient and the non-recoverable errors that it encounters.

Transient errors are those for which the operation may succeed on a later retry, such as a failed connection

attempt to Greenplum Database. Conversely, a non-recoverable error, such as bad input data in a FlowFile, continues to fail when retried.

The Connector applies success or failure at the FlowFile level. That is, the Connector considers a write

operation successful if all records in a single FlowFile are written to the Greenplum Database table with no

errors. If a single record in the FlowFile fails to write for some reason (say the data is malformed), none of

the records in the FlowFile are written to Greenplum, and the Connector considers the operation failed.

Rollback On Failure is a boolean property that specifies whether or not the Connector rolls back the NiFi

session when it encounters a failure processing a FlowFile.

The default Rollback On Failure setting is false. When the Connector encounters an error while

processing a FlowFile, the FlowFile is routed to the failure or retry relationship based on the error type,

and the processor continues processing the next FlowFile.

When Rollback On Failure is true, the Connector:

Stops further processing a FlowFile when it encounters an error,

Rolls back the NiFi session; this penalizes the FlowFile and returns it to the incoming queue, and

Continues processing the next FlowFile.

The rolled back FlowFile may be processed repeatedly by the Connector until it is processed successfully

or removed by other means.

Be sure to set an adequate SETTINGS Yield Duration for the processor to avoid retrying too frequently.
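
The two settings can be compared with a small queue simulation. The Python below is a simplified model, not the Connector's logic. It assumes that transient errors go to the retry relationship and non-recoverable errors go to failure, and it uses a hypothetical max_attempts cap to stand in for a FlowFile being "removed by other means". It does not model penalization delays.

```python
from collections import deque

class TransientError(Exception):
    """An error that may not recur on retry, e.g. a failed connection attempt."""

class NonRecoverableError(Exception):
    """An error that recurs on every retry, e.g. malformed input data."""

def process_queue(flowfiles, load, rollback_on_failure=False, max_attempts=3):
    """Process FlowFile names in order and report where each one ends up."""
    queue = deque(flowfiles)
    attempts = {}
    routed = {"success": [], "retry": [], "failure": [], "removed": []}
    while queue:
        flowfile = queue.popleft()
        attempts[flowfile] = attempts.get(flowfile, 0) + 1
        try:
            load(flowfile)
        except (TransientError, NonRecoverableError) as error:
            if rollback_on_failure:
                if attempts[flowfile] < max_attempts:
                    queue.append(flowfile)           # back to the incoming queue
                else:
                    routed["removed"].append(flowfile)
            elif isinstance(error, TransientError):  # assumed routing
                routed["retry"].append(flowfile)
            else:
                routed["failure"].append(flowfile)
        else:
            routed["success"].append(flowfile)
    return routed

def make_loader():
    """Build a fake load function: b.csv fails once, c.csv always fails."""
    calls = {}
    def load(name):
        calls[name] = calls.get(name, 0) + 1
        if name == "b.csv" and calls[name] == 1:
            raise TransientError("connection refused")
        if name == "c.csv":
            raise NonRecoverableError("malformed record")
    return load

names = ["a.csv", "b.csv", "c.csv"]
no_rollback = process_queue(names, make_loader(), rollback_on_failure=False)
with_rollback = process_queue(names, make_loader(), rollback_on_failure=True)
print(no_rollback)
print(with_rollback)
```

In the rollback run, b.csv eventually succeeds, while c.csv keeps returning to the queue until the cap removes it. This is why a FlowFile with bad data needs some other exit path when Rollback On Failure is true.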

Choosing a Maximum Record Batch Size

For each FlowFile it receives, the Connector:

Opens and prepares the table for writing,

Performs one or more writes, and

Closes/commits the write.

The maximum number of records in a write call that the Connector makes to the Greenplum Streaming

Server is determined by the Maximum Record Batch Size that you specify for the processor.

The default value is zero (0); there is no limit on the batch size, and the Connector accumulates all FlowFile

content in memory before it writes to Greenplum in a single transaction.
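
The effect of the property on memory use can be pictured as chunking the record stream. The following Python generator is a sketch of that idea only; the function name batches is hypothetical, and the Connector's actual write calls are not shown.

```python
from itertools import islice

def batches(records, max_batch_size=0):
    """Yield lists of at most max_batch_size records; 0 means one unbounded batch."""
    iterator = iter(records)
    if max_batch_size <= 0:
        yield list(iterator)      # everything is held in memory at once
        return
    while True:
        batch = list(islice(iterator, max_batch_size))
        if not batch:
            return
        yield batch               # only this batch is held in memory

limited = list(batches(range(10), max_batch_size=4))
unlimited = list(batches(range(10)))
print([len(b) for b in limited])
print([len(b) for b in unlimited])
```

With a limit of 4, ten records are written in batches of 4, 4, and 2; with the default of 0, all ten are written in a single batch.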

About the Record Reader

The PutGreenplumRecord Record Reader property identifies the controller service that the processor uses

to deserialize incoming data into NiFi records. You select an appropriate reader based on the format of the

input data.

(This older Record-Oriented Data with NiFi blog describes the Apache NiFi processors and controller

services available for working with record-oriented data.)

You configure a PutGreenplumRecord Record Reader via the processor configuration dialog,

PROPERTIES tab. You can also add a new reader instance via the Operate Palette configuration dialog,

CONTROLLER SERVICES tab.

Note: The default Maximum Record Batch Size behavior, which accumulates all FlowFile content in memory,

may be inefficient for FlowFiles that consist of large and/or many records.

The Connector supports all compatible Record Reader controller services, and has been specifically tested

with certain data formats. These readers, data formats, and their schema origins are identified in the table

below.

Reader Name | Data Format | Schema | Description
AvroReader | Avro | Embedded in the Avro data, obtained from a schema registry, or explicitly specified. | Each Avro record is deserialized to a NiFi record.
CSVReader | CSV | Inferred from the data, obtained from a schema registry, or explicitly specified. | Each row is deserialized to a NiFi record.
JsonTreeReader | JSON | Inferred from the data, obtained from a schema registry, or explicitly specified. | Each JSON record is deserialized to a NiFi record.
ParquetReader | Parquet | Embedded in the Parquet data. | Each Parquet record is deserialized to a NiFi record.
XMLReader | XML | Inferred from the data, obtained from a schema registry, or explicitly specified. | The second level (within enclosing root tag) of XML data is deserialized into a NiFi record.

About the Greenplum Adapter

The PutGreenplumRecord Greenplum Adapter property identifies the controller service that specifies and

manages the connection to Greenplum Database. The only compatible controller service for this function is

named GreenplumGPSSAdapter.

You configure a PutGreenplumRecord Greenplum Adapter via the processor configuration dialog,

PROPERTIES tab. You can also add a new adapter instance via the Operate Palette configuration dialog,

CONTROLLER SERVICES tab.

When you configure the GreenplumGPSSAdapter controller service, you identify a Greenplum Streaming

Server instance host and port. You also specify Greenplum Database connection properties including

master host and port, database and user names, and user password.

While all connection properties are required, the Connector utilizes default values for many.

Property Name | Description | Default Value
Greenplum Streaming Server Host | The name or IP address of the host on which the GPSS instance is running. Required. |
Greenplum Streaming Server Port | The GPSS port number. Required. | 5000
Greenplum Database Master Host | The name or IP address of the Greenplum Database master host. Required. |
Greenplum Database Master Port | The Greenplum master server port number. Required. | 5432
Greenplum Database Name | The name of the Greenplum database. Required. | postgres
Greenplum User Name | The Greenplum user/role name to use to connect to the database. | gpadmin
Greenplum User Password | The password for the Greenplum user/role. Required. |

You can re-use a Greenplum Adapter controller service for any dataflows that use the specified Greenplum

Streaming Server instance to load data to the same Greenplum master host and database as the specified

user.

Example - Loading CSV Data from the File System

In this example, you use the VMware Greenplum Connector for Apache NiFi to load CSV-format data into

Greenplum Database.

The CSV data represents department expense records, and includes department identifier (integer), month

(integer), and expenses (decimal) fields. For example, a record for a department with identifier 123 that

spent $456.78 in the month of September follows:

"123","09","456.78"

A record with the same department identifier and month identifies a new expense total for the month,

replacing the previous amount.

You will use the Apache NiFi user interface to create a dataflow between the GetFile and

PutGreenplumRecord processors.

In this flow:

The GetFile processor reads CSV files from the /tmp/gcan_data directory on the NiFi system

and generates record-based FlowFiles.

The PutGreenplumRecord processor writes the data that it receives to a Greenplum Database table

named gcan_dept_expense located in the public schema of a database named testdb.

The department identifier and month fields together uniquely identify a table row.

The write to Greenplum should specify the MERGE operation type; if an entry for the

department/month does not exist, insert a new row into the table. If an entry for the department for

the month already exists, replace the expenses amount with the new value.

You will explicitly specify the input FlowFile schema.

Prerequisites

Before you start this procedure, ensure that you:

Have access to a running Greenplum Database cluster.

Have access to a running Greenplum Streaming Server instance, or the privileges required to start

an instance.

Have met the Prerequisites identified in the

topic.

For simplicity, this example assumes that Apache NiFi, Greenplum Database, and the Greenplum

Streaming Server are running on the same host.

Process

Step 1: Prepare the Example Environment

Step 2: Add and Configure the GetFile Processor

Step 3: Configure a GreenplumGPSSAdapter Controller Service

Step 4: Identify the Input Data Source, Format, and Schema

Step 5: Configure a Record Reader Controller Service

Step 6: Add and Configure the PutGreenplumRecord Processor

Step 7: Connect and Start the Processors

Step 8: Create the Greenplum Database and Table

Step 9: Trigger the Flow and Check Results

Step 1: Prepare the Example Environment

In this step, you create sample data files.

1. Log in to your Apache NiFi client system.

$ ssh user@nifihost

user@nifihost$

2. Create a working directory. For example:

user@nifihost$ mkdir gcan_work

user@nifihost$ cd gcan_work

3. Prepare some sample data:

1. Write some data into a CSV file named sample1.csv:

user@nifihost$ echo '"dept_id","month","expenses"

"1313131","12","1313.13"

"3535353","11","761.35"

"7979797","10","4489.00"

"7979797","11","18.72"

"3535353","10","6001.94"

"7979797","12","173.18"

"1313131","10","492.83"

"3535353","12","81.12"

"1313131","11","368.27"' > sample1.csv

2. Write some data into a CSV file named sample2.csv:

user@nifihost$ echo '"dept_id","month","expenses"

"1313131","11","555.55"

"7979797","10","5555.55"

"2222222","12","22.22"' > sample2.csv

The data added to this file represents an expense for a new department (2222222), and

new/updated expense values for two existing departments/months.

4. Create an input directory and set the appropriate permissions:

user@nifihost$ mkdir /tmp/gcan_data

user@nifihost$ chmod a+rwx /tmp/gcan_data

You will copy the sample data files to the input directory later in this procedure.

5. Start the Apache NiFi user interface. For example, if your NiFi server is running on the local host on

port number 9050, enter the following in a web browser window:

http://localhost:9050

Step 2: Add and Configure the GetFile Processor

Perform the following steps to add and configure a GetFile processor instance:

1. Click the Processors icon in the Apache NiFi components toolbar and drag it to the canvas.

This action opens the Add Processor dialog.

2. Search for the GetFile Processor by typing in the Filter field.

3. Click Add.

This action adds a GetFile processor component to the canvas.

4. Right-click on the component and select Configure from the context menu.

This action displays the Configure Processor dialog.

5. Select the PROPERTIES tab.

1. Locate the Input Directory property and set the Value to the directory that you created in

Step 1, /tmp/gcan_data.

2. Click OK.

3. APPLY the Configure Processor changes.

Step 3: Configure a GreenplumGPSSAdapter Controller Service

Perform the steps below to configure an instance of the GreenplumGPSSAdapter controller service named

GreenplumGPSSAdapter-testdb:

1. Click on an empty area in the Apache NiFi canvas.

2. Click on the configure icon in the Operate Palette.

This action opens the NiFi Flow Configuration dialog.

3. Select the CONTROLLER SERVICES tab.

4. Click the + icon to add a new controller service.

This action opens the Add Controller Service dialog.

5. Type Greenplum in the Filter field, select the GreenplumGPSSAdapter entry, and click ADD.

This action adds a GreenplumGPSSAdapter row to the table of currently defined controller services,

and selects this row.

6. Click on the configure icon in the last column of the table to configure the service.

This action opens the Configure Controller Service dialog.

7. Select the SETTINGS tab, locate the Name field, and set the name to GreenplumGPSSAdapter-testdb.

8. Select the PROPERTIES tab, locate the properties identified in the table below, and set each Value

as specified:

Property Name | Value | Comments
Greenplum Streaming Server Host | localhost | Enter your host
Greenplum Streaming Server Port | 5000 | Retain the default
Greenplum Database Master Host | localhost | Enter your Greenplum master host
Greenplum Database Master Port | 5432 | Retain the default
Greenplum Database Name | testdb |
Greenplum Database User Name | gpadmin | You can choose a different Greenplum user
Greenplum Database User Password | changeme | Enter the password

9. APPLY the Configure Controller Service changes.

10. Click the thunderbolt icon in the GreenplumGPSSAdapter-testdb row to enable the controller

service.

The Enable Controller Service dialog displays.

1. Click the ENABLE button.

2. Click the CLOSE button.

11. Click X in the upper right hand of the dialog to close the NiFi Flow Configuration window.

Step 4: Identify the Input Data Source, Format, and Schema

The source of the data is the GetFile processor, and the data format is CSV.

Because the CSV file includes a header row, you could choose to have Apache NiFi infer the schema. For

this exercise, you will explicitly define and specify the schema.

As described above, the CSV data represents department expense records, and includes department

identifier (integer), month (integer), and expenses (decimal) fields:

"123","09","456.78"

The schema that corresponds to records of this format follows:

{

"name": "dept_expense_record",

"namespace": "nifi_csv_example",

"type": "record",

"fields": [

{ "name": "dept_id", "type": ["int", "null"] },

{ "name": "month", "type": ["int", "null"] },

{ "name": "expenses", "type": {"type": "bytes", "logicalType": "decimal", "precision": 11, "scale": 2 } }

]

}

You will specify this schema when you configure a record reader controller service for a

PutGreenplumRecord processor instance.
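
To see what this schema implies for each CSV row, the following Python sketch converts the quoted text values to an integer, an integer, and a two-decimal-place number. It is an illustration using the standard library only, not the CSVReader's implementation; parse_expenses is a hypothetical helper, and it does not handle the null values that the schema permits for dept_id and month.

```python
import csv
import io
from decimal import Decimal

# A header row plus the sample record shown above.
SAMPLE = '"dept_id","month","expenses"\n"123","09","456.78"\n'

def parse_expenses(csv_text):
    """Parse CSV text into typed records that follow the dept_expense_record schema."""
    reader = csv.DictReader(io.StringIO(csv_text))   # first line is the header
    records = []
    for row in reader:
        records.append({
            "dept_id": int(row["dept_id"]),          # raises ValueError on bad input
            "month": int(row["month"]),
            "expenses": Decimal(row["expenses"]).quantize(Decimal("0.01")),  # scale 2
        })
    return records

records = parse_expenses(SAMPLE)
print(records)
```

A non-numeric value in an integer field, such as zz, makes the conversion fail. Step 9 demonstrates the same condition with a bad data file.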

Step 5: Configure a Record Reader Controller Service

Perform the steps below to configure an instance of a CSV record reader controller service named

CSVReader-dept-expenses:

1. Click on an empty area in the Apache NiFi canvas.

2. Click on the configure icon in the Operate Palette.

This action opens the NiFi Flow Configuration dialog.

3. Select the CONTROLLER SERVICES tab.

4. Click the + icon to add a new controller service.

This action opens the Add Controller Service dialog.

5. Type CSV in the Filter field, select the CSVReader entry, and click ADD.

This action adds a CSVReader row to the table of currently defined controller services, and selects

this row.

6. Click on the configure icon in the last column of the table to configure the service.

This action opens the Configure Controller Service dialog.

7. Select the SETTINGS tab, locate the Name field, and set the name to CSVReader-dept-expenses.

8. Select the PROPERTIES tab, locate the properties identified in the table below, and set each Value

as specified:

Property Name | Value | Comments
Schema Access Strategy | Use ‘Schema Text’ Property | The Schema Text property value will specify the schema definition
Treat First Line as Header | true | The first line of the file is the header

9. Locate the Schema Text property, and copy/paste the schema definition below into the Value field:

{

"name": "dept_expense_record",

"namespace": "nifi_csv_example",

"type": "record",

"fields": [

{ "name": "dept_id", "type": ["int", "null"] },

{ "name": "month", "type": ["int", "null"] },

{ "name": "expenses", "type": {"type": "bytes", "logicalType": "decimal",

"precision": 11, "scale": 2 } }

]

}

10. Retain the default values for the other properties.

11. APPLY the Configure Controller Service changes.

12. Click the thunderbolt icon in the CSVReader-dept-expenses row to enable the controller service.

The Enable Controller Service dialog displays.

1. Click the ENABLE button.

2. Click the CLOSE button.

13. Click X in the upper right hand of the dialog to close the NiFi Flow Configuration window.

Step 6: Add and Configure the PutGreenplumRecord Processor

Perform the following steps to add and configure a PutGreenplumRecord processor instance:

1. Click the Processors icon in the Apache NiFi components toolbar and drag it to the canvas.

This action opens the Add Processor dialog.

2. Search for the PutGreenplumRecord Processor by typing in the Filter field.

3. Click Add.

This action adds a PutGreenplumRecord Processor component to the canvas.

4. Right-click on the component and select Configure from the context menu.

This action displays the Configure Processor dialog.

5. Select the SETTINGS tab.

6. Automatically terminate all relationships by checking the failure, retry, and success checkboxes.

7. Select the PROPERTIES tab.

8. Locate the Record Reader property. Click in the Value field, select CSVReader-dept-expenses from the drop-down menu, and click OK.

9. Locate the Greenplum Adapter property. Click in the Value field, select GreenplumGPSSAdapter-testdb from the drop-down menu, and click OK.

10. Locate the properties identified in the table below and set each Value as specified:

Property Name | Value | Comments
Schema Name | public | Retain the default
Table Name | gcan_dept_expense | You will create this table in Step 8
Operation Type | MERGE | Merge can both insert and update a table row
Match Columns | dept_id, month | A table row is uniquely identified by these column values
Translate Field Names | true | Retain the default
Unmatched Field Behavior | Ignore Unmatched Fields | Retain the default
Unmatched Column Behavior | Warn on Unmatched Columns | Log a warning message
Rollback On Failure | false | Retain the default
Maximum Record Batch Size | 100 |

11. APPLY the Configure Processor changes.

Step 7: Connect and Start the Processors

In this step, you create a connection between the GetFile and PutGreenplumRecord processors on the

canvas, and then start the processors.

1. Hover over the GetFile component on the canvas.

2. Click the arrow icon and drag over to the PutGreenplumRecord component.

This action displays the Create Connection dialog.

3. No configuration is required; click ADD to create the connection.

A line/box that represents the connection is displayed on the NiFi canvas.

4. Right-click on the GetFile component and select Start from the context menu to start the

processor.

The icon next to the processor name changes to a green sideways triangle.

5. Right-click on the PutGreenplumRecord component and select Start from the context menu to start

the processor.

The icon next to the processor name changes to a green sideways triangle.

Step 8: Create the Greenplum Database and Table

In this step, you create the Greenplum database testdb if it does not yet exist, and create the target

Greenplum table.

1. Open a new terminal window, log in to the Greenplum Database master host as the gpadmin

administrative user, and set up your Greenplum environment. For example:

$ ssh gpadmin@gpmaster

gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh

2. Create a database named testdb if one does not already exist:

gpadmin@gpmaster$ createdb testdb

3. Start the psql subsystem:

gpadmin@gpmaster$ psql -d testdb

4. The Greenplum Streaming Server must be registered in the database to use the Connector. You

can register the Greenplum Streaming Server as follows:

testdb=# CREATE EXTENSION IF NOT EXISTS gpss;

This command registers the extension only if it has not been previously registered.

5. Create the target Greenplum Database table named gcan_dept_expense:

testdb=# CREATE TABLE gcan_dept_expense( dept_id int8, month int8, expenses decimal(11,2) );

This table definition matches the input data schema that you specified for the record reader in Step 5.

6. Leave the psql session open; you will return to it in the next step.

Step 9: Trigger the Flow and Check Results

You will individually copy the sample data files to /tmp/gcan_data on the Apache NiFi system to trigger the

flow. You will check the results by observing the Apache NiFi user interface and querying the Greenplum

table.

You will also generate a sample file with bad data, trigger the flow, and check the results.

1. Copy the sample1.csv data file to the input directory:

user@nifihost$ cp gcan_work/sample1.csv /tmp/gcan_data/

2. Examine the GetFile and PutGreenplumRecord processor components on the NiFi canvas, and

notice when their statistics update.

3. Examine the contents of the Greenplum Database table. Enter the following command in the psql

terminal session that you used earlier:

testdb=# SELECT * FROM gcan_dept_expense ORDER BY dept_id, month;

 dept_id | month | expenses
---------+-------+----------
 1313131 |    10 |   492.83
 1313131 |    11 |   555.55
 1313131 |    12 |  1313.13
 3535353 |    10 |  6001.94
 3535353 |    11 |   761.35
 3535353 |    12 |    81.12
 7979797 |    10 |  5555.55
 7979797 |    11 |    18.72
 7979797 |    12 |   173.18
(9 rows)

4. Copy the sample2.csv data file to the input directory:

user@nifihost$ cp gcan_work/sample2.csv /tmp/gcan_data/

5. Wait until the flow between the GetFile and PutGreenplumRecord processor components is triggered.

6. Query the table again:

testdb=# SELECT * FROM gcan_dept_expense ORDER BY dept_id, month;

 dept_id | month | expenses
---------+-------+----------
 1313131 |    10 |   492.83
 1313131 |    11 |   555.55
 1313131 |    12 |  1313.13
 2222222 |    12 |    22.22
 3535353 |    10 |  6001.94
 3535353 |    11 |   761.35
 3535353 |    12 |    81.12
 7979797 |    10 |  5555.55
 7979797 |    11 |    18.72
 7979797 |    12 |   173.18
(10 rows)

Notice the new row for department 2222222, and the updated expenses values for department 1313131, month 11 and department 7979797, month 10.

7. Write a sample file with bad input data directly to the input directory:

user@nifihost$ echo '"dept_id","month","expenses"
"1313131","12","12222.22"
"7979797","zz","5555.55"' > /tmp/gcan_data/sample3.csv


This data includes the value zz in what should be an int field.

8. Observe the NiFi canvas and wait for the flow to trigger. Notice that the PutGreenplumRecord processor canvas component eventually displays a red box in the right-hand corner. Hover over the red box and view the warning message. The processor generates a NumberFormatException when attempting to write the second record to the Greenplum table.

9. Query the table again. In this query, filter on the department identifier in the first record of the sample3.csv data file to display only the table rows associated with that department:

testdb=# SELECT * FROM gcan_dept_expense WHERE dept_id=1313131 ORDER BY month;

 dept_id | month | expenses
---------+-------+----------
 1313131 |    10 |   492.83
 1313131 |    11 |   555.55
 1313131 |    12 |  1313.13
(3 rows)

Notice that the first record in sample3.csv, even though correctly formatted, was not written to the table. The Connector must process all records in the FlowFile successfully before it will write the batch to Greenplum Database.
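Because one bad record causes the whole FlowFile to be rejected, it can help to check a data file before copying it to the input directory. The following is a rough pre-check sketch, not the Connector's own validation. The function name check_expense_csv is hypothetical, and the column rules are assumptions derived from the gcan_dept_expense table definition (two integer columns and one decimal column). It does not enforce the decimal(11,2) precision limit and assumes simple quoted values with no embedded commas.

```shell
# check_expense_csv is a hypothetical helper. The column rules below are
# assumptions based on the table definition: int8, int8, decimal(11,2).
check_expense_csv() {
    awk -F ',' '
        NR == 1 { next }                      # skip the header row
        {
            raw = $0                          # keep the original line for the report
            ok = (NF == 3)
            for (i = 1; i <= NF; i++) gsub(/"/, "", $i)   # strip the double quotes
            if (ok && ($1 !~ /^-?[0-9]+$/ || $2 !~ /^-?[0-9]+$/ || $3 !~ /^-?[0-9]+(\.[0-9]+)?$/))
                ok = 0
            if (!ok) { printf "line %d: %s\n", NR, raw; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

The function prints each record that fails the check and returns a nonzero status if any record fails, so a copy into /tmp/gcan_data can be made conditional on it. For a file with the sample3.csv contents shown above, it reports the record on line 3 that contains zz.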
