MANNING

Bas Harenslak

Julian de Ruiter

[Figure: Pipeline as DAG. A DAG file (Python) defines tasks 1 through 4, each representing a task/operation we want to run; the dependencies between tasks (for example, task 3 must run before task 4); and the schedule interval (@daily), which indicates the schedule to use for running the DAG.]
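The pipeline-as-DAG idea can be sketched in plain Python. This is a minimal illustration that uses only the standard library (graphlib, Python 3.9+), not Airflow's own API. The task names are hypothetical, and only the "task 3 before task 4" dependency is taken from the figure; the other edges are assumptions made so the graph is complete.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-ins for the four tasks in the figure.
# Each key maps a task to the set of tasks that must finish before it starts.
# Only "task_3 before task_4" comes from the figure; the other edges are assumed.
dependencies = {
    "task_1": set(),
    "task_2": {"task_1"},
    "task_3": {"task_2"},
    "task_4": {"task_3"},
}

# How often the whole graph should be run (illustrative value only).
schedule_interval = "@daily"


def execution_order(deps):
    """Return one valid order in which the tasks can be executed."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(f"schedule={schedule_interval}, order={order}")
```

Describing the pipeline as a graph rather than a fixed script lets the execution order be derived from the dependencies; if the dependencies formed a cycle, static_order would raise graphlib.CycleError instead of returning an order.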

Data Pipelines with Apache Airflow

BAS HARENSLAK AND JULIAN DE RUITER

MANNING

SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit

www.manning.com. The publisher offers discounts on this book when ordered in quantity.

For more information, please contact

Special Sales Department

Manning Publications Co.

20 Baldwin Road

PO Box 761

Shelter Island, NY 11964

Email: orders@manning.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in

any form or by means electronic, mechanical, photocopying, or otherwise, without prior written

permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are

claimed as trademarks. Where those designations appear in the book, and Manning Publications

was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have

the books we publish printed on acid-free paper, and we exert our best efforts to that end.

Recognizing also our responsibility to conserve the resources of our planet, Manning books

are printed on paper that is at least 15 percent recycled and processed without the use of

elemental chlorine.

Development editor: Tricia Louvar

Technical development editor: Arthur Zubarev

Review editor: Aleks Dragosavljević

Production editor: Deirdre S. Hiam

Copy editor: Michele Mitchell

Proofreader: Keri Hales

Technical proofreader: Al Krinker

Typesetter: Dennis Dalinnik

Cover designer: Marija Tudor

ISBN: 9781617296901

Printed in the United States of America

brief contents

PART 1 GETTING STARTED

1 ■ Meet Apache Airflow

2 ■ Anatomy of an Airflow DAG

3 ■ Scheduling in Airflow

4 ■ Templating tasks using the Airflow context

5 ■ Defining dependencies between tasks

PART 2 BEYOND THE BASICS

6 ■ Triggering workflows

7 ■ Communicating with external systems

8 ■ Building custom components

9 ■ Testing

10 ■ Running tasks in containers

PART 3 AIRFLOW IN PRACTICE

11 ■ Best practices

12 ■ Operating Airflow in production

13 ■ Securing Airflow

14 ■ Project: Finding the fastest way to get around NYC

PART 4 IN THE CLOUDS

15 ■ Airflow in the clouds

16 ■ Airflow on AWS

17 ■ Airflow on Azure

18 ■ Airflow in GCP

contents

preface

acknowledgments

about this book

about the authors

about the cover illustration

PART 1 GETTING STARTED

1 Meet Apache Airflow

1.1 Introducing data pipelines

Data pipelines as graphs ■ Executing a pipeline graph

Pipeline graphs vs. sequential scripts ■ Running pipeline using workflow managers

1.2 Introducing Airflow

Defining pipelines flexibly in (Python) code ■ Scheduling and executing pipelines

Monitoring and handling failures ■ Incremental loading and backfilling

1.3 When to use Airflow

Reasons to choose Airflow ■ Reasons not to choose Airflow

1.4 The rest of this book

2 Anatomy of an Airflow DAG

2.1 Collecting data from numerous sources

Exploring the data

2.2 Writing your first Airflow DAG

Tasks vs. operators ■ Running arbitrary Python code

2.3 Running a DAG in Airflow

Running Airflow in a Python environment ■ Running Airflow in Docker containers

Inspecting the Airflow UI

2.4 Running at regular intervals

2.5 Handling failing tasks

3 Scheduling in Airflow

3.1 An example: Processing user events

3.2 Running at regular intervals

Defining scheduling intervals ■ Cron-based intervals ■ Frequency-based intervals

3.3 Processing data incrementally

Fetching events incrementally ■ Dynamic time references using execution dates

Partitioning your data

3.4 Understanding Airflow’s execution dates

Executing work in fixed-length intervals

3.5 Using backfilling to fill in past gaps

Executing work back in time

3.6 Best practices for designing tasks

Atomicity ■ Idempotency

4 Templating tasks using the Airflow context

4.1 Inspecting data for processing with Airflow

Determining how to load incremental data

4.2 Task context and Jinja templating

Templating operator arguments ■ What is available for templating?

Templating the PythonOperator ■ Providing variables to the PythonOperator

Inspecting templated arguments

4.3 Hooking up other systems

5 Defining dependencies between tasks

5.1 Basic dependencies

Linear dependencies ■ Fan-in/-out dependencies

5.2 Branching

Branching within tasks ■ Branching within the DAG

5.3 Conditional tasks

Conditions within tasks ■ Making tasks conditional ■ Using built-in operators

5.4 More about trigger rules

What is a trigger rule? ■ The effect of failures ■ Other trigger rules

5.5 Sharing data between tasks

Sharing data using XComs ■ When (not) to use XComs ■ Using custom XCom backends

5.6 Chaining Python tasks with the Taskflow API

Simplifying Python tasks with the Taskflow API ■ When (not) to use the Taskflow API

PART 2 BEYOND THE BASICS

6 Triggering workflows

6.1 Polling conditions with sensors

Polling custom conditions ■ Sensors outside the happy flow

6.2 Triggering other DAGs

Backfilling with the TriggerDagRunOperator ■ Polling the state of other DAGs

6.3 Starting workflows with REST/CLI

7 Communicating with external systems

7.1 Connecting to cloud services

Installing extra dependencies ■ Developing a machine learning model

Developing locally with external systems

7.2 Moving data from between systems

Implementing a PostgresToS3Operator ■ Outsourcing the heavy work

8 Building custom components

8.1 Starting with a PythonOperator

Simulating a movie rating API ■ Fetching ratings from the API ■ Building the actual DAG

8.2 Building a custom hook

Designing a custom hook ■ Building our DAG with the MovielensHook

8.3 Building a custom operator

Defining a custom operator ■ Building an operator for fetching ratings

8.4 Building custom sensors

8.5 Packaging your components

Bootstrapping a Python package ■ Installing your package

9 Testing

9.1 Getting started with testing

Integrity testing all DAGs ■ Setting up a CI/CD pipeline ■ Writing unit tests

Pytest project structure ■ Testing with files on disk

9.2 Working with DAGs and task context in tests

Working with external systems

9.3 Using tests for development

Testing complete DAGs

9.4 Emulate production environments with Whirl

9.5 Create DTAP environments

10 Running tasks in containers

10.1 Challenges of many different operators

Operator interfaces and implementations ■ Complex and conflicting dependencies

Moving toward a generic operator

10.2 Introducing containers

What are containers? ■ Running our first Docker container ■ Creating a Docker image

Persisting data using volumes

10.3 Containers and Airflow

Tasks in containers ■ Why use containers?

10.4 Running tasks in Docker

Introducing the DockerOperator ■ Creating container images for tasks

Building a DAG with Docker tasks ■ Docker-based workflow

10.5 Running tasks in Kubernetes

Introducing Kubernetes ■ Setting up Kubernetes ■ Using the KubernetesPodOperator

Diagnosing Kubernetes-related issues ■ Differences with Docker-based workflows

PART 3 AIRFLOW IN PRACTICE

11 Best practices

11.1 Writing clean DAGs

Use style conventions ■ Manage credentials centrally

Specify configuration details consistently ■ Avoid doing any computation in your DAG definition

Use factories to generate common patterns ■ Group related tasks using task groups

Create new DAGs for big changes

11.2 Designing reproducible tasks

Always require tasks to be idempotent ■ Task results should be deterministic

Design tasks using functional paradigms

11.3 Handling data efficiently

Limit the amount of data being processed ■ Incremental loading/processing

Cache intermediate data ■ Don’t store data on local file systems

Offload work to external/source systems

11.4 Managing your resources

Managing concurrency using pools ■ Detecting long-running tasks using SLAs and alerts

12 Operating Airflow in production

12.1 Airflow architectures

Which executor is right for me? ■ Configuring a metastore for Airflow

A closer look at the scheduler

12.2 Installing each executor

Setting up the SequentialExecutor ■ Setting up the LocalExecutor

Setting up the CeleryExecutor ■ Setting up the KubernetesExecutor

12.3 Capturing logs of all Airflow processes

Capturing the webserver output ■ Capturing the scheduler output

Capturing task logs ■ Sending logs to remote storage

12.4 Visualizing and monitoring Airflow metrics

Collecting metrics from Airflow ■ Configuring Airflow to send metrics

Configuring Prometheus to collect metrics ■ Creating dashboards with Grafana

What should you monitor?

12.5 How to get notified of a failing task

Alerting within DAGs and operators ■ Defining service-level agreements

Scalability and performance ■ Controlling the maximum number of running tasks

System performance configurations ■ Running multiple schedulers

13 Securing Airflow

13.1 Securing the Airflow web interface

Adding users to the RBAC interface ■ Configuring the RBAC interface

13.2 Encrypting data at rest

Creating a Fernet key

13.3 Connecting with an LDAP service

Understanding LDAP ■ Fetching users from an LDAP service

13.4 Encrypting traffic to the webserver

Understanding HTTPS ■ Configuring a certificate for HTTPS

13.5 Fetching credentials from secret management systems

14 Project: Finding the fastest way to get around NYC

14.1 Understanding the data

Yellow Cab file share ■ Citi Bike REST API ■ Deciding on a plan of approach

14.2 Extracting the data

Downloading Citi Bike data ■ Downloading Yellow Cab data

14.3 Applying similar transformations to data

14.4 Structuring a data pipeline

14.5 Developing idempotent data pipelines

PART 4 IN THE CLOUDS

15 Airflow in the clouds

15.1 Designing (cloud) deployment strategies

15.2 Cloud-specific operators and hooks

15.3 Managed services

Astronomer.io ■ Google Cloud Composer ■ Amazon Managed Workflows for Apache Airflow

15.4 Choosing a deployment strategy

16 Airflow on AWS

16.1 Deploying Airflow in AWS

Picking cloud services ■ Designing the network ■ Adding DAG syncing

Scaling with the CeleryExecutor ■ Further steps

16.2 AWS-specific hooks and operators

16.3 Use case: Serverless movie ranking with AWS Athena

Overview ■ Setting up resources ■ Building the DAG ■ Cleaning up

17 Airflow on Azure

17.1 Deploying Airflow in Azure

Picking services ■ Designing the network ■ Scaling with the CeleryExecutor ■ Further steps

17.2 Azure-specific hooks/operators

17.3 Example: Serverless movie ranking with Azure Synapse

Overview ■ Setting up resources ■ Building the DAG ■ Cleaning up

18 Airflow in GCP

18.1 Deploying Airflow in GCP

Picking services ■ Deploying on GKE with Helm ■ Integrating with Google services

Designing the network ■ Scaling with the CeleryExecutor

18.2 GCP-specific hooks and operators

18.3 Use case: Serverless movie ranking on GCP

Uploading to GCS ■ Getting data into BigQuery ■ Extracting top ratings

appendix A Running code samples

appendix B Package structures Airflow 1 and 2

appendix C Prometheus metric mapping

index

preface

We’ve both been fortunate to be data engineers in interesting and challenging times.

For better or worse, many companies and organizations are realizing that data plays a

key role in managing and improving their operations. Recent developments in

machine learning and AI have opened a slew of new opportunities to capitalize on.

However, adopting data-centric processes is often difficult, as it generally requires

coordinating jobs across many different heterogeneous systems and tying everything

together in a nice, timely fashion for the next analysis or product deployment.

In 2014, engineers at Airbnb recognized the challenges of managing complex data

workflows within the company. To address those challenges, they started developing

Airflow: an open source solution that allowed them to write and schedule workflows

and monitor workflow runs using the built-in web interface.

The success of the Airflow project quickly led to its adoption under the Apache

Software Foundation, first as an incubator project in 2016 and later as a top-level

project in 2019. As a result, many large companies now rely on Airflow for orchestrating

numerous critical data processes.

Working as consultants at GoDataDriven, we’ve helped various clients adopt

Airflow as a key component in projects involving the building of data lakes/platforms,

machine learning models, and so on. In doing so, we realized that handing over these

solutions can be challenging, as complex tools like Airflow can be difficult to learn

overnight. For this reason, we also developed an Airflow training program at

GoDataDriven, and have frequently organized and participated in meetings to share our

knowledge, views, and even some open source packages. Combined, these efforts have

helped us explore the intricacies of working with Airflow, which were not always easy

to understand using the documentation available to us.

In this book, we aim to provide a comprehensive introduction to Airflow that

covers everything from building simple workflows to developing custom components and

designing/managing Airflow deployments. We intend to complement many of the

excellent blogs and other online documentation by bringing several topics together in

one place, using a concise and easy-to-follow format. In doing so, we hope to kickstart

your adventures with Airflow by building on top of the experience we’ve gained

through diverse challenges over the past years.

acknowledgments

This book would not have been possible without the support of many amazing people.

Colleagues from GoDataDriven and personal friends supported us and provided

valuable suggestions and critical insights. In addition, Manning Early Access Program

(MEAP) readers posted useful comments in the online forum.

Reviewers from the development process also contributed helpful feedback: Al

Krinker, Clifford Thurber, Daniel Lamblin, David Krief, Eric Platon, Felipe Ortega,

Jason Rendel, Jeremy Chen, Jiri Pik, Jonathan Wood, Karthik Sirasanagandla, Kent R.

Spillner, Lin Chen, Philip Best, Philip Patterson, Rambabu Posa, Richard Meinsen,

Robert G. Gimbel, Roman Pavlov, Salvatore Campagna, Sebastián Palma Mardones,

Thorsten Weber, Ursin Stauss, and Vlad Navitski.

At Manning, we owe special thanks to Brian Sawyer, our acquisitions editor, who

helped us shape the initial book proposal and believed in us being able to see it

through; Tricia Louvar, our development editor, who was very patient in answering all

our questions and concerns, provided critical feedback on each of our draft chapters,

and was an essential guide for us throughout this entire journey; and to the rest of the

staff as well: Deirdre Hiam, our project editor; Michele Mitchell, our copyeditor; Keri

Hales, our proofreader; and Al Krinker, our technical proofreader.

Bas Harenslak

I would like to thank my friends and family for their patience and support during this

year-and-a-half adventure that developed from a side project into countless days,

nights, and weekends. Stephanie, thank you for always putting up with me working at

the computer. Miriam, Gerd, and Lotte, thank you for your patience and belief in me

while writing this book. I would also like to thank the team at GoDataDriven for their

support and dedication to always learn and improve. I could not have imagined being

the author of a book when I started working five years ago.

Julian de Ruiter

First and foremost, I’d like to thank my wife, Anne Paulien, and my son, Dexter, for

their endless patience during the many hours that I spent doing “just a little more

work” on the book. This book would not have been possible without their unwavering

support. In the same vein, I’d also like to thank our family and friends for their

support and trust. Finally, I’d like to thank our colleagues at GoDataDriven for their

advice and encouragement, from whom I’ve also learned an incredible amount in the

past years.

about this book

Data Pipelines with Apache Airflow was written to help you implement data-oriented

workflows (or pipelines) using Airflow. The book begins with the concepts and mechanics

involved in programmatically building workflows for Apache Airflow using the Python

programming language. Then the book switches to more in-depth topics such as

extending Airflow by building your own custom components and comprehensively

testing your workflows. The final part of the book focuses on designing and managing

Airflow deployments, touching on topics such as security and designing architectures

for several cloud platforms.

Who should read this book

Data Pipelines with Apache Airflow is written for scientists and engineers who are

looking to develop basic workflows in Airflow, as well as engineers interested in more

advanced topics such as building custom components for Airflow or managing

Airflow deployments. As Airflow workflows and components are built in Python, we do

expect readers to have intermediate experience with programming in Python (i.e.,

have a good working knowledge of building Python functions and classes,

understanding concepts such as *args and **kwargs, etc.). Some experience with Docker is also

beneficial, as most of our code examples are run using Docker (though they can also

be run locally if you wish).
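As a quick self-check of the Python level assumed here, the following minimal sketch shows how *args and **kwargs collect and forward arguments. The function names and argument values are hypothetical and chosen only for illustration.

```python
def describe_call(*args, **kwargs):
    """Collect positional arguments into a tuple and keyword arguments into a dict."""
    return {"positional": args, "keyword": kwargs}


def run_task(task_id, *args, **kwargs):
    """Forward any extra arguments unchanged to another callable."""
    result = describe_call(*args, **kwargs)
    result["task_id"] = task_id
    return result


# "fetch_weather" binds to task_id; the remaining arguments are passed through.
summary = run_task("fetch_weather", "2021-01-01", retries=3, city="Amsterdam")
print(summary)
```

If the way the extra positional and keyword arguments travel from run_task to describe_call is clear, you should be comfortable with the function signatures used throughout the code examples.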

How this book is organized: A road map

The book consists of four sections that cover a total of 18 chapters.

Part 1 focuses on the basics of Airflow, explaining what Airflow is and outlining its

basic concepts.

■ Chapter 1 discusses the concept of data workflows/pipelines and how these can

be built using Apache Airflow. It also discusses the advantages and disadvantages

of Airflow compared to other solutions, including in which situations you might

not want to use Apache Airflow.

■ Chapter 2 goes into the basic structure of pipelines in Apache Airflow (also

known as DAGs), explaining the different components involved and how these

fit together.

■ Chapter 3 shows how you can use Airflow to schedule your pipelines to run at

recurring time intervals so that you can (for example) build pipelines that

incrementally load new data over time. The chapter also dives into some

intricacies in Airflow’s scheduling mechanism, which is often a source of confusion.

■ Chapter 4 demonstrates how you can use templating mechanisms in Airflow to

dynamically include variables in your pipeline definitions. This allows you to

reference things such as schedule execution dates within your pipelines.

■ Chapter 5 demonstrates different approaches for defining relationships between

tasks in your pipelines, allowing you to build more complex pipeline structures

with branches, conditional tasks, and shared variables.

Part 2 dives deeper into more complex Airflow topics, including interfacing

with external systems, building your own custom components, and designing tests for

your pipelines.

■ Chapter 6 shows how you can trigger workflows in other ways that don’t involve

fixed schedules, such as files being loaded or via an HTTP call.

■ Chapter 7 demonstrates workflows using operators that orchestrate various

tasks outside Airflow, allowing you to develop a flow of events through systems

that are not connected.

■ Chapter 8 explains how you can build custom components for Airflow that

allow you to reuse functionality across pipelines or integrate with systems that

are not supported by Airflow’s built-in functionality.

■ Chapter 9 discusses various options for testing Airflow workflows, touching on

several properties of operators and how to approach these during testing.

■ Chapter 10 demonstrates how you can use container-based workflows to run

pipeline tasks within Docker or Kubernetes and discusses the advantages and

disadvantages of these container-based approaches.

Part 3 focuses on applying Airflow in practice and touches on subjects such as best

practices, running/securing Airflow, and a final demonstrative use case.

■ Chapter 11 highlights several best practices to use when building pipelines, which

will help you to design and implement efficient and maintainable solutions.

■ Chapter 12 details several topics to account for when running Airflow in a

production setting, such as architectures for scaling out, monitoring, logging, and

alerting.

■ Chapter 13 discusses how to secure your Airflow installation to avoid unwanted

access and to minimize the impact in case a breach occurs.

■ Chapter 14 demonstrates an example Airflow project in which we periodically

process rides from New York City’s Yellow Cab and Citi Bikes to determine the

fastest means of transportation between neighborhoods.

Part 4 explores how to run Airflow in several cloud platforms and includes topics such

as designing Airflow deployments for the different clouds and how to use built-in

operators to interface with different cloud services.

■ Chapter 15 provides a general introduction by outlining which Airflow

components are involved in (cloud) deployments, introducing the idea behind

cloud-specific components built into Airflow, and weighing the options of rolling out

your own cloud deployment versus using a managed solution.

■ Chapter 16 focuses on Amazon’s AWS cloud platform, expanding on the

previous chapter by designing deployment solutions for Airflow on AWS and

demonstrating how specific components can be used to leverage AWS services.

■ Chapter 17 designs deployments and demonstrates cloud-specific components

for Microsoft’s Azure platform.

■ Chapter 18 addresses deployments and cloud-specific components for Google’s

GCP platform.

People new to Airflow should read chapters 1 and 2 to get a good idea of what Airflow

is and what it can do. Chapters 3–5 provide important information about Airflow’s key

functionality. The rest of the book discusses topics such as building custom

components, testing, best practices, and deployments and can be read out of order, based on

the reader’s particular needs.

About the code

All source code in listings or text is in a fixed-width font like this to separate it

from ordinary text. Sometimes code is also in bold to highlight code that has changed

from previous steps in the chapter, such as when a new feature adds to an existing line

of code.

In many cases, the original source code has been reformatted; we’ve added line

breaks and reworked indentation to accommodate the available page space in the

book. In rare cases, even this was not enough, and listings include line-continuation

markers (➥). Additionally, comments in the source code have often been removed

from the listings when the code is described in the text. Code annotations accompany

many of the listings, highlighting important concepts.

References to elements in the code, scripts, or specific Airflow classes/variables/

values are often in italics to help distinguish them from the surrounding text.

Source code for all examples and instructions to run them using Docker and

Docker Compose are available in our GitHub repository

(https://github.com/BasPH/data-pipelines-with-apache-airflow) and can be downloaded via the book’s website

(www.manning.com/books/data-pipelines-with-apache-airflow).

NOTE Appendix A provides more detailed instructions on running the code

examples.

All code samples have been tested with Airflow 2.0. Most examples should also run on

older versions of Airflow (1.10), with small modifications. Where possible, we have

included inline pointers on how to do so. To help you account for differences in

import paths between Airflow 2.0 and 1.10, appendix B provides an overview of

changed import paths between the two versions.

LiveBook discussion forum

Purchase of Data Pipelines with Apache Airflow includes free access to a private web

forum run by Manning Publications where you can make comments about the book,

ask technical questions, and receive help from the authors and other users. To access

the forum and subscribe to it, go to

https://livebook.manning.com/#!/book/data-pipelines-with-apache-airflow/discussion. This page provides information on how to

get on the forum once you’re registered, what kind of help is available, and its rules of

conduct.

Manning’s commitment to our readers is to provide a venue where a meaningful

dialogue between individual readers and between readers and the authors can take

place. It is not a commitment to any specific amount of participation on the part of

the authors, whose contribution to the forum remains voluntary (and unpaid). We

suggest you try asking the authors some challenging questions lest their interest stray!

The forum and the archives of previous discussions will be accessible from the pub-

lisher’s website as long as the book is in print.

about the authors

BAS HARENSLAK is a data engineer at GoDataDriven, a company developing

data-driven solutions located in Amsterdam, Netherlands. With a background in software

engineering and computer science, he enjoys working on software and data as if they

are challenging puzzles. He favors working on open source software, is a committer

on the Apache Airflow project, and is co-organizer of the Amsterdam Airflow meetup.

JULIAN DE RUITER is a machine learning engineer with a background in computer and

life sciences and has a PhD in computational cancer biology. As an experienced

software developer, he enjoys bridging the worlds of data science and engineering by

using cloud and open source software to develop production-ready machine learning

solutions. In his spare time, he enjoys developing his own Python packages,

contributing to open source projects, and tinkering with electronics.

about the cover illustration

The figure on the cover of Data Pipelines with Apache Airflow is captioned “Femme de

l’Isle de Siphanto,” or Woman from Island Siphanto. The illustration is taken from a

collection of dress costumes from various countries by Jacques Grasset de

Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797.

Each illustration is finely drawn and colored by hand. The rich variety of Grasset de

Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns

and regions were just 200 years ago. Isolated from each other, people spoke different

dialects and languages. In the streets or in the countryside, it was easy to identify

where they lived and what their trade or station in life was just by their dress.

The way we dress has changed since then and the diversity by region, so rich at the

time, has faded away. It is now hard to tell apart the inhabitants of different

continents, let alone different towns, regions, or countries. Perhaps we have traded cultural

diversity for a more varied personal life—certainly for a more varied and fast-paced

technological life.

At a time when it is hard to tell one computer book from another, Manning

celebrates the inventiveness and initiative of the computer business with book covers

based on the rich diversity of regional life of two centuries ago, brought back to life by

Grasset de Saint-Sauveur’s pictures.

Part 1

Getting started

This part of the book will set the stage for your journey into building

pipelines for all kinds of wonderful data processes using Apache Airflow. The first

two chapters are aimed at giving you an overview of what Airflow is and what it

can do for you.

First, in chapter 1, we’ll explore the concepts of data pipelines and sketch the

role Apache Airflow plays in helping you implement these pipelines. To set

expectations, we’ll also compare Airflow to several other technologies, and

discuss when it might or might not be a good fit for your specific use case. Next,

chapter 2 will teach you how to implement your first pipeline in Airflow. After

building the pipeline, we’ll also examine how to run this pipeline and monitor

its progress using Airflow’s web interface.

Chapters 3–5 dive deeper into key concepts of Airflow to give you a solid

understanding of Airflow’s underpinnings.

Chapter 3 focuses on scheduling semantics, which allow you to configure

Airflow to run your pipelines at regular intervals. This lets you (for example) write

pipelines that load and process data efficiently on a daily, weekly, or monthly

basis. Next, in chapter 4, we’ll discuss templating mechanisms in Airflow, which

allow you to dynamically reference variables such as execution dates in your

pipelines. Finally, in chapter 5, we’ll dive into different approaches for defining

task dependencies in your pipelines, which allow you to define complex task

hierarchies, including conditional tasks, branches, and so on.

If you’re new to Airflow, we recommend making sure you understand the

main concepts described in chapters 3–5, as these are key to using it effectively.

Airflow’s scheduling semantics (described in chapter 3) can be especially confusing

for new users, as they can be somewhat counterintuitive when first encountered.

After finishing part 1, you should be well-equipped to write your own basic

pipelines in Apache Airflow and be ready to dive into some more advanced topics in

parts 2–4.

Meet Apache Airflow

People and companies are continuously becoming more data-driven and are

developing data pipelines as part of their daily business. Data volumes involved in these

business processes have increased substantially over the years, from megabytes per

day to gigabytes per minute. Though handling this data deluge may seem like a

considerable challenge, these increasing data volumes can be managed with the

appropriate tooling.

This book focuses on Apache Airflow, a batch-oriented framework for building

data pipelines. Airflow’s key feature is that it enables you to easily build scheduled

data pipelines using a flexible Python framework, while also providing many building

blocks that allow you to stitch together the many different technologies encountered

in modern technological landscapes.

This chapter covers

• Showing how data pipelines can be represented in workflows as graphs of tasks

• Understanding how Airflow fits into the ecosystem of workflow managers

• Determining if Airflow is a good fit for you

Airflow is best thought of as a spider in a web: it sits in the middle of your data pro-

cesses and coordinates work happening across the different (distributed) systems. As

such, Airflow is not a data processing tool in itself but orchestrates the different com-

ponents responsible for processing your data in data pipelines.

In this chapter, we’ll first give you a short introduction to data pipelines in Apache

Airflow. Afterward, we’ll discuss several considerations to keep in mind when evaluat-

ing whether Airflow is right for you and demonstrate how to make your first steps with

Airflow.

1.1 Introducing data pipelines

Data pipelines generally consist of several tasks or actions that need to be executed to

achieve the desired result. For example, say we want to build a small weather dash-

board that tells us what the weather will be like in the coming week (figure 1.1). To

implement this live weather dashboard, we need to perform something like the fol-

lowing steps:

1. Fetch weather forecast data from a weather API.

2. Clean or otherwise transform the fetched data (e.g., converting temperatures from Fahrenheit to Celsius or vice versa), so that the data suits our purpose.

3. Push the transformed data to the weather dashboard.

As you can see, this relatively simple pipeline already consists of three different tasks

that each perform part of the work. Moreover, these tasks need to be executed in a

specific order, as it (for example) doesn’t make sense to try transforming the data

before fetching it. Similarly, we can’t push any new data to the dashboard until it has

undergone the required transformations. As such, we need to make sure that this

implicit task order is also enforced when running this data process.
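To make the implicit ordering concrete, the three steps can be sketched as plain Python functions. This is only an illustration under stated assumptions: the function names, the hard-coded forecast records, and the list standing in for the dashboard are all hypothetical, and no real weather API is called.

```python
def fetch_forecast():
    """Stand-in for a call to a weather API (hypothetical data, in Fahrenheit)."""
    return [{"day": "Mon", "temp_f": 68.0}, {"day": "Tue", "temp_f": 50.0}]


def clean_forecast(records):
    """Convert temperatures from Fahrenheit to Celsius."""
    return [
        {"day": r["day"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in records
    ]


def push_to_dashboard(records, dashboard):
    """Stand-in for pushing data to the dashboard (here: a plain list)."""
    dashboard.extend(records)


dashboard = []
raw = fetch_forecast()                 # step 1
clean = clean_forecast(raw)            # step 2 needs the output of step 1
push_to_dashboard(clean, dashboard)    # step 3 needs the output of step 2
print(dashboard)
```

Each call consumes the result of the previous one, which is exactly the ordering constraint that the pipeline has to enforce.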

1.1.1 Data pipelines as graphs

One way to make dependencies between tasks more explicit is to draw the data pipe-

line as a graph. In this graph-based representation, tasks are represented as nodes in

the graph, while dependencies between tasks are represented by directed edges

between the task nodes. The direction of the edge indicates the direction of the

dependency, with an edge pointing from task A to task B, indicating that task A needs

to be completed before task B can start. Note that this type of graph is generally called

a directed graph, due to the directions in the graph edges.

Figure 1.1 Overview of the weather dashboard use case, in which weather data is fetched from an external API and fed into a dynamic dashboard

Applying this graph representation to our weather dashboard pipeline, we can see that the graph provides a relatively intuitive representation of the overall pipeline (figure 1.2). By just quickly glancing at the graph, we can see that our pipeline consists of three different tasks, each corresponding to one of the tasks outlined. Other than this, the direction of the edges clearly indicates the order in which the tasks need to be executed: we can simply follow the arrows to trace the execution.

This type of graph is typically called a directed acyclic graph (DAG), as the graph con-

tains directed edges and does not contain any loops or cycles (acyclic). This acyclic prop-

erty is extremely important, as it prevents us from running into circular dependencies

(figure 1.3) between tasks (where task A depends on task B and vice versa). These cir-

cular dependencies become problematic when trying to execute the graph, as we run

into a situation where task 2 can only execute once task 3 has been completed, while

task 3 can only execute once task 2 has been completed. This logical inconsistency

leads to a deadlock type of situation, in which neither task 2 nor 3 can run, preventing

us from executing the graph.

Figure 1.2 Graph representation of the data pipeline for the weather dashboard (Fetch weather forecast → Clean forecast data → Push data to dashboard). Nodes represent tasks and directed edges represent dependencies between tasks (with an edge pointing from task A to task B, indicating that task A needs to be run before task B).

Figure 1.3 Cycles in graphs prevent task execution due to circular dependencies. In acyclic graphs (top), there is a clear path to execute the three different tasks. However, in cyclic graphs (bottom), there is no longer a clear execution path due to the interdependency between tasks 2 and 3: task 2 will never be able to execute, due to its dependency on task 3, which in turn depends on task 2.

Note that this representation is different from cyclic graph representations, which can contain cycles to illustrate iterative parts of algorithms (for example), as are common in many machine learning applications. However, the acyclic property of DAGs is used by Airflow (and many other workflow managers) to efficiently resolve and execute these graphs of tasks.
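As a rough illustration of why the acyclic property can be checked mechanically, the following sketch stores a graph as a dictionary mapping each task to the tasks it depends on and uses a depth-first search to detect cycles. The `has_cycle` helper and the dictionary layout are assumptions made for this example; they are not how Airflow represents DAGs internally.

```python
def has_cycle(upstream):
    """Return True if the dependency graph contains a cycle.

    `upstream` maps each task to the set of tasks it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, in progress, done
    color = {task: WHITE for task in upstream}

    def visit(task):
        color[task] = GRAY
        for dep in upstream[task]:
            if color[dep] == GRAY:  # we came back to a task still in progress
                return True
            if color[dep] == WHITE and visit(dep):
                return True
        color[task] = BLACK
        return False

    return any(color[task] == WHITE and visit(task) for task in upstream)


# The two graphs of figure 1.3: task 1 -> task 2 -> task 3, without and
# with an extra edge from task 3 back to task 2.
acyclic = {"task_1": set(), "task_2": {"task_1"}, "task_3": {"task_2"}}
cyclic = {"task_1": set(), "task_2": {"task_1", "task_3"}, "task_3": {"task_2"}}

print(has_cycle(acyclic))  # False
print(has_cycle(cyclic))   # True
```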

1.1.2 Executing a pipeline graph

A nice property of this DAG representation is that it provides a relatively straightfor-

ward algorithm that we can use for running the pipeline. Conceptually, this algorithm

consists of the following steps:

1. For each open (= uncompleted) task in the graph, do the following:

   – For each edge pointing toward the task, check if the “upstream” task on the other end of the edge has been completed.

   – If all upstream tasks have been completed, add the task under consideration to a queue of tasks to be executed.

2. Execute the tasks in the execution queue, marking them completed once they finish performing their work.

3. Jump back to step 1 and repeat until all tasks in the graph have been completed.

To see how this works, let’s trace through a small execution of our dashboard pipeline

(figure 1.4). On our first loop through the steps of our algorithm, we see that the clean

and push tasks still depend on upstream tasks that have not yet been completed. As

such, the dependencies of these tasks have not been satisfied, so at this point they can’t

be added to the execution queue. However, the fetch task does not have any incoming

edges, meaning that it does not have any unsatisfied upstream dependencies and can

therefore be added to the execution queue.

After completing the fetch task, we can start the second loop by examining the

dependencies of the clean and push tasks. Now we see that the clean task can be exe-

cuted as its upstream dependency (the fetch task) has been completed. As such, we can

add the task to the execution queue. The push task can’t be added to the queue, as it

depends on the clean task, which we haven’t run yet.

In the third loop, after completing the clean task, the push task is finally ready for

execution as its upstream dependency on the clean task has now been satisfied. As a

result, we can add the task to the execution queue. After the push task has finished

executing, we have no more tasks left to execute, thus finishing the execution of the

overall pipeline.
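The loop traced above can be sketched in a few lines of Python. This is a minimal illustration of the conceptual algorithm, not Airflow’s scheduler; the `run_pipeline` function, the `upstream` dictionary layout, and the task names are assumptions introduced for this example.

```python
def run_pipeline(upstream, run_task):
    """Run tasks in dependency order, following the three conceptual steps."""
    completed = set()
    loop = 0
    while len(completed) < len(upstream):
        loop += 1
        # Step 1: queue every open task whose upstream tasks are all completed.
        queue = [
            task for task in upstream
            if task not in completed and upstream[task] <= completed
        ]
        if not queue:
            raise ValueError("no runnable tasks left: the graph contains a cycle")
        # Step 2: execute the queued tasks and mark them completed.
        for task in queue:
            run_task(task)
            completed.add(task)
            print(f"loop {loop}: ran {task}")
        # Step 3: jump back to step 1 until every task has been completed.


# Each task maps to the set of tasks it depends on.
weather_dag = {
    "fetch_forecast": set(),
    "clean_forecast": {"fetch_forecast"},
    "push_to_dashboard": {"clean_forecast"},
}
order = []
run_pipeline(weather_dag, order.append)
```

For the weather pipeline this takes three loops, running the fetch, clean, and push tasks in that order. The explicit check for an empty queue is where a circular dependency would surface as the deadlock described earlier.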

1.1.3 Pipeline graphs vs. sequential scripts

Although the graph representation of a pipeline provides an intuitive overview of the

tasks in the pipeline and their dependencies, you may find yourself wondering why we

wouldn’t just use a simple script to run this linear chain of three steps. To illustrate

some advantages of the graph-based approach, let’s jump to a slightly bigger example.

In this new use case, we’ve been approached by the owner of an umbrella company,

who was inspired by our weather dashboard and would like to try to use machine

learning (ML) to increase the efficiency of their operation. To do so, the company

owner would like us to implement a data pipeline that creates an ML model correlat-

ing umbrella sales with weather patterns. This model can then be used to predict how

much demand there will be for the company’s umbrellas in the coming weeks, depend-

ing on the weather forecasts for those weeks (figure 1.5).

Figure 1.4 Using the DAG structure to execute tasks in the data pipeline in the correct order: depicts each task’s state during each of the loops through the algorithm, demonstrating how this leads to the completed execution of the pipeline (end state). Loop 1: the fetch task is ready for execution, as it has no unsatisfied dependencies. Loop 2: the fetch task has finished execution and the clean task is now ready for execution, as its upstream dependency is satisfied; the push task is not ready for execution yet, as it still has unsatisfied dependencies. Loop 3: the clean task has finished execution and the push task is now ready for execution. End state: all three tasks are completed.

To build a pipeline for training the ML model, we need to implement something like the following steps:

1. Prepare the sales data by doing the following:

   – Fetching the sales data from the source system

   – Cleaning/transforming the sales data to fit requirements

2. Prepare the weather data by doing the following:

   – Fetching the weather forecast data from an API

   – Cleaning/transforming the weather data to fit requirements

3. Combine the sales and weather data sets to create the combined data set that can be used as input for creating a predictive ML model.

4. Train the ML model using the combined data set.

5. Deploy the ML model so that it can be used by the business.

This pipeline can be represented using the same graph-based representation that we used before, by drawing tasks as nodes and data dependencies between tasks as edges.

One important difference from our previous example is that the first steps of this pipeline (fetching and cleaning the weather/sales data) are in fact independent of each other, as they involve two separate data sets. This is clearly illustrated by the two separate branches in the graph representation of the pipeline (figure 1.6). If we apply our graph execution algorithm, these branches can be executed in parallel, making better use of available resources and potentially decreasing the running time of the pipeline compared to executing the tasks sequentially.
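To see where the parallelism comes from, the sketch below groups the tasks of the umbrella pipeline into “waves” of tasks whose upstream dependencies are all satisfied at the same time. The `execution_waves` helper and the task names are hypothetical; the point is only that tasks within one wave do not depend on each other and could therefore run side by side.

```python
def execution_waves(upstream):
    """Group tasks into waves; tasks within a wave do not depend on each other."""
    completed, waves = set(), []
    while len(completed) < len(upstream):
        wave = sorted(
            task for task in upstream
            if task not in completed and upstream[task] <= completed
        )
        if not wave:
            raise ValueError("the graph contains a cycle")
        waves.append(wave)
        completed.update(wave)
    return waves


# Each task maps to the set of tasks it depends on.
umbrella_dag = {
    "fetch_weather": set(),
    "clean_weather": {"fetch_weather"},
    "fetch_sales": set(),
    "clean_sales": {"fetch_sales"},
    "join_datasets": {"clean_weather", "clean_sales"},
    "train_model": {"join_datasets"},
    "deploy_model": {"train_model"},
}
for number, wave in enumerate(execution_waves(umbrella_dag), start=1):
    print(number, wave)
```

The seven tasks collapse into five waves: both fetch tasks can run together, as can both clean tasks, while the join, train, and deploy tasks remain strictly sequential.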

Figure 1.5 Overview of the umbrella demand use case, in which historical weather and sales data are used to train a model that predicts future sales demands depending on weather forecasts. (Data from the weather API and the sales data are each fetched and cleaned, the data sets are combined, and a model is trained to predict umbrella sales.)

Figure 1.6 Independence between sales and weather tasks in the graph representation of the data pipeline for the umbrella demand forecast model. The two sets of fetch/cleaning tasks are independent as they involve two different data sets (the weather and sales data sets). This independence is indicated by the lack of edges between the two sets of tasks. (Fetch weather forecast → Clean forecast data and Fetch sales data → Clean sales data both feed into Join data sets → Train ML model → Deploy ML model.)

Another useful property of the graph-based representation is that it clearly separates

pipelines into small incremental tasks rather than having one monolithic script or

process that does all the work. Although having a single monolithic script may not ini-

tially seem like that much of a problem, it can introduce some inefficiencies when

tasks in the pipeline fail, as we would have to rerun the entire script. In contrast, in the

graph representation, we need only to rerun any failing tasks (and any downstream

dependencies).
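As a small sketch of what “any failing tasks (and any downstream dependencies)” means in practice, the following code walks the graph from a failed task to collect everything that would need to be rerun. The `tasks_to_rerun` helper and the task names are assumptions made for illustration.

```python
from collections import deque


def tasks_to_rerun(upstream, failed_task):
    """Return the failed task plus every task that (indirectly) depends on it."""
    # Invert the mapping: for each task, which tasks consume its output?
    downstream = {task: set() for task in upstream}
    for task, deps in upstream.items():
        for dep in deps:
            downstream[dep].add(task)

    to_rerun, pending = {failed_task}, deque([failed_task])
    while pending:
        for child in downstream[pending.popleft()]:
            if child not in to_rerun:
                to_rerun.add(child)
                pending.append(child)
    return to_rerun


# Each task maps to the set of tasks it depends on.
umbrella_dag = {
    "fetch_weather": set(),
    "clean_weather": {"fetch_weather"},
    "fetch_sales": set(),
    "clean_sales": {"fetch_sales"},
    "join_datasets": {"clean_weather", "clean_sales"},
    "train_model": {"join_datasets"},
    "deploy_model": {"train_model"},
}
print(sorted(tasks_to_rerun(umbrella_dag, "clean_sales")))
```

If cleaning the sales data fails, only that task and the join, train, and deploy tasks need to run again; the weather branch and the sales fetch keep their results.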

1.1.4 Running pipelines using workflow managers

Of course, the challenge of running graphs of dependent tasks is hardly a new prob-

lem in computing. Over the years, many so-called “workflow management” solutions

have been developed to tackle this problem, which generally allow you to define and

execute graphs of tasks as workflows or pipelines.

Some well-known workflow managers you may have heard of include those listed in table 1.1.

Table 1.1 Overview of several well-known workflow managers and their key characteristics.

Name | Originated at (a) | Workflows defined in | Written in | Scheduling | Backfilling | User interface (b) | Installation platform | Horizontally scalable
Airflow | Airbnb | Python | Python | Yes | Yes | Yes | Anywhere | Yes
Argo | Applatix | YAML | Go | Third party (c) | | Yes | Kubernetes | Yes
Azkaban | LinkedIn | YAML | Java | Yes | No | Yes | Anywhere |
Conductor | Netflix | JSON | Java | No | | Yes | Anywhere | Yes
Luigi | Spotify | Python | Python | No | Yes | Yes | Anywhere | Yes
Make | | Custom DSL | C | No | No | No | Anywhere | No
Metaflow | Netflix | Python | Python | No | | No | Anywhere | Yes
Nifi | NSA | UI | Java | Yes | No | Yes | Anywhere | Yes
Oozie | | XML | Java | Yes | Yes | Yes | Hadoop | Yes

a. Some tools were originally created by (ex-)employees of a company; however, all tools are open sourced and not represented by one single company.

b. The quality and features of user interfaces vary widely.

c. https://github.com/bitphy/argo-cron.

Although each of these workflow managers has its own strengths and weaknesses, they all provide similar core functionality that allows you to define and run pipelines containing multiple tasks with dependencies.

One of the key differences between these tools is how they define their workflows. For example, tools such as Oozie use static (XML) files to define workflows, which provides legible workflows but limited flexibility. Other solutions such as Luigi and Airflow allow you to define workflows as code, which provides greater flexibility but can be more challenging to read and test (depending on the coding skills of the person implementing the workflow).

Other key differences lie in the extent of features provided by the workflow man-

ager. For example, tools such as Make and Luigi do not provide built-in support for

scheduling workflows, meaning that you’ll need an extra tool like Cron if you want to

run your workflow on a recurring schedule. Other tools may provide extra functional-

ity such as scheduling, monitoring, user-friendly web interfaces, and so on built into

the platform, meaning that you don’t have to stitch together multiple tools yourself

to get these features.

All in all, picking the right workflow management solution for your needs will

require some careful consideration of the key features of the different solutions and

how they fit your requirements. In the next section, we’ll dive into Airflow—the focus

of this book—and explore several key features that make it particularly suited for

handling data-oriented workflows or pipelines.

1.2 Introducing Airflow

In this book, we focus on Airflow, an open source solution for developing and moni-

toring workflows. In this section, we’ll provide a helicopter view of what Airflow does,

after which we’ll jump into a more detailed examination of whether it is a good fit for

your use case.

1.2.1 Defining pipelines flexibly in (Python) code

Similar to other workflow managers, Airflow allows you to define pipelines or work-

flows as DAGs of tasks. These graphs are very similar to the examples sketched in the

previous section, with tasks being defined as nodes in the graph and dependencies as

directed edges between the tasks.

In Airflow, you define your DAGs using Python code in DAG files, which are essen-

tially Python scripts that describe the structure of the corresponding DAG. As such,

each DAG file typically describes the set of tasks for a given DAG and the dependen-

cies between the tasks, which are then parsed by Airflow to identify the DAG structure

(figure 1.7). Other than this, DAG files typically contain some additional metadata

about the DAG telling Airflow how and when it should be executed, and so on. We’ll

dive into this scheduling more in the next section.

One advantage of defining Airflow DAGs in Python code is that this programmatic

approach provides you with a lot of flexibility for building DAGs. For example, as we

will see later in this book, you can use Python code to dynamically generate optional

tasks depending on certain conditions or even generate entire DAGs based on exter-

nal metadata or configuration files. This flexibility gives a great deal of customization

in how you build your pipelines, allowing you to fit Airflow to your needs for building

arbitrarily complex pipelines.

In addition to this flexibility, another advantage of Airflow’s Python foundation is that

tasks can execute any operation that you can implement in Python. Over time, this has

led to the development of many Airflow extensions that enable you to execute tasks

across a wide variety of systems, including external databases, big data technologies,

and various cloud services, allowing you to build complex data pipelines bringing

together data processes across many different systems.

1.2.2 Scheduling and executing pipelines

Once you’ve defined the structure of your pipeline(s) as DAG(s), Airflow allows you

to define a schedule interval for each DAG, which determines exactly when your pipe-

line is run by Airflow. This way, you can tell Airflow to execute your DAG every hour,

every day, every week, and so on, or even use more complicated schedule intervals

based on Cron-like expressions.

To see how Airflow executes your DAGs, let’s briefly look at the overall process

involved in developing and running Airflow DAGs. At a high level, Airflow is orga-

nized into three main components (figure 1.8):

• The Airflow scheduler: Parses DAGs, checks their schedule interval, and (if the DAGs’ schedule has passed) starts scheduling the DAGs’ tasks for execution by passing them to the Airflow workers.

• The Airflow workers: Pick up tasks that are scheduled for execution and execute them. As such, the workers are responsible for actually “doing the work.”

• The Airflow webserver: Visualizes the DAGs parsed by the scheduler and provides the main interface for users to monitor DAG runs and their results.

Figure 1.7 Airflow pipelines are defined as DAGs using Python code in DAG files. Each DAG file typically defines one DAG, which describes the different tasks and their dependencies. Besides this, the DAG also defines a schedule interval that determines when the DAG is executed by Airflow. (The figure shows a DAG file in Python describing a pipeline of four tasks with the schedule interval @daily. Each node represents a task/operation we want to run, and each edge a dependency between tasks, for example indicating that task 3 must run before task 4.)

The heart of Airflow is arguably the scheduler, as this is where most of the magic hap-

pens that determines when and how your pipelines are executed. At a high level, the

scheduler runs through the following steps (figure 1.9):

1. Once users have written their workflows as DAGs, the files containing these DAGs are read by the scheduler to extract the corresponding tasks, dependencies, and schedule interval of each DAG.

2. For each DAG, the scheduler then checks whether the schedule interval for the DAG has passed since the last time it was read. If so, the tasks in the DAG are scheduled for execution.

3. For each scheduled task, the scheduler then checks whether the dependencies (= upstream tasks) of the task have been completed. If so, the task is added to the execution queue.

4. The scheduler waits for several moments before starting a new loop by jumping back to step 1.

The astute reader might have noticed that the steps followed by the scheduler are, in fact, very similar to the algorithm introduced in section 1.1. This is not by accident, as Airflow is essentially following the same steps, adding some extra logic on top to handle its scheduling logic.

Figure 1.8 Overview of the main components involved in Airflow (e.g., the Airflow webserver, scheduler, and workers). (Users write data workflows in Python as Airflow DAGs, stored as DAG files in the DAG folder. The scheduler reads the DAGs, stores serialized DAGs in the Airflow metastore (database), and schedules tasks onto a queue. The workers execute tasks and store task results in the metastore. The webserver visualizes DAGs and task results, which users use to monitor DAG runs and their results.)

Once tasks have been queued for execution, they are picked up by a pool of Air-

flow workers that execute tasks in parallel and track their results. These results are

communicated to Airflow’s metastore so that users can track the progress of tasks and

view their logs using the Airflow web interface (provided by the Airflow webserver).
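The scheduler steps can be mimicked with a toy simulation. The sketch below is a deliberately simplified model under stated assumptions: DAGs are plain dictionaries that have already been parsed (step 1), the `scheduler_pass` function and all field names are hypothetical, and a simple loop stands in for the workers. It is not how Airflow’s scheduler is implemented.

```python
from datetime import datetime, timedelta


def scheduler_pass(dags, now, queue):
    """One pass through steps 2 and 3 for DAGs that have already been parsed."""
    for dag in dags:
        # Step 2: start a new run if a full schedule interval has passed.
        if now - dag["last_run"] >= dag["interval"]:
            dag["last_run"] = now
            dag["scheduled"] = set(dag["upstream"])
            dag["completed"] = set()
        # Step 3: queue scheduled tasks whose upstream tasks have completed.
        for task in sorted(dag["scheduled"]):
            if dag["upstream"][task] <= dag["completed"]:
                queue.append((dag["dag_id"], task))
                dag["scheduled"].discard(task)


dag = {
    "dag_id": "weather_dashboard",
    "interval": timedelta(days=1),
    "last_run": datetime(2024, 1, 1),
    "upstream": {"fetch": set(), "clean": {"fetch"}, "push": {"clean"}},
    "scheduled": set(),
    "completed": set(),
}
queue, executed = [], []
now = datetime(2024, 1, 2)  # one full interval after the last run

for _ in range(3):  # step 4 would wait between passes; here we simply loop
    scheduler_pass([dag], now, queue)
    while queue:  # stand-in for the workers picking up queued tasks
        _, task = queue.pop(0)
        dag["completed"].add(task)
        executed.append(task)

print(executed)
```

Each pass queues only the tasks whose upstream tasks are done, so the fetch, clean, and push tasks are executed over three consecutive passes.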

1.2.3 Monitoring and handling failures

In addition to scheduling and executing DAGs, Airflow also provides an extensive web

interface that can be used for viewing DAGs and monitoring the results of DAG runs.

After you log in (figure 1.10), the main page provides an extensive overview of the dif-

ferent DAGs with summary views of their recent results (figure 1.11).

For example, the graph view of an individual DAG provides a clear overview of the

DAG’s tasks and dependencies (figure 1.12), similar to the schematic overviews we’ve

been drawing in this chapter. This view is particularly useful for viewing the structure

of a DAG (providing detailed insight into dependencies between tasks), and for view-

ing the results of individual DAG runs.

Figure 1.9 Schematic overview of the process involved in developing and executing pipelines as DAGs using Airflow. (1. User writes workflow as DAG in a Python DAG file. 2. Airflow scheduler parses DAG and schedules tasks according to DAG schedule, accounting for dependencies between tasks: it reads DAGs from files (tasks, dependencies + schedule interval); if the DAG schedule has passed, it schedules the DAG tasks; for each scheduled task, it checks the task dependencies and, if the dependencies of the task are satisfied, adds the task to the execution queue; it then waits X seconds before repeating. 3. Airflow workers execute scheduled tasks and store task results in the Airflow metastore (database). 4. User monitors execution + task results using the web interface, which retrieves DAGs and task results from the metastore.)

Figure 1.10 The login page for the Airflow web interface, which asks for your username and password. In the code examples accompanying this book, a default user “admin” is provided with the password “admin.”

Figure 1.11 The main page of Airflow’s web interface, showing an overview of the available DAGs and their recent results (names of registered workflows, workflow schedules, and state of workflow tasks from recent runs)

Besides this graph view, Airflow also provides a detailed tree view that shows all run-

ning and historical runs for the corresponding DAG (figure 1.13). This is arguably the

most powerful view provided by the web interface, as it gives you a quick overview of

how a DAG has performed over time and allows you to dig into failing tasks to see

what went wrong.

By default, Airflow can handle failures in tasks by retrying them a couple of times

(optionally with some wait time in between), which can help tasks recover from any

intermittent failures. If retries don’t help, Airflow will record the task as being failed,

optionally notifying you about the failure if configured to do so. Debugging task fail-

ures is pretty straightforward, as the tree view allows you to see which tasks failed and

dig into their logs. The same view also enables you to clear the results of individual

tasks to rerun them (together with any tasks that depend on that task), allowing you to

easily rerun any tasks after you make changes to their code.
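The retry behavior can be pictured with a small, self-contained sketch. In Airflow itself, retries are configured on tasks (through the `retries` and `retry_delay` task arguments) rather than coded by hand; the `run_with_retries` helper and the `flaky_fetch` task below are hypothetical and only illustrate the idea.

```python
import time


def run_with_retries(task, retries=2, retry_delay=0.0):
    """Run `task`, retrying up to `retries` times before recording a failure."""
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            return "success", task(), attempt
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt <= retries:
                time.sleep(retry_delay)  # optional wait before the next attempt
    return "failed", None, attempt


calls = {"count": 0}


def flaky_fetch():
    """Hypothetical task that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("weather API unavailable")
    return "forecast"


result = run_with_retries(flaky_fetch)
print(result)
```

Here the intermittent failure is absorbed by the retries and the task succeeds on its third attempt; a task that kept failing would end up recorded as failed once the retries were exhausted.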

Figure 1.12 The graph view in Airflow’s web interface, showing an overview of the tasks in an individual DAG and the dependencies between these tasks

1.2.4 Incremental loading and backfilling

One powerful feature of Airflow’s scheduling semantics is that the schedule intervals not only trigger DAGs at specific time points (similar to, for example, Cron), but also provide details about the last and (expected) next schedule intervals. This essentially allows you to divide time into discrete intervals (e.g., every day, week, etc.), and run your DAG for each of these intervals.¹

This property of Airflow’s schedule intervals is invaluable for implementing effi-

cient data pipelines, as it allows you to build incremental data pipelines. In these

incremental pipelines, each DAG run processes only data for the corresponding time

slot (the data’s delta) instead of having to reprocess the entire data set every time.

Especially for larger data sets, this can provide significant time and cost benefits by

avoiding expensive recomputation of existing results.

Schedule intervals become even more powerful when combined with the concept

of backfilling, which allows you to execute a new DAG for historical schedule intervals

that occurred in the past. This feature allows you to easily create (or backfill) new data

sets with historical data simply by running your DAG for these past schedule intervals.

Moreover, by clearing the results of past runs, you can also use this Airflow feature to

easily rerun any historical tasks if you make changes to your task code, allowing you

to easily reprocess an entire data set when needed.
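The idea of discrete intervals can be sketched without Airflow. In the example below, `daily_intervals`, `process_interval`, and the sample `events` are all hypothetical; the sketch only shows how one incremental step, applied once per interval, also covers backfilling of past intervals.

```python
from datetime import date, timedelta


def daily_intervals(start, end):
    """Yield (interval_start, interval_end) pairs covering [start, end)."""
    current = start
    while current < end:
        yield current, current + timedelta(days=1)
        current += timedelta(days=1)


def process_interval(events, interval_start, interval_end):
    """Incremental step: only touch events that fall inside the interval."""
    return [e for e in events if interval_start <= e["day"] < interval_end]


events = [
    {"day": date(2024, 1, 1), "sales": 3},
    {"day": date(2024, 1, 2), "sales": 5},
    {"day": date(2024, 1, 3), "sales": 2},
]

# Backfilling: run the same incremental step once per past interval.
results = {
    start: process_interval(events, start, end)
    for start, end in daily_intervals(date(2024, 1, 1), date(2024, 1, 4))
}
for start, rows in results.items():
    print(start, rows)
```

Each run sees only the data of its own interval, so processing a new day never requires touching the earlier days again, and rerunning a single past day is just another call with that day’s boundaries.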

¹ If this sounds a bit abstract to you now, don’t worry, as we provide more detail on these concepts later in the book.

Figure 1.13 Airflow’s tree view, showing the results of multiple runs of the umbrella sales model DAG (most recent + historical runs). The columns show the status of one execution of the DAG and the rows show the status of all executions of a single task. Colors (which you can see in the e-book version) indicate the result of the corresponding task. Users can also click on the task “squares” for more details about a given task instance, or to reset the state of a task so that it can be rerun by Airflow, if desired.

1.3 When to use Airflow

After this brief introduction to Airflow, we hope you’re sufficiently enthusiastic about

getting to know Airflow and learning more about its key features. However, before

going any further, we’ll first explore several reasons you might want to choose to work

with Airflow (as well as several reasons you might not), to ensure that Airflow is

indeed the best fit for you.

1.3.1 Reasons to choose Airflow

In the previous sections, we’ve already described several key features that make Airflow ideal for implementing batch-oriented data pipelines. In summary, these include the following:

• The ability to implement pipelines using Python code allows you to create arbitrarily complex pipelines using anything you can dream up in Python.

• The Python foundation of Airflow makes it easy to extend and add integrations with many different systems. In fact, the Airflow community has already developed a rich collection of extensions that allow Airflow to integrate with many different types of databases, cloud services, and so on.

• Rich scheduling semantics allow you to run your pipelines at regular intervals and build efficient pipelines that use incremental processing to avoid expensive recomputation of existing results.

• Features such as backfilling enable you to easily (re)process historical data, allowing you to recompute any derived data sets after making changes to your code.

• Airflow’s rich web interface provides an easy view for monitoring the results of your pipeline runs and debugging any failures that may have occurred.

An additional advantage of Airflow is that it is open source, which means that you can build your work on Airflow without vendor lock-in. Managed Airflow solutions are also available from several companies (should you desire some technical support), giving you a lot of flexibility in how you run and manage your Airflow installation.

1.3.2 Reasons not to choose Airflow

Although Airflow has many rich features, several of Airflow’s design choices may make it less suitable for certain cases. For example, some use cases that are not a good fit for Airflow include the following:

• Handling streaming pipelines, as Airflow is primarily designed to run recurring or batch-oriented tasks, rather than streaming workloads.

• Implementing highly dynamic pipelines, in which tasks are added/removed between every pipeline run. Although Airflow can implement this kind of dynamic behavior, the web interface will only show tasks that are still defined in the most recent version of the DAG. As such, Airflow favors pipelines that do not change in structure every time they run.

• Teams with little or no (Python) programming experience, as implementing DAGs in Python can be daunting with little Python experience. In such teams, using a workflow manager with a graphical interface (such as Azure Data Factory) or a static workflow definition may make more sense.

• Similarly, Python code in DAGs can quickly become complex for larger use cases. As such, implementing and maintaining Airflow DAGs require proper engineering rigor to keep things maintainable in the long run.

Also, Airflow is primarily a workflow/pipeline management platform and does not

(currently) include more extensive features such as maintaining data lineages, data

versioning, and so on. Should you require these features, you’ll probably need to look

at combining Airflow with other specialized tools that provide those capabilities.

1.4 The rest of this book

By now you should (hopefully) have a good idea of what Airflow is and how its fea-

tures can help you implement and run data pipelines. In the remainder of this book,

we’ll begin by introducing the basic components of Airflow that you need to be famil-

iar with to start building your own data pipelines. These first few chapters should be

broadly applicable and appeal to a wide audience. For these chapters, we expect you

to have intermediate experience with programming in Python (~one year of experi-

ence), meaning that you should be familiar with basic concepts such as string format-

ting, comprehensions, args/kwargs, and so on. You should also be familiar with the

basics of the Linux terminal and have a basic working knowledge of databases (includ-

ing SQL) and different data formats.

After this introduction, we’ll dive deeper into more advanced features of Airflow

such as generating dynamic DAGs, implementing your own operators, running con-

tainerized tasks, and so on. These chapters will require some more understanding of

the involved technologies, including writing your own Python classes, basic Docker

concepts, file formats, and data partitioning. We expect this second part to be of spe-

cial interest to the data engineers in the audience.

Finally, several chapters toward the end of the book focus on topics surrounding

the deployment of Airflow, including deployment patterns, monitoring, security, and

cloud architectures. We expect these chapters to be of special interest for people

interested in rolling out and managing Airflow deployments, such as system adminis-

trators and DevOps engineers.

Summary

Data pipelines can be represented as DAGs, which clearly define tasks and their

dependencies. These graphs can be executed efficiently, taking advantage of

any parallelism inherent in the dependency structure.


Although many workflow managers have been developed over the years for
executing graphs of tasks, Airflow has several key features that make it
uniquely suited for implementing efficient, batch-oriented data pipelines.

Airflow consists of three core components: the webserver, the scheduler, and

the worker processes, which work together to schedule tasks from your data

pipelines and help you monitor their results.

Anatomy of

an Airflow DAG

In the previous chapter, we learned why working with data and the many tools in the

data landscape is not easy. In this chapter, we get started with Airflow and check out

an example workflow that uses basic building blocks found in many workflows.

It helps to have some Python experience when starting with Airflow since work-

flows are defined in Python code. The gap in learning the basics of Airflow is not

that big. Generally, getting the basic structure of an Airflow workflow up and run-

ning is easy. Let’s dig into a use case of a rocket enthusiast to see how Airflow might

help him.

This chapter covers

Running Airflow on your own machine

Writing and running your first workflow

Examining the first view at the Airflow interface

Handling failed tasks in Airflow


2.1 Collecting data from numerous sources

Rockets are one of humanity’s engineering marvels, and every rocket launch attracts

attention all around the world. In this chapter, we cover the life of a rocket enthusiast

named John who tracks and follows every single rocket launch. The news about rocket

launches is found in many news sources that John keeps track of, and, ideally, John

would like to have all his rocket news aggregated in a single location. John recently

picked up programming and would like to have some sort of automated way to collect

information of all rocket launches and eventually some sort of personal insight into

the latest rocket news. To start small, John decided to first collect images of rockets.

2.1.1 Exploring the data

For the data, we make use of the Launch Library 2 (https://thespacedevs.com/llapi),

an online repository of data about both historical and future rocket launches from

various sources. It is a free and open API for anybody on the planet (subject to rate

limits).

John is currently only interested in upcoming rocket launches. Luckily, the Launch

Library provides exactly the data he is looking for (https://ll.thespacedevs.com/2.0.0/

launch/upcoming). It provides data about upcoming rocket launches, together with

URLs of where to find images of the respective rockets. Here’s a snippet of the data

this URL returns.

Listing 2.1 Example curl request and response to the Launch Library API

$ curl -L "https://ll.thespacedevs.com/2.0.0/launch/upcoming"
{
  ...
  "results": [
    {
      "id": "528b72ff-e47e-46a3-b7ad-23b2ffcec2f2",
      "url": "https://.../528b72ff-e47e-46a3-b7ad-23b2ffcec2f2/",
      "launch_library_id": 2103,
      "name": "Falcon 9 Block 5 | NROL-108",
      "net": "2020-12-19T14:00:00Z",
      "window_end": "2020-12-19T17:00:00Z",
      "window_start": "2020-12-19T14:00:00Z",
      "image": "https://spacelaunchnow-prod-east.nyc3.digitaloceanspaces.com/media/launch_images/falcon2520925_image_20201217060406.jpeg",
      "infographic": ".../falcon2520925_infographic_20201217162942.png",
      ...
    },
    {
      "id": "57c418cc-97ae-4d8e-b806-bb0e0345217f",
      "url": "https://.../57c418cc-97ae-4d8e-b806-bb0e0345217f/",
      "launch_library_id": null,
      "name": "Long March 8 | XJY-7 & others",
      "net": "2020-12-22T04:29:00Z",
      "window_end": "2020-12-22T05:03:00Z",
      "window_start": "2020-12-22T04:29:00Z",
      "image": "https://.../long2520march_image_20201216110501.jpeg",
      "infographic": null,
      ...
    },
    ...
  ]
}

In listing 2.1, the curl command inspects the URL response from the command
line. The response is a JSON document, as you can see by the structure: the
square brackets indicate a list, and all values within one pair of curly
braces refer to one single rocket launch. For each launch we see information
such as the rocket ID and the start and end time of the rocket launch window,
as well as a URL to an image of the launching rocket.

As you can see, the data is in JSON format and provides rocket launch information,

and for every launch, there’s information about the specific rocket, such as ID, name,

and the image URL. This is exactly what John needs, and he initially draws the plan in

figure 2.1 to collect the images of upcoming rocket launches (e.g., to point his screen-

saver to the directory holding these images):

Based on the example in figure 2.1, we can see that, at the end of the day, John’s goal

is to have a directory filled with rocket images, such as the image in figure 2.2 of the

Ariane 5 ECA rocket.
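To make the structure of the response concrete, the following minimal sketch parses a small, hand-written sample shaped like the response in listing 2.1 and pulls out the two fields John cares about. The sample string, the example.org image URL, and the helper name summarize_launches are illustrative assumptions, not part of the Launch Library API. The sketch also assumes that the image field can be null for some launches.

```python
import json

# Hand-written sample mimicking the shape of the response in listing 2.1.
# The image URL is a placeholder, not a real Launch Library URL.
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "id": "528b72ff-e47e-46a3-b7ad-23b2ffcec2f2",
      "name": "Falcon 9 Block 5 | NROL-108",
      "net": "2020-12-19T14:00:00Z",
      "image": "https://example.org/falcon.jpeg"
    },
    {
      "id": "57c418cc-97ae-4d8e-b806-bb0e0345217f",
      "name": "Long March 8 | XJY-7 & others",
      "net": "2020-12-22T04:29:00Z",
      "image": null
    }
  ]
}
"""


def summarize_launches(raw_json):
    """Return (name, image_url) pairs for every launch in the response."""
    launches = json.loads(raw_json)  # the JSON object becomes a Python dict
    return [
        (launch["name"], launch.get("image"))  # a JSON null becomes None
        for launch in launches["results"]      # "results" is a JSON list
    ]


for name, image_url in summarize_launches(SAMPLE_RESPONSE):
    print(f"{name}: {image_url or 'no image available'}")
```

Working on a small inline sample like this is a cheap way to explore the fields before wiring anything into a workflow.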

2.2 Writing your first Airflow DAG

John’s use case is nicely scoped, so let’s check out how to program his plan. It’s only a

few steps and, in theory, with some Bash-fu, you could work it out in a one-liner. So

why would we need a system like Airflow for this job?

The nice thing about Airflow is that we can split a large job, which consists of one

or more steps, into individual “tasks” that together form a DAG. Multiple tasks can be

run in parallel, and tasks can run different technologies. For example, we could first

run a Bash script and next run a Python script. We broke down John’s mental model

of his workflow into three logical tasks in Airflow in figure 2.3.
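The idea that tasks plus their dependencies determine a valid execution order can be sketched without Airflow at all. The example below uses graphlib from the Python standard library (Python 3.9 or later) rather than Airflow's scheduler, and the three task names are assumptions borrowed from the DAG developed later in this chapter.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "download_launches": set(),
    "get_pictures": {"download_launches"},
    "notify": {"get_pictures"},
}


def execution_order(deps):
    """Return one valid execution order for the given dependency graph."""
    return list(TopologicalSorter(deps).static_order())


print(execution_order(dependencies))
```

Because this graph is a simple chain, only one order is possible. With branches in the graph, several orders would be valid, and independent tasks could run in parallel.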

Figure 2.1 John’s mental model of downloading rocket pictures (diagram: John,
the Launch Library, the internet, John’s computer, and a notification system,
connected by the steps fetch next launches, save rocket launches, fetch rocket
pictures, save rocket pictures, and notify)

Figure 2.2 Example image of the Ariane 5 ECA rocket

Figure 2.3 John’s mental model mapped to tasks in Airflow

Why these three tasks, you might ask? Why not download the launches and
corresponding pictures in one single task? Or why not split them into five
tasks? After all, we have five arrows in John’s plan. These are all valid
questions to ask while developing a workflow, but the truth is, there’s no
right or wrong answer. There are several points to take into consideration,
though, and throughout this book we work out many of these use cases to get a
feeling for what is right and wrong. The code for this workflow is as follows.

Listing 2.2 DAG for downloading and processing rocket launch data

import json
import pathlib

import airflow
import requests
import requests.exceptions as requests_exceptions
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Instantiate a DAG object; this is the starting point of any workflow.
dag = DAG(
    dag_id="download_rocket_launches",            # The name of the DAG
    start_date=airflow.utils.dates.days_ago(14),  # The date at which the DAG should first start running
    schedule_interval=None,                       # At what interval the DAG should run
)

# Apply Bash to download the URL response with curl.
download_launches = BashOperator(
    task_id="download_launches",  # The name of the task
    bash_command="curl -o /tmp/launches.json -L 'https://ll.thespacedevs.com/2.0.0/launch/upcoming'",
    dag=dag,
)


# A Python function will parse the response and download all rocket pictures.
def _get_pictures():
    # Ensure directory exists
    pathlib.Path("/tmp/images").mkdir(parents=True, exist_ok=True)

    # Download all pictures in launches.json
    with open("/tmp/launches.json") as f:
        launches = json.load(f)
        image_urls = [launch["image"] for launch in launches["results"]]
        for image_url in image_urls:
            try:
                response = requests.get(image_url)
                image_filename = image_url.split("/")[-1]
                target_file = f"/tmp/images/{image_filename}"
                with open(target_file, "wb") as f:
                    f.write(response.content)
                print(f"Downloaded {image_url} to {target_file}")
            except requests_exceptions.MissingSchema:
                print(f"{image_url} appears to be an invalid URL.")
            except requests_exceptions.ConnectionError:
                print(f"Could not connect to {image_url}.")


# Call the Python function in the DAG with a PythonOperator.
get_pictures = PythonOperator(
    task_id="get_pictures",
    python_callable=_get_pictures,
    dag=dag,
)

notify = BashOperator(
    task_id="notify",
    bash_command='echo "There are now $(ls /tmp/images/ | wc -l) images."',
    dag=dag,
)

# Set the order of execution of tasks.
download_launches >> get_pictures >> notify

Let’s break down the workflow. The DAG is the starting point of any workflow. All

tasks within the workflow reference this DAG object so that Airflow knows which tasks

belong to which DAG.

Listing 2.3 Instantiating a DAG object

# The DAG class takes two required arguments.
dag = DAG(
    dag_id="download_rocket_launches",            # The name of the DAG displayed in the Airflow user interface (UI)
    start_date=airflow.utils.dates.days_ago(14),  # The datetime at which the workflow should first start running
    schedule_interval=None,
)

Note that (lowercase) dag is the name assigned to the instance of the
(uppercase) DAG class. The instance could have any name; you can name it
rocket_dag or whatever_name_you_like. We will reference the variable
(lowercase dag) in all operators, which tells Airflow which DAG the operator
belongs to.

Also note that we set schedule_interval to None. This means the DAG will not
run automatically. For now, you can trigger it manually from the Airflow UI.
We will get to scheduling in section 2.4.

Next, an Airflow workflow script consists of one or more operators, which
perform the actual work. In listing 2.4, we apply the BashOperator to run a
Bash command.

Listing 2.4 Instantiating a BashOperator to run a Bash command

download_launches = BashOperator(
    task_id="download_launches",  # The name of the task
    # The Bash command to execute
    bash_command="curl -o /tmp/launches.json -L 'https://ll.thespacedevs.com/2.0.0/launch/upcoming'",
    dag=dag,                      # Reference to the DAG variable
)

Each operator performs a single unit of work, and multiple operators together
form a workflow or DAG in Airflow. Operators run independently of each other,
although you can define the order of execution, which we call dependencies in
Airflow. After all, John’s workflow wouldn’t be useful if you first tried
downloading pictures while there is no data about the location of the
pictures. To make sure the tasks run in the correct order, we can set
dependencies between tasks.

Listing 2.5 Defining the order of task execution

download_launches >> get_pictures >> notify  # Arrows set the order of execution of tasks.

In Airflow, we can use the binary right shift operator (i.e., “rshift” [>>])
to define dependencies between tasks. This ensures the get_pictures task runs
only after download_launches has completed successfully, and the notify task
runs only after get_pictures has completed successfully.

NOTE In Python, the rshift operator (>>) is used to shift bits, which is a
common operation in, for example, cryptography libraries. In Airflow, there
is no use case for bit shifting, and the rshift operator was overridden to
provide a readable way to define dependencies between tasks.

2.2.1 Tasks vs. operators

You might wonder what the difference is between tasks and operators. After
all, they both execute a bit of code. In Airflow, operators have a single
responsibility: they exist to perform one single piece of work. Some
operators perform generic work, such as the BashOperator (used to run a Bash
script) or the PythonOperator (used to run a Python function); others have
more specific use cases, such as the EmailOperator (used to send an email) or
the SimpleHttpOperator (used to call an HTTP endpoint). Either way, they
perform a single piece of work.

The role of a DAG is to orchestrate the execution of a collection of
operators. That includes starting and stopping operators, starting
consecutive tasks once an operator is done, ensuring dependencies between
operators are met, and so on.

In this context and throughout the Airflow documentation, we see the terms
operator and task used interchangeably. From a user’s perspective, they refer
to the same thing, and the two are often substituted for each other in
discussions. Operators provide the implementation of a piece of work. Airflow
has a class called BaseOperator and many subclasses inheriting from the
BaseOperator, such as PythonOperator, EmailOperator, and OracleOperator.

There is a difference, though. Tasks in Airflow manage the execution of an
operator; they can be thought of as a small wrapper or manager around an
operator that ensures the operator executes correctly. The user can focus on
the work to be done by using operators, while Airflow ensures correct
execution of the work via tasks (figure 2.4).
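The note on the overridden rshift operator can be illustrated with a toy class. This is a minimal sketch of how any Python class can give >> its own meaning through the __rshift__ method. The ToyTask class is hypothetical and is not how Airflow's BaseOperator is actually implemented.

```python
class ToyTask:
    """Hypothetical stand-in for an operator; NOT Airflow's BaseOperator."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # Called for `self >> other`: record `other` as a downstream task.
        self.downstream.append(other)
        # Return the right-hand task so chains such as a >> b >> c work.
        return other


download_launches = ToyTask("download_launches")
get_pictures = ToyTask("get_pictures")
notify = ToyTask("notify")

# Evaluated left to right as (download_launches >> get_pictures) >> notify.
download_launches >> get_pictures >> notify

print([task.task_id for task in download_launches.downstream])
print([task.task_id for task in get_pictures.downstream])
```

Returning the right-hand operand from __rshift__ is the design choice that makes chaining possible: the result of the first >> becomes the left-hand side of the second.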

2.2.2 Running arbitrary Python code

Fetching the data for the next rocket launches was a single curl command in
Bash, which is easily executed with the BashOperator. However, parsing the
JSON result, selecting the image URLs from it, and downloading the respective
images require a bit more effort. Although all this is still possible in a
Bash one-liner, it’s often easier and more readable with a few lines of
Python or any other language of your choice. Since Airflow code is defined in
Python, it’s convenient to keep both the workflow and execution logic in the
same script. For downloading the rocket pictures, we implemented listing 2.6.

Listing 2.6 Running a Python function using the PythonOperator

# Python function to call
def _get_pictures():
    # Create pictures directory if it doesn't exist.
    pathlib.Path("/tmp/images").mkdir(parents=True, exist_ok=True)

    # Open the result from the previous task.
    with open("/tmp/launches.json") as f:
        launches = json.load(f)
        image_urls = [launch["image"] for launch in launches["results"]]
        for image_url in image_urls:
            try:
                # Download each image.
                response = requests.get(image_url)
                image_filename = image_url.split("/")[-1]
                target_file = f"/tmp/images/{image_filename}"
                # Store each image.
                with open(target_file, "wb") as f:
                    f.write(response.content)
                # Print to stdout; this will be captured in Airflow logs.
                print(f"Downloaded {image_url} to {target_file}")
            except requests_exceptions.MissingSchema:
                print(f"{image_url} appears to be an invalid URL.")
            except requests_exceptions.ConnectionError:
                print(f"Could not connect to {image_url}.")


# Instantiate a PythonOperator to call the Python function.
get_pictures = PythonOperator(
    task_id="get_pictures",
    python_callable=_get_pictures,  # Point to the Python function to execute.
    dag=dag,
)

Figure 2.4 DAGs and operators are used by Airflow users. Tasks are internal
components to manage operator state and display state changes (e.g.,
started/finished) to the user. (Diagram: a DAG containing three tasks, each
wrapping an operator.)
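The filename step inside _get_pictures (taking everything after the last slash of the image URL) is easy to check in isolation. The sketch below reproduces that step and adds a hedged variant that looks only at the path of the URL. Both function names and the query-string example URL are illustrative assumptions; the variant is not what the DAG in this chapter uses.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(image_url):
    """Same approach as _get_pictures: keep everything after the last slash."""
    return image_url.split("/")[-1]


def filename_from_url_path(image_url):
    """Variant that uses only the URL path, ignoring any query string."""
    return posixpath.basename(urlsplit(image_url).path)


plain_url = "https://host/RocketImages/Electron.jpg_1440.jpg"
query_url = "https://host/images/rocket.jpg?size=large"  # hypothetical URL

print(filename_from_url(plain_url))
print(filename_from_url(query_url))
print(filename_from_url_path(query_url))
```

For URLs without a query string both helpers agree. They differ only when something follows the path, in which case the simple split would carry the query string into the filename.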

The PythonOperator in Airflow is responsible for running any Python code.
Just like the BashOperator used before, this and all other operators require
a task_id. The task_id is referenced when running a task and displayed in the
UI. The use of a PythonOperator is always twofold:

1 We define the operator itself (get_pictures).
2 The python_callable argument points to a callable, typically a function
  (_get_pictures).

When the operator runs, the Python function is called and executed. Let’s
break it down. The basic usage of the PythonOperator always looks like
figure 2.5.

# PythonOperator callable
def _get_pictures():
    # do work here ...

# PythonOperator
get_pictures = PythonOperator(
    task_id="get_pictures",
    python_callable=_get_pictures,
    dag=dag
)

Figure 2.5 The python_callable argument in the PythonOperator points to a
function to execute.

Although not required, for convenience we keep the variable name get_pictures
equal to the task_id.

Listing 2.7 Ensures that the output directory exists and creates it if it doesn’t

# Ensure directory exists
pathlib.Path("/tmp/images").mkdir(parents=True, exist_ok=True)

The first step in the callable is to ensure the directory in which the images
will be stored exists, as shown in listing 2.7. Next, we open the result
downloaded from the Launch Library API and extract the image URLs for every
launch.

Listing 2.8 Extracts image URLs for every rocket launch

with open("/tmp/launches.json") as f:  # Open the rocket launches' JSON.
    launches = json.load(f)            # Read as a dict so we can mingle the data.
    # For every launch, fetch the element "image".
    image_urls = [launch["image"] for launch in launches["results"]]

Each image URL is called to download the image and save it in /tmp/images.

Listing 2.9 Downloads all images from the retrieved image URLs

for image_url in image_urls:  # Loop over all image URLs.
    try:
        response = requests.get(image_url)  # Get the image.
        # Get only the filename by selecting everything after the last /.
        # For example, https://host/RocketImages/Electron.jpg_1440.jpg
        # becomes Electron.jpg_1440.jpg.
        image_filename = image_url.split("/")[-1]
        target_file = f"/tmp/images/{image_filename}"  # Construct the target file path.
        with open(target_file, "wb") as f:             # Open target file handle.
            f.write(response.content)                  # Write image to file path.
        print(f"Downloaded {image_url} to {target_file}")  # Print result.
    # Catch and process potential errors.
    except requests_exceptions.MissingSchema:
        print(f"{image_url} appears to be an invalid URL.")
    except requests_exceptions.ConnectionError:
        print(f"Could not connect to {image_url}.")

2.3 Running a DAG in Airflow

Now that we have our basic rocket launch DAG, let’s get it up and running and
view it in the Airflow UI. A bare-minimum Airflow setup consists of three core
components: a scheduler, a webserver, and a database. In order to get Airflow
up and running, you can either install Airflow in your Python environment or
run a Docker container.

2.3.1 Running Airflow in a Python environment

There are several steps to installing and running Airflow as a Python package
from PyPI:

pip install apache-airflow

Make sure you install apache-airflow and not just airflow. After joining the
Apache Foundation in 2016, the PyPI airflow repository was renamed to
apache-airflow. Since many people were still installing airflow, the old
repository was not removed but kept as a dummy to provide everybody a message
pointing to the correct repository.

Some operating systems come with a Python installation. Running just
pip install apache-airflow will install Airflow in this “system” environment.
When working on Python projects, it is desirable to keep each project in its
own Python environment to create a reproducible set of Python packages and
avoid dependency clashes. Such environments are created with tools such as
these:

pyenv: https://github.com/pyenv/pyenv
Conda: https://docs.conda.io
virtualenv: https://virtualenv.pypa.io

After installing Airflow, start it by initializing the metastore (a database
in which all Airflow state is stored), creating a user, copying the rocket
launch DAG into the DAGs directory, and starting the scheduler and webserver:

airflow db init
airflow users create --username admin --password admin --firstname Anonymous --lastname Admin --role Admin --email admin@example.org
cp download_rocket_launches.py ~/airflow/dags/
airflow webserver
airflow scheduler

Note that the scheduler and webserver are both continuous processes that keep
your terminal open, so either run one in the background with
airflow webserver & or open a second terminal window to run the scheduler and
webserver separately. After you’re set up, go to http://localhost:8080 and
log in with username “admin” and password “admin” to view Airflow.

2.3.2 Running Airflow in Docker containers

Docker containers are also popular to create isolated environments to run a reproduc-

ible set of Python packages and avoid dependency clashes. However, Docker contain-

ers create an isolated environment on the operating system level, whereas Python

environments isolate only on the Python runtime level. As a result, you can create

Docker containers that contain not only a set of Python packages, but also other

dependencies such as database drivers or a GCC compiler. Throughout this book we

will demonstrate Airflow running in Docker containers in several examples.

Running Docker containers requires a Docker Engine to be installed on your

machine. You can then run Airflow in Docker with the following command.

Listing 2.10 Running Airflow in Docker

docker run \
  -ti \
  -p 8080:8080 \
  -v /path/to/dag/download_rocket_launches.py:/opt/airflow/dags/download_rocket_launches.py \
  --entrypoint=/bin/bash \
  --name airflow \
  apache/airflow:2.0.0-python3.8 \
  -c '( \
    airflow db init && \
    airflow users create --username admin --password admin --firstname Anonymous --lastname Admin --role Admin --email admin@example.org \
  ); \
  airflow webserver & \
  airflow scheduler \
  '

In listing 2.10, -p 8080:8080 exposes Airflow on host port 8080, -v mounts the
DAG file in the container, and apache/airflow:2.0.0-python3.8 is the Airflow
Docker image. The quoted command initializes the metastore in the container,
creates a user, and then starts the webserver and the scheduler.

NOTE If you’re familiar with Docker, you would probably argue it’s not
desirable to run multiple processes in a single Docker container as shown in
listing 2.10. The command is a single command, intended for demonstration
purposes to get up and running quickly. In a production setting, you should
run the Airflow webserver, scheduler, and metastore in separate containers,
as explained in detail in chapter 10.

It will download and run the Airflow Docker image apache/airflow. Once
running, you can view Airflow on http://localhost:8080 and log in with
username “admin” and password “admin”.

2.3.3 Inspecting the Airflow UI

The first view of Airflow you will see on http://localhost:8080 is the login
screen, shown in figure 2.6.

After logging in, you can inspect the download_rocket_launches DAG, as shown in fig-

ure 2.7.

This is the first glimpse of Airflow you will see. Currently, the only DAG is the

download_rocket_launches, which is available to Airflow in the DAGs directory.

There’s a lot of information on the main view, but let’s inspect the download_rocket

_launches DAG first. Click on the DAG name to open it and inspect the so-called

graph view (figure 2.8).

Figure 2.6 Airflow login view


This view shows us the structure of the DAG script provided to Airflow. Once placed in

the DAGs directory, Airflow will read the script and pull out the bits and pieces that

together form a DAG, so it can be visualized in the UI. The graph view shows us the

structure of the DAG, and how and in which order all tasks in the DAG are connected

and will be run. This is one of the views you will probably use the most while develop-

ing your workflows.

The state legend shows all colors you might see when running, so let’s see what

happens and run the DAG. First, the DAG needs to be “on” in order to be run; toggle

the button next to the DAG name for that. Next, click the Play button to run it.

Figure 2.7 Airflow home screen

Figure 2.8 Airflow graph view (callouts: DAG structure, operator types in
DAG, state legend, toggle DAG on/off, trigger DAG)

After triggering the DAG, it will start running and you will see the current
state of the workflow represented by colors (figure 2.9). Since we set
dependencies between our tasks, consecutive tasks only start running once the
previous tasks have been completed. Let’s check the result of the notify
task. In a real use case, you would probably want to send an email or, for
example, a Slack notification to inform John about the new images. For the
sake of simplicity, it now prints the number of downloaded images. Let’s
check the logs.

All task logs are collected in Airflow, so we can search in the UI for output or

potential issues in case of failure. Click on a completed notify task, and you will see a

pop-up with several options, as shown in figure 2.10.

Click on the top-center Log button to inspect the logs, as shown in figure 2.11. The

logs are quite verbose by default but display the number of downloaded images in

the log. Finally, we can open the /tmp/images directory and view them. When run-

ning in Docker, this directory only exists inside the Docker container and not on your

host system. You must therefore first get into the Docker container:

docker exec -it airflow /bin/bash

After that, you get a Bash terminal in the container and can view the images
in /tmp/images (figure 2.12).

2.4 Running at regular intervals

Rocket enthusiast John is happy now that he has a workflow up and running in Air-

flow, which he can trigger every now and then to collect the latest rocket pictures. He

can see the status of his workflow in the Airflow UI, which is already an improvement

compared to a script on the command line he was running before. But he still needs

to trigger his workflow by hand periodically, which could be automated. After all,

nobody likes doing repetitive tasks that computers are good at doing themselves.

Figure 2.9 Graph view displaying a running DAG


Figure 2.10 Task pop-up options

Figure 2.11 Print statement displayed in logs


Figure 2.12 Resulting rocket pictures (thumbnails of the downloaded image
files, e.g., electron_image_20190705175640.jpeg and
kuaizhou_image_20191027094423.jpeg)

In Airflow, we can schedule a DAG to run at certain intervals, for example
once an hour, day, or month. This is controlled on the DAG by setting the
schedule_interval argument.

Listing 2.11 Running a DAG once a day

dag = DAG(
    dag_id="download_rocket_launches",
    start_date=airflow.utils.dates.days_ago(14),
    schedule_interval="@daily",  # Airflow alias for 0 0 * * * (i.e., midnight)
)

Setting the schedule_interval to @daily tells Airflow to run this workflow
once a day so that John doesn’t have to trigger it manually once a day. This
behavior is best viewed in the tree view, as shown in figure 2.13.

The tree view is similar to the graph view but displays the graph structure
as it runs over time. An overview of the status of all runs of a single
workflow can be seen in figure 2.14.

The structure of the DAG is displayed to fit a “rows and columns” layout, specifically

the status of all runs of the specific DAG, where each column represents a single run

at some point in time.

When we set the schedule_interval to @daily, Airflow knew it had to run this
DAG once a day. Given the start_date provided to the DAG of 14 days ago, that
means the time from 14 days ago up to now can be divided into 14 equal
intervals of one day. Since both the start and end date of these 14 intervals
lie in the past, they will start running once we provide a schedule_interval
to Airflow. The semantics of the schedule interval and various ways to
configure it are covered in more detail in chapter 3.
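The arithmetic behind those 14 intervals can be sketched with plain datetime objects. This is a simplified model under stated assumptions, not Airflow's scheduler: the helper name daily_intervals and the fixed "now" are hypothetical, and the start date is assumed to fall exactly on midnight 14 days earlier.

```python
from datetime import datetime, timedelta


def daily_intervals(start_date, now):
    """Return the complete one-day (start, end) intervals between two datetimes."""
    one_day = timedelta(days=1)
    intervals = []
    interval_start = start_date
    # Only count intervals whose end already lies in the past (or is now).
    while interval_start + one_day <= now:
        intervals.append((interval_start, interval_start + one_day))
        interval_start += one_day
    return intervals


# Hypothetical "now"; the start date lies 14 days earlier, at midnight.
now = datetime(2021, 1, 15)
start_date = now - timedelta(days=14)

intervals = daily_intervals(start_date, now)
print(len(intervals))   # number of completed daily intervals
print(intervals[0])     # first interval
print(intervals[-1])    # most recent completed interval
```

In this model every completed interval corresponds to one column in the tree view, which is why a freshly scheduled DAG with a start date in the past shows a backlog of runs.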

2.5 Handling failing tasks

So far we’ve seen only green in the Airflow UI. But what happens if something fails?

It’s not uncommon for tasks to fail, which could be for a multitude of reasons (e.g., an

external service is down, network connectivity issues, or a broken disk).

Say, for example, at some point we experienced a network hiccup while getting
John’s rocket pictures. As a consequence, the Airflow task fails, and we see
the failing task in the Airflow UI. It would look like figure 2.15.

Figure 2.13 Airflow tree view (callouts: DAG structure, task state over time)

Figure 2.14 Relationship between graph view and tree view

The failed task would be displayed in red in both the graph and tree views,
because it could not get the images from the internet and therefore raised an
error. The successive notify task would not run at all because it depends on
the successful state of the get_pictures task. Such task instances are
displayed in orange. By default, all previous tasks must run successfully,
and any successive task of a failed task will not run.
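The default rule described here (a task runs only if everything before it succeeded) can be modeled in a few lines. The sketch below is a toy simulation of a linear chain, not Airflow's scheduling logic. The function run_chain is hypothetical, and the state name upstream_failed is used for tasks that never start because an earlier task failed.

```python
def run_chain(task_ids, failing):
    """Simulate a linear chain where a task runs only if all earlier tasks succeeded."""
    states = {}
    upstream_ok = True
    for task_id in task_ids:
        if not upstream_ok:
            states[task_id] = "upstream_failed"  # never started
        elif task_id in failing:
            states[task_id] = "failed"
            upstream_ok = False                  # blocks everything after it
        else:
            states[task_id] = "success"
    return states


tasks = ["download_launches", "get_pictures", "notify"]
states = run_chain(tasks, failing={"get_pictures"})
for task_id in tasks:
    print(f"{task_id}: {states[task_id]}")
```

The simulation mirrors the situation in the text: download_launches succeeds, get_pictures fails, and notify is never started.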

Let’s figure out the issue by inspecting the logs again. Open the logs of the
get_pictures task (figure 2.16).

Figure 2.15 Failure displayed in graph view and tree view

Figure 2.16 Stack trace of failed get_pictures task


In the stack trace, we uncover the potential cause of the issue:

urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection

object at 0x7f37963ce3a0>: Failed to establish a new connection: [Errno

-2] Name or service not known

This indicates urllib3 (an HTTP client library for Python, used by requests)
is trying to establish a connection but cannot. The “Name or service not
known” message means the hostname could not be resolved, which could hint at
a firewall rule blocking the connection or no internet connectivity. Assuming
we fixed the issue (e.g., plugged in the internet cable), let’s restart the
task.

NOTE It is unnecessary to restart the entire workflow. A nice feature of Air-

flow is that you can restart from the point of failure and onward, without hav-

ing to restart any previously succeeded tasks.

Click the failed task, and then click the Clear button in the pop-up (figure 2.17). It

will show you the tasks you’re about to clear, meaning you will reset the state of these

tasks and Airflow will rerun them, as shown in figure 2.18.

Click OK! and the failed task and its successive tasks will be cleared, as can be seen in

figure 2.19.

Figure 2.17 Click on a failed task

for options to clear it.

Figure 2.18 Clearing the state of get_pictures and successive tasks

Figure 2.19 Cleared tasks

displayed in graph view
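What clearing does to task state can be sketched as a small function over a linear chain. This is a toy model under stated assumptions, not Airflow's implementation: the helper clear_from is hypothetical, and None stands for "no state", meaning the task is eligible to be scheduled again.

```python
def clear_from(states, task_ids, failed_task):
    """Reset the given task and every task after it in a linear chain."""
    start = task_ids.index(failed_task)
    cleared = dict(states)       # copy, so the input is left untouched
    for task_id in task_ids[start:]:
        cleared[task_id] = None  # no state: eligible to run again
    return cleared


tasks = ["download_launches", "get_pictures", "notify"]
before = {
    "download_launches": "success",
    "get_pictures": "failed",
    "notify": "upstream_failed",
}

after = clear_from(before, tasks, "get_pictures")
for task_id in tasks:
    print(f"{task_id}: {after[task_id]}")
```

The previously succeeded download_launches task keeps its state, which matches the point made in the note above: only the failed task and its successors are rerun.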


Assuming the connectivity issues are resolved, the tasks will now run successfully and

make the whole tree view green (figure 2.20).

In any piece of software, there are many reasons for failure. In Airflow
workflows, sometimes failure is accepted, sometimes it is not, and sometimes
it is accepted only under certain conditions. The criteria for dealing with
failure can be configured on any level in the workflow and are covered in
more detail in chapter 4.

After clearing the failed tasks, Airflow will automatically rerun these
tasks. If all goes well, John will now have downloaded the rocket images
resulting from the failed tasks. Note that the called URL in the
download_launches task simply requests the next rocket launches, meaning it
will return the next rocket launches at the time of calling the API.
Incorporating the runtime context at which a DAG was run into your code is
covered in chapter 4.

Summary

Workflows in Airflow are represented in DAGs.

Operators represent a single unit of work.

Airflow contains an array of operators both for generic and specific types of

work.

The Airflow UI offers a graph view for viewing the DAG structure and tree view

for viewing DAG runs over time.

Failed tasks can be restarted anywhere in the DAG.

Figure 2.20 Successfully completed tasks after clearing failed tasks

Scheduling in Airflow

In the previous chapter, we explored Airflow’s UI and showed you how to define a

basic Airflow DAG and run it every day by defining a scheduled interval. In this

chapter, we will dive a bit deeper into the concept of scheduling in Airflow and

explore how this allows you to process data incrementally at regular intervals. First,

we’ll introduce a small use case focused on analyzing user events from our website

and explore how we can build a DAG to analyze these events at regular intervals.

Next, we’ll explore ways to make this process more efficient by taking an incremen-

tal approach to analyzing our data and understanding how this ties into Airflow’s

concept of execution dates. Finally, we’ll finish by showing how we can fill in past

gaps in our data set using backfilling and discussing some important properties of

proper Airflow tasks.

This chapter covers

Running DAGs at regular intervals

Constructing dynamic DAGs to process data

incrementally

Loading and reprocessing past data sets using

backfilling

Applying best practices for reliable tasks


3.1 An example: Processing user events

To understand how Airflow’s scheduling works, we’ll first consider a small example.

Imagine we have a service that tracks user behavior on our website and allows us to

analyze which pages users (identified by an IP address) accessed. For marketing

purposes, we would like to know how many different pages users access and how much

time they spend during each visit. To get an idea of how this behavior changes over

time, we want to calculate these statistics daily, as this allows us to compare changes

across different days and larger time periods.

For practical reasons, the external tracking service does not store data for more

than 30 days, so we need to store and accumulate this data ourselves, as we want to

retain our history for longer periods of time. Normally, because the raw data might be

quite large, it would make sense to store this data in a cloud storage service such as

Amazon’s S3 or Google’s Cloud Storage, as they combine high durability with relatively

low costs. However, for simplicity’s sake, we won’t worry about these things and

will keep our data locally.

To simulate this example, we have created a simple (local) API that allows us to

retrieve user events. For example, we can retrieve the full list of available events from

the past 30 days using the following API call:

curl -o /tmp/events.json http://localhost:5000/events

This call returns a (JSON-encoded) list of user events we can analyze to calculate our

user statistics.

Using this API, we can break our workflow into two separate tasks: one for fetching user events and another for calculating the statistics. The data itself can be downloaded using the BashOperator, as we saw in the previous chapter. For calculating the statistics, we can use a PythonOperator, which allows us to load the data into a Pandas DataFrame and calculate the number of events using a groupby and an aggregation. Altogether, this gives us the DAG shown in listing 3.1.

Listing 3.1 Initial (unscheduled) version of the event DAG (dags/01_unscheduled.py)

import datetime as dt
from pathlib import Path

import pandas as pd
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

dag = DAG(
    dag_id="01_unscheduled",
    start_date=dt.datetime(2019, 1, 1),  # Define the start date for the DAG.
    schedule_interval=None,  # Specify that this is an unscheduled DAG.
)

# Fetch and store the events from the API.
fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command=(
        "mkdir -p /data && "
        "curl -o /data/events.json "
        "http://localhost:5000/events"
    ),
    dag=dag,
)


def _calculate_stats(input_path, output_path):
    """Calculates event statistics."""
    # Load the events and calculate the required statistics.
    events = pd.read_json(input_path)
    stats = events.groupby(["date", "user"]).size().reset_index()
    # Make sure the output directory exists and write results to CSV.
    Path(output_path).parent.mkdir(exist_ok=True)
    stats.to_csv(output_path, index=False)


calculate_stats = PythonOperator(
    task_id="calculate_stats",
    python_callable=_calculate_stats,
    op_kwargs={
        "input_path": "/data/events.json",
        "output_path": "/data/stats.csv",
    },
    dag=dag,
)

# Set order of execution.
fetch_events >> calculate_stats
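To make the aggregation in _calculate_stats concrete, the following sketch reproduces the same count (the number of events per date and user) using only the Python standard library instead of Pandas. The sample events are made up for illustration, and we assume each event record carries a date and a user field, as the groupby in the listing suggests; the count_events helper is a hypothetical name.

```python
from collections import Counter

# Made-up sample events; the real API response may contain more fields.
events = [
    {"date": "2019-01-01", "user": "10.0.0.1"},
    {"date": "2019-01-01", "user": "10.0.0.1"},
    {"date": "2019-01-01", "user": "10.0.0.2"},
    {"date": "2019-01-02", "user": "10.0.0.1"},
]


def count_events(events):
    """Count events per (date, user) pair, similar to groupby(...).size()."""
    return Counter((event["date"], event["user"]) for event in events)


stats = count_events(events)
for (date, user), count in sorted(stats.items()):
    print(date, user, count)
```

Each output row corresponds to one row of the stats DataFrame that the listing writes to CSV.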

Now we have our basic DAG, but we still need to make sure it’s run regularly by Airflow. Let’s get it scheduled so that we have daily updates!

3.2 Running at regular intervals

As we saw in chapter 2, Airflow DAGs can be run at regular intervals by defining a schedule interval for them using the schedule_interval argument when initializing the DAG. Setting this argument to None means the DAG will not be scheduled and will be run only when triggered manually from the UI or the API.

3.2.1 Defining scheduling intervals

In our example of ingesting user events, we would like to calculate statistics daily, so it would make sense to schedule our DAG to run once every day. As this is a common use case, Airflow provides the convenient macro @daily for defining a daily scheduled interval, which runs our DAG once every day at midnight.



dag = DAG(
    dag_id="02_daily_schedule",
    schedule_interval="@daily",  # Schedule the DAG to run every day at midnight.
    start_date=dt.datetime(2019, 1, 1),  # Date/time to start scheduling DAG runs
    ...
)

Airflow also needs to know when we want to start executing the DAG, specified by its

start date. Based on this start date, Airflow will schedule the first execution of our

DAG to run at the first schedule interval after the start date (start + interval). Subsequent

runs will continue executing at schedule intervals following this first interval.

NOTE Pay attention to the fact that Airflow starts tasks in an interval at the end of the interval. If you are developing a DAG on January 1, 2019 at 13:00, with a start_date of 01-01-2019 and an @daily interval, the DAG first starts running at midnight. Nothing will happen if you run the DAG on January 1 at 13:00 until midnight is reached.

For example, say we define our DAG with a start date on the first of January, as previously shown in listing 3.2. Combined with a daily scheduling interval, this will result in Airflow running our DAG at midnight on every day following the first of January (figure 3.1). Note that our first execution takes place on the second of January (the first interval following the start date) and not the first. We’ll get into the reasoning behind this behavior later in this chapter (section 3.4).
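The relationship between the start date and the first executions can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Airflow’s scheduler code, and the daily_run_times helper is a hypothetical name introduced for illustration.

```python
import datetime as dt


def daily_run_times(start_date, n_runs):
    """Return the moments at which the first n_runs daily runs are triggered.

    Each run is triggered at the end of its interval, so the first run
    happens at start_date + 1 day, not at start_date itself.
    """
    interval = dt.timedelta(days=1)
    return [start_date + interval * (i + 1) for i in range(n_runs)]


runs = daily_run_times(dt.datetime(2019, 1, 1), n_runs=3)
for run in runs:
    print(run)
```

With a start date of 2019-01-01, the three computed run times fall at midnight on January 2, 3, and 4, matching the first, second, and third executions described in the text.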

Without an end date, Airflow will (in principle) keep executing our DAG on this daily schedule until the end of time. However, if we already know that our project has a fixed duration, we can tell Airflow to stop running our DAG after a certain date using the end_date parameter.

Listing 3.2 Defining a daily schedule interval (dags/02_daily_schedule.py)


Figure 3.1 Schedule intervals for a daily scheduled DAG with a specified start date (2019-01-01). Arrows indicate the time point at which a DAG is executed. Without a specified end date, the DAG will keep being executed every day until the DAG is switched off.

Timeline labels: start date (2019-01-01 00:00), first execution (2019-01-02 00:00), second execution (2019-01-03 00:00), third execution (2019-01-04 00:00), future executions.


dag = DAG(
    dag_id="03_with_end_date",
    schedule_interval="@daily",
    start_date=dt.datetime(year=2019, month=1, day=1),
    end_date=dt.datetime(year=2019, month=1, day=5),
)

This will result in the full set of schedule intervals shown in figure 3.2.

3.2.2 Cron-based intervals

So far, all our examples have shown DAGs running at daily intervals. But what if we

want to run our jobs on hourly or weekly intervals? And what about more complicated

intervals in which we may want to run our DAG at 23:45 every Saturday?

To support more complicated scheduling intervals, Airflow allows us to define

scheduling intervals using the same syntax as used by cron, a time-based job scheduler

used by Unix-like computer operating systems such as macOS and Linux. This syntax

consists of five components and is defined as follows:

# ┌─────── minute (0 - 59)
# │ ┌────── hour (0 - 23)
# │ │ ┌───── day of the month (1 - 31)
# │ │ │ ┌───── month (1 - 12)
# │ │ │ │ ┌──── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │      7 is also Sunday on some systems)
# * * * * *

In this definition, a cron job is executed when the time/date specification fields

match the current system time/date. Asterisks (*) can be used instead of numbers to

define unrestricted fields, meaning we don’t care about the value of that field.
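The matching rule described here can be illustrated with a small sketch. It is a deliberately simplified model that only understands asterisks and single numbers (no ranges, lists, or step values, and it does not treat 7 as Sunday), and it is not how cron or Airflow evaluate expressions internally. The field_matches and cron_matches helpers are hypothetical names introduced for illustration.

```python
import datetime as dt


def field_matches(field, value):
    """Check one cron field that is either '*' or a single number."""
    return field == "*" or int(field) == value


def cron_matches(expression, moment):
    """Return True if a five-field cron expression matches the datetime."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts weekdays from Sunday (0) to Saturday (6); isoweekday()
    # counts Monday (1) to Sunday (7), so taking it modulo 7 maps Sunday to 0.
    cron_weekday = moment.isoweekday() % 7
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        and field_matches(weekday, cron_weekday)
    )


# 23:45 every Saturday; 2019-01-05 was a Saturday, 2019-01-06 a Sunday.
print(cron_matches("45 23 * * 6", dt.datetime(2019, 1, 5, 23, 45)))  # True
print(cron_matches("45 23 * * 6", dt.datetime(2019, 1, 6, 23, 45)))  # False
```

The expression 45 23 * * 6 corresponds to the "23:45 every Saturday" schedule mentioned at the start of this section: the minute, hour, and day-of-week fields are restricted, while the day-of-month and month fields are left unrestricted.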

Although this cron-based representation may seem a bit convoluted, it provides us

with considerable flexibility for defining time intervals. For example, we can define

hourly, daily, and weekly intervals using the following cron expressions:

Listing 3.3 Defining an end date for the DAG (dags/03_with_end_date.py)

Start

date

First

execution

Second

execution

Third

execution

Fourth

execution

Final

execution

End

date

2019-01-01