Airflow DAG — Best Practices
DAG as configuration file
The Airflow scheduler parses and processes DAG files at each heartbeat.
If DAG files are heavy and contain a lot of top-level code, the
scheduler will spend considerable resources and time processing them
at every heartbeat. So it is advised to keep DAGs light, more like a
configuration file. A step further is to define the workflow in
YAML/JSON and generate the DAG from that definition. This has a double
advantage: 1. Since DAGs are generated programmatically, they are
consistent and reproducible at any time. 2. Non-Python users can
author workflows as well.
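Here is a minimal sketch of a YAML-driven DAG. The workflow.yaml layout, its path, and the use of BashOperator are illustrative assumptions, and the import paths follow Airflow 2.x:

import yaml
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumed layout of workflow.yaml:
#   dag_id: sample_workflow
#   schedule: "@daily"
#   tasks:
#     - id: extract
#       command: "python /opt/jobs/extract.py"
with open("/opt/airflow/dags/workflow.yaml") as f:
    config = yaml.safe_load(f)

with DAG(
    dag_id=config["dag_id"],
    schedule_interval=config.get("schedule", "@daily"),
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    for task_cfg in config["tasks"]:
        BashOperator(task_id=task_cfg["id"], bash_command=task_cfg["command"])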
Non-configuration code blocks can also be kept outside the DAG
definition and referenced through the template_searchpath attribute.
For example, if you are connecting to an RDS instance and executing a
SQL command, that SQL command should be loaded from a file, and the
directory containing the file should be listed in template_searchpath.
The same applies to Hive queries (.hql).
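A minimal sketch of this pattern, assuming the SQL file lives in /opt/airflow/sql and using the Postgres provider's PostgresOperator (the connection id and file name are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="rds_cleanup",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    template_searchpath=["/opt/airflow/sql"],  # where cleanup.sql is looked up
    catchup=False,
) as dag:
    PostgresOperator(
        task_id="run_cleanup",
        postgres_conn_id="rds_default",
        sql="cleanup.sql",  # loaded from template_searchpath and rendered by Jinja
    )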
Invest in Airflow plugin system
Maintain a proper plugin repository for authoring the custom plugins
your organization needs. When creating a plugin, keep it generic so it
is reusable across use cases. This helps with versioning and keeps
workflows clean, mostly configuration details rather than
implementation logic. Also, do not perform heavy work while
initializing the operator class; push such operations into the
execute method.
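A sketch of what "light init, heavy execute" can look like in a custom operator; the operator name, connection id, and use of PostgresHook are assumptions:

from airflow.models.baseoperator import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

class RunSqlOperator(BaseOperator):
    template_fields = ("sql",)

    def __init__(self, sql, conn_id="rds_default", **kwargs):
        # Keep __init__ cheap: only store configuration here, never open
        # connections, since operators are instantiated on every DAG parse.
        super().__init__(**kwargs)
        self.sql = sql
        self.conn_id = conn_id

    def execute(self, context):
        # Heavy work (connections, queries) happens only when the task runs.
        hook = PostgresHook(postgres_conn_id=self.conn_id)
        return hook.get_first(self.sql)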
Do not perform data processing in DAG files
Since DAGs are Python-based, we will definitely be tempted to use
pandas or similar libraries in them, but we should not. Airflow is an
orchestrator, not an execution framework; all computation should be
delegated to a specific target system.
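As one illustration of delegating computation, the sketch below hands a transformation to Spark instead of running it in the DAG file; it assumes the Apache Spark provider is installed, a spark_default connection exists, and the job path is hypothetical:

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_transform",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    SparkSubmitOperator(
        task_id="transform",
        application="/opt/jobs/transform.py",  # assumed path to the Spark job
        conn_id="spark_default",
    )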
Delegate API or DB calls to operators
This is somewhat similar to the first point. API calls or DB
connections made in top-level code in DAG files overload the scheduler
and webserver, because any call defined outside an operator is executed
at every heartbeat. It is advisable to push these calls down into an
operator, for example a util/common operator or a PythonOperator
callable.
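A minimal sketch of the difference; the API URL is a placeholder:

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

# Anti-pattern (would run at every parse/heartbeat):
# latest = requests.get("https://api.example.com/latest").json()

def fetch_latest():
    # Runs only when the task executes, not when the DAG file is parsed.
    return requests.get("https://api.example.com/latest", timeout=30).json()

with DAG(
    dag_id="api_pull",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_latest", python_callable=fetch_latest)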
Make DAGs/Tasks idempotent
A DAG should produce the same data on every run for the same logical
date. Read from a partition and write to a partition, and treat both as
immutable. Handle partition creation and deletion explicitly to avoid
unexpected errors.
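A sketch of an idempotent, partition-keyed load; the table names and connection id are assumptions:

from datetime import datetime
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="partitioned_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The task always rewrites exactly one partition, keyed on the logical
    # date ({{ ds }}), so re-running it produces the same data.
    PostgresOperator(
        task_id="load_partition",
        postgres_conn_id="dwh_default",
        sql=[
            "DELETE FROM sales WHERE ds = '{{ ds }}'",
            "INSERT INTO sales SELECT * FROM staging_sales WHERE ds = '{{ ds }}'",
        ],
    )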
Use single variable per DAG
Every access to an Airflow Variable opens a connection to the metadata
DB. With multiple DAGs running and multiple Variables being read, this
can overload the DB. It is better to use a single Variable per DAG that
holds a JSON object; this creates a single connection, and the JSON can
be parsed to get the desired key-value pairs.
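For example (the variable name and keys are illustrative):

from airflow.models import Variable

# One Variable per DAG, stored as JSON, fetched with a single metadata-DB call.
# Ideally read it inside a task or via {{ var.json.my_dag_config.some_key }}
# templating rather than at module top level.
config = Variable.get("my_dag_config", deserialize_json=True)

source_path = config["source_path"]    # e.g. "s3://bucket/raw"
target_table = config["target_table"]  # e.g. "sales"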
Tag the DAG
Tags on a DAG help with filtering and grouping DAGs. Keep them
consistent with your infrastructure's current tagging system, for
example tags based on business unit, project, or application category.
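For example (the tag values are illustrative; use your organization's scheme):

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="finance_daily_report",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    tags=["bu-finance", "project-reporting", "app-batch"],
    catchup=False,
) as dag:
    ...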
Don’t Abuse XCom
XCom acts as a channel between tasks for sharing data, and it uses the
backend DB to do so. Hence we should not pass large amounts of data
through it; bigger payloads will overload the backend DB.
Use intermediate storage between tasks
If the data to be shared between two tasks is large, store it in an
intermediate storage system and pass only a reference to it to the
downstream task.
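A sketch of passing a reference instead of the data; the S3 bucket and layout are assumptions, and the context handling follows Airflow 2.x:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    s3_key = f"s3://my-bucket/raw/{context['ds']}/data.parquet"
    # ... write the extracted data to that location ...
    return s3_key  # the return value (a small string) is pushed to XCom

def transform(**context):
    s3_key = context["ti"].xcom_pull(task_ids="extract")
    # ... read the data from s3_key and process it ...

with DAG(
    dag_id="xcom_reference_passing",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task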
Use the power of Jinja templating
Many of the operators support template_fields. This tuple defines which
fields will be rendered with Jinja.
class PythonOperator(BaseOperator):
    template_fields = ('templates_dict', 'op_args', 'op_kwargs')
When writing your own custom operator, override this template_fields
attribute.
class CustomBashOperator(BaseOperator):
    template_fields = ('file_name', 'command', 'dest_host')
In the above example, the fields 'file_name', 'command', and
'dest_host' will be available for Jinja templating.
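A usage sketch, assuming the CustomBashOperator above accepts these keyword arguments and is used inside a with DAG(...) block; the Jinja expressions are rendered at runtime because the fields appear in template_fields:

copy_file = CustomBashOperator(
    task_id="copy_daily_file",
    file_name="events_{{ ds }}.csv",
    command="scp /data/events_{{ ds }}.csv etl@{{ var.value.ingest_host }}:/landing/",
    dest_host="{{ var.value.ingest_host }}",
)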
Implement DAG Level Access control
Leverage Flask AppBuilder views to implement DAG-level access control.
Set the DAG owner to the correct Linux user, and create custom roles to
decide who can take DAG/Task actions.
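One way to express this is the DAG's access_control argument; the role name below is an assumption and must already exist in the RBAC UI, and permission names vary slightly between Airflow versions:

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="payments_reconciliation",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    access_control={"team-finance": {"can_read", "can_edit"}},
    catchup=False,
) as dag:
    ...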
Use static start_date
A static DAG start_date helps in correctly populating DAG runs and the
schedule; avoid dynamic values such as datetime.now().
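For example:

from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="static_start_example",
    start_date=datetime(2021, 1, 1),  # static; avoid datetime.now()
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ...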
Rename DAGs in case of structural change
Until the DAG versioning feature is implemented, rename the DAG
whenever its structure changes. This creates a new DAG, and the run
history of the old DAG version is preserved without any inconsistency.
Some other best practices:
Set retries at the DAG level
Use consistent file structure
Choose a consistent method for task dependencies
Have notification strategy on failure
Keep an eye on the upcoming enhancements to Airflow:
Functional DAG
DAG Serialization
Scheduler HA
Production grade REST APIs
Smart Sensors
Task Groups
Have fun with DAGs