This repository contains an end-to-end data pipeline built using Apache Airflow, Sling, DBT, and BigQuery. The pipeline ingests CSV data into a Postgres-based eCommerce database, replicates it to Google BigQuery, and performs transformations using DBT.
- **Custom Airflow Image:** A custom `airflow-sling` Docker image was created by building a Dockerfile that adds the `sling` package on top of the Airflow image from Docker Hub.
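As a rough sketch, such a Dockerfile might look like the following (the Airflow tag is an assumption; the actual `docker/Dockerfile` may pin a different version):

```dockerfile
# Base Airflow image from Docker Hub -- the tag here is illustrative
FROM apache/airflow:2.9.3

# Add the sling package on top of the Airflow image
# (sling is distributed on PyPI, so pip works in the airflow user context)
USER airflow
RUN pip install --no-cache-dir sling
```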
- **Docker Compose Services:** In addition to the mandatory Airflow services, the `docker-compose.yml` file defines:
  - `ecommerce-db`: a Postgres service for the eCommerce database.
  - `sling-service`: a service that handles data ingestion from CSV files into `ecommerce-db` using Sling.
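A hedged sketch of those two extra services (image tags, variable names, and volume paths are illustrative assumptions, not the repository's exact configuration):

```yaml
services:
  ecommerce-db:
    image: postgres:16                        # Postgres service for the eCommerce database
    environment:
      POSTGRES_USER: ${ECOMMERCE_DB_USER}     # keys resolved from the .env file
      POSTGRES_PASSWORD: ${ECOMMERCE_DB_PASSWORD}
      POSTGRES_DB: ecommerce
    ports:
      - "5432:5432"

  sling-service:
    build:
      context: .
      dockerfile: docker/Dockerfile           # reuses the custom airflow-sling image
    volumes:
      - ./sling_files:/sling                  # replication.yaml and source CSVs
    depends_on:
      - ecommerce-db
```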
- **Sling Connector Setup:**
  - A Sling connector to the eCommerce Postgres database is defined using environment variables set in the `docker-compose.yml` file.
  - All secrets are managed securely in a `.env` file, with only their corresponding keys referenced in the Docker Compose configuration.
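Sling resolves connections from environment variables whose value is a connection URL, so the connector can be wired up roughly like this in `docker-compose.yml` (the connection and key names here are assumptions):

```yaml
services:
  sling-service:
    environment:
      # Connection name ECOMMERCE_DB is illustrative; credentials come from .env keys
      ECOMMERCE_DB: "postgres://${ECOMMERCE_DB_USER}:${ECOMMERCE_DB_PASSWORD}@ecommerce-db:5432/ecommerce?sslmode=disable"
```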
- **Data Ingestion:**
  - The Sling service uses a `replication.yaml` file to move CSV files into the eCommerce Postgres database.
  - Once the services are up (`docker-compose up`), the database is populated by running the `sling run` command.
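A minimal sketch of what the CSV-to-Postgres `replication.yaml` could look like (file paths, stream names, and target tables are hypothetical):

```yaml
source: LOCAL                # local filesystem connection for the CSV files
target: ECOMMERCE_DB         # the Postgres connection defined via environment variables

defaults:
  mode: full-refresh         # initial load: recreate the target tables

streams:
  file://./data/orders.csv:
    object: public.orders    # target table in the eCommerce database
  file://./data/customers.csv:
    object: public.customers
```

A replication file of this shape is typically executed with `sling run -r replication.yaml`.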
- **Airflow DAG Setup:** The Airflow DAG automates data processing using the `BashOperator` to run Sling commands. The DAG consists of two primary tasks:
  - `set_sling_bigquery`: configures the Sling-BigQuery connector.
  - `use_sling`: moves data from the eCommerce Postgres database to BigQuery in incremental full-data upsert mode, ensuring only new records are added and existing records are updated; this uses the table's primary key for deduplication.
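The Postgres-to-BigQuery leg can be sketched as a second replication config; in Sling, `incremental` mode with only a `primary_key` (and no update key) behaves as a full-data upsert, which matches the behaviour described above. Connection, dataset, and table names here are assumptions:

```yaml
source: ECOMMERCE_DB         # Postgres connection
target: BIGQUERY             # BigQuery connection configured by set_sling_bigquery

defaults:
  mode: incremental          # with primary_key only: full-data upsert
  primary_key: [id]          # deduplicate/upsert on the table's primary key

streams:
  public.orders:
    object: raw_dataset.orders
  public.customers:
    object: raw_dataset.customers
```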
- **DBT Transformations:** The data transformations are handled using dbt-core with the BigQuery adapter. The models are organized into subfolders:
  - **Staging models:** materialized as views.
  - **Intermediate models:** materialized as ephemeral for optimized query performance.
  - **Final models:** materialized as tables.
  - DBT configuration is stored in the `dbt_profile.yml` file, while the raw table definitions are documented in `sources.yml`.
  - `schema.yml` files in the `intermediate` and `final` subfolders document and test the models at each step.
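Folder-level materializations like these are typically declared in the project's dbt configuration; a sketch (the project name `ecommerce` is an assumption):

```yaml
# Sketch of the models block in the dbt project configuration
models:
  ecommerce:
    staging:
      +materialized: view        # staging models -> views
    intermediate:
      +materialized: ephemeral   # inlined as CTEs, never written to the warehouse
    final:
      +materialized: table       # final models -> tables
```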
**Repository Structure:**
├── dags/ # Airflow DAGs; also contains the Sling replication.yaml file for Postgres-to-BigQuery replication
├── dbt/ # DBT project files
│ ├── models/
│ │ ├── staging/ # DBT models materialized as views
│ │ ├── intermediate/ # DBT models materialized as ephemeral
│ │ ├── final/ # DBT models materialized as tables
│ │ ├── sources.yml # Raw table metadata definitions
│ │ ├── schema.yml # Model documentation and tests
├── sling_files/replication.yaml # Sling replication file used initially to load data from CSV into PostgreSQL
├── docker-compose.yml # Docker Compose configuration
├── docker/Dockerfile # Custom Airflow image setup
├── .env # Environment variables (secrets not committed)
└── README.md # This readme file

**Prerequisites:**

- Docker and Docker Compose installed.
- A `.env` file containing the required secrets for database connections and services.
**Running the Pipeline:**

Start the services with `docker-compose up --build`. This command will:
- Build the custom Airflow image with the Sling package.
- Start the `ecommerce-db` and `sling-service` for data ingestion.
- Launch Airflow for orchestrating the pipeline.
Once the services are up, the Sling service runs the `sling run` command to ingest the CSV data into the eCommerce Postgres database.
Airflow will orchestrate the pipeline: open the Airflow UI and trigger the DAG. The DAG includes:

- `set_sling_bigquery`: sets up the Sling-BigQuery connector.
- `use_sling`: replicates data from the Postgres database to BigQuery.
To run the DBT transformations, execute the following commands inside the dbt directory:
    dbt run

- Staging models are materialized as views.
- Intermediate models are ephemeral for optimized performance.
- Final models are materialized as tables.
To run the tests and generate documentation:

    dbt test
    dbt docs generate

**Notes:**

- Ensure all required environment variables are configured in your `.env` file.
- The `docker-compose.yml` file is configured to handle secret management using environment keys.
- Use the `replication.yaml` files in both the `sling_files` and `dags` folders to define how Sling ingests data from CSV into Postgres, and how data is replicated to BigQuery in incremental full upsert mode.
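For reference, the `.env` file might contain keys along these lines (names and values are purely illustrative; the repository's actual keys may differ):

```bash
# Illustrative .env entries -- never commit real values
ECOMMERCE_DB_USER=postgres
ECOMMERCE_DB_PASSWORD=change-me
GOOGLE_APPLICATION_CREDENTIALS=/opt/airflow/keys/bq-service-account.json
```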