This project is an ETL (Extract, Transform, Load) pipeline that retrieves data from two news websites.
- Data Extraction: Scrapes data from https://www.ft.com/ and https://www.theguardian.com/europe.
- Data Transformation: Counts occurrences of the keywords election, war, and economy in the extracted data.
- Data Loading: Inserts the processed data into a PostgreSQL database.
- Containerization: The project uses Docker for containerization.
- Scheduling: Uses Apache Airflow to schedule and manage the pipeline runs.
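The transformation step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `count_keywords` is hypothetical, and matching whole words case-insensitively is an assumption (the README does not say whether plurals such as "elections" or substrings are counted).

```python
import re
from collections import Counter

# Keywords named in the feature list above.
KEYWORDS = ("election", "war", "economy")


def count_keywords(text, keywords=KEYWORDS):
    """Count whole-word, case-insensitive occurrences of each keyword.

    Hypothetical helper: the real pipeline may tokenise differently.
    """
    # Splitting into \w+ tokens means "war" is not counted inside
    # longer words such as "warning" or "awards".
    words = Counter(re.findall(r"\w+", text.lower()))
    return {keyword: words[keyword] for keyword in keywords}


if __name__ == "__main__":
    sample = "War and the economy dominate the election. The economy is slowing."
    print(count_keywords(sample))
```

Returning a plain dictionary keeps the result easy to map onto one database row per keyword, although the table layout used by the project is not documented here.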
- Docker
- Docker Compose
- Clone the repository:
git clone https://github.com/AnetteTaivere/ETL_pipeline.git
cd ETL_pipeline
- Build and start Docker containers:
docker compose up -d
- Verify that the containers are running:
docker compose ps -a
- Generate a FERNET_KEY value using:
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
- Add the key to AIRFLOW__CORE__FERNET_KEY.
- You also need to create an Airflow user. Find the ID of the webserver's Docker container and open a shell inside it:
docker exec -it <CONTAINER_ID> /bin/bash
In the container, create a new user with the following command:
airflow users create --username admin --password admin --firstname First --lastname Last --role Admin --email admin@example.com
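For the FERNET_KEY step above, it may help to see what the generated value looks like. A Fernet key is 32 random bytes encoded as URL-safe base64, so the sketch below produces a key in the same format using only the standard library, in case the cryptography package is not installed locally. The helper names are hypothetical, and writing the value as a NAME=value line is an assumption, since the README does not state where AIRFLOW__CORE__FERNET_KEY is set.

```python
import base64
import os


def generate_fernet_key():
    """Return a key in the Fernet format: 32 random bytes, URL-safe base64."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode()


def env_line(key, name="AIRFLOW__CORE__FERNET_KEY"):
    """Format the key as a NAME=value line (hypothetical helper).

    Assumption: the variable is supplied through an environment file or
    the Compose file's environment section.
    """
    return f"{name}={key}"


if __name__ == "__main__":
    print(env_line(generate_fernet_key()))
```

Treat the key as a secret: Airflow uses it to encrypt stored connection passwords and variables, so it should stay out of version control.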
Airflow - http://localhost:8080
- Log in with the user you created
- Check for two running DAGs. Depending on the day, main.py is scheduled to run at the next available hour or the hour after.
Grafana - http://localhost:3000
- Default login: admin/admin
- If you are asked to set a new password, skip that step
- Open Dashboards -> keywords_dashboard
- See the data (it appears only after at least one of the DAGs has run)
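Before opening the two web interfaces listed above, a quick reachability check can save some guessing. The sketch below is only an illustration: the function names are hypothetical, and the health paths (`/health` for an Airflow 2.x webserver, `/api/health` for Grafana) are assumptions about the versions deployed by this project.

```python
import urllib.error
import urllib.request

# Base URLs come from the list above; the health paths are assumptions.
SERVICES = {
    "airflow": "http://localhost:8080/health",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status.
        return error.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None


def check_services(services=SERVICES, fetch=http_status):
    """Map each service name to True when its health URL answers HTTP 200."""
    return {name: fetch(url) == 200 for name, url in services.items()}


if __name__ == "__main__":
    for name, healthy in check_services().items():
        print(f"{name}: {'up' if healthy else 'not reachable'}")
```

The `fetch` parameter exists so the check can be exercised without running containers, for example by passing a stub that returns fixed status codes.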