Kara Solutions ELT Project

Kara Solutions ELT is a modular pipeline for extracting, loading, and transforming data from Telegram channels into a PostgreSQL database for analytics and reporting. The project leverages Docker for reproducible environments, dbt for analytics engineering, and Python for data collection and orchestration.

Features

  • Scrape messages and media from multiple Telegram channels
  • Store raw and processed data in a structured data lake
  • Load data into PostgreSQL for further analysis
  • Use dbt for analytics, transformation, and reporting
  • Containerized setup for easy deployment and reproducibility

Folder Structure


├── analytical_api/      # FastAPI app: CRUD, database, models, schemas
├── data/                # Raw and processed data storage
│   └── raw/             # Raw ingested data (images, messages)
├── ethio_medical_dbt/   # dbt analytics project (models, macros, seeds, snapshots, logs)
├── logs/                # Log files for scraping, enrichment, loading, dbt
├── notebooks/           # Jupyter notebooks for exploration and analysis (empty by default)
├── orchestration/       # Orchestration logic (e.g., jobs.py)
├── scripts/             # Python scripts for scraping, enrichment, loading
├── requirements.txt     # Python dependencies
├── Dockerfile           # Docker image definition
├── docker-compose.yml   # Multi-container orchestration
├── .env                 # Environment variables (API keys, DB credentials)
└── README.md            # Project documentation

Key Directories:

  • analytical_api/: FastAPI backend for CRUD and data access.
  • ethio_medical_dbt/: dbt project for analytics engineering (models, macros, seeds, etc.).
  • orchestration/: Python orchestration logic (e.g., job scheduling, ETL coordination).
  • scripts/: Standalone scripts for scraping, enrichment, and loading data.
  • data/raw/: Stores raw ingested data (images, messages) from Telegram channels.
  • logs/: All log files generated by scripts and dbt.

Getting Started

Prerequisites

  • Docker & Docker Compose (for containerized workflow)
  • Telegram API credentials (API_ID, API_HASH)
  • Python 3.10+ (if running scripts outside Docker)
  • PostgreSQL (runs in Docker by default)

Environment Variables

Create a .env file in the project root with the following content:

# Telegram API Credentials
API_ID=your_telegram_api_id
API_HASH=your_telegram_api_hash

# Database Credentials
POSTGRES_USER=user
POSTGRES_PASSWORD=your_password
POSTGRES_DB=medical_db
DB_HOST=db
DB_PORT=5432
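
If you run the scripts outside Docker, the same variables can be loaded with python-dotenv. This is a minimal sketch, assuming python-dotenv is installed (e.g. via requirements.txt):

from dotenv import load_dotenv  # python-dotenv, assumed to be installed
import os

load_dotenv()  # reads .env from the project root

API_ID = os.environ["API_ID"]
API_HASH = os.environ["API_HASH"]
DB_HOST = os.getenv("DB_HOST", "db")
DB_PORT = int(os.getenv("DB_PORT", "5432"))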

Setup

  1. Clone the repository and navigate to the project folder:

    git clone https://github.com/b-isry/Ethio-medical.git
    cd Ethio-medical
  2. Add your .env file as described above.

  3. Build and start the containers:

    docker-compose up --build
  4. (Optional) Install Python dependencies locally if running scripts outside Docker:

    pip install -r requirements.txt

Running the Scraper

  • The main scraping script is at scripts/scraper.py.

  • It logs to logs/scraping.log and saves data under data/raw/.

  • To run the scraper inside the container:

    docker-compose exec app python scripts/scraper.py  # Run the scraper in the app container

Example: Adding a New Channel

To scrape a new Telegram channel, add its handle to the CHANNELS list in scripts/scraper.py:

CHANNELS = [
    'lobelia4cosmetics',  # Example channel 1
    'tikvahpharma',       # Example channel 2
    # Add more channels as needed
]
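
Scrapers of this kind are typically built on Telethon, which is what the API_ID/API_HASH credentials suggest. The snippet below is an illustrative sketch of that pattern, not the exact contents of scripts/scraper.py; the file layout under data/raw/ is an assumption:

import json
import os
from telethon import TelegramClient

API_ID = int(os.environ["API_ID"])
API_HASH = os.environ["API_HASH"]

CHANNELS = ["lobelia4cosmetics", "tikvahpharma"]

client = TelegramClient("scraper_session", API_ID, API_HASH)

async def main():
    os.makedirs("data/raw/messages", exist_ok=True)
    os.makedirs("data/raw/images", exist_ok=True)
    for channel in CHANNELS:
        # Iterate over the most recent messages in each channel
        async for message in client.iter_messages(channel, limit=100):
            record = {"id": message.id, "date": str(message.date), "text": message.text}
            path = f"data/raw/messages/{channel}_{message.id}.json"
            with open(path, "w", encoding="utf-8") as f:
                json.dump(record, f, ensure_ascii=False)
            if message.media:
                # Save attached media (e.g. images) under data/raw/images/
                await client.download_media(message, file="data/raw/images/")

with client:
    client.loop.run_until_complete(main())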

Database

  • PostgreSQL runs in the db service (see docker-compose.yml).
  • Connection details are set via .env and used by dbt and scripts.
  • Data is persisted in a Docker volume (postgres_data).
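
Loading scripts can connect with the same credentials, for example via psycopg2. A minimal sketch, assuming psycopg2 is installed; the raw_messages table name is hypothetical:

import os
import psycopg2

conn = psycopg2.connect(
    host=os.getenv("DB_HOST", "db"),
    port=os.getenv("DB_PORT", "5432"),
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM raw_messages;")  # hypothetical table
    print(cur.fetchone()[0])
conn.close()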

Analytics & dbt

The analytics layer is managed with dbt, located in the ethio_medical_dbt/ directory. This includes models, macros, seeds, and snapshots for transforming and analyzing data.

To use dbt inside the container:

docker-compose exec app bash
cd ethio_medical_dbt
dbt debug    # Test dbt connection
dbt run      # Run dbt models

You can add or modify dbt models in ethio_medical_dbt/models/ and use macros or seeds as needed. See the dbt documentation for advanced analytics workflows.

Troubleshooting

  • If you get a FileNotFoundError for logs, ensure the logs/ directory exists or is created before running scripts (see the snippet after this list).
  • For pip install timeouts, try increasing the timeout or using a different PyPI mirror in the Dockerfile.
  • If dbt reports git missing, ensure your Dockerfile installs git before Python dependencies.
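
A defensive way to avoid the missing logs/ directory in any script is to create it before configuring logging. A small illustrative snippet:

import logging
import os

os.makedirs("logs", exist_ok=True)  # create logs/ if it does not exist
logging.basicConfig(filename="logs/scraping.log", level=logging.INFO)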

Orchestration

The orchestration/ directory contains logic for managing ETL jobs and workflow scheduling. You can extend this to automate scraping, enrichment, and loading tasks as needed.
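
As an illustration of how orchestration/jobs.py might chain the stages, a minimal sequential runner could look like this. This is a sketch only; apart from scripts/scraper.py, the script names and run order are assumptions:

import subprocess
import sys

STAGES = [
    "scripts/scraper.py",
    "scripts/enrich.py",  # hypothetical name
    "scripts/load.py",    # hypothetical name
]

def run_pipeline():
    # Run each stage in order, stopping on the first failure
    for stage in STAGES:
        print(f"Running {stage}...")
        result = subprocess.run([sys.executable, stage])
        if result.returncode != 0:
            raise SystemExit(f"{stage} failed with exit code {result.returncode}")

if __name__ == "__main__":
    run_pipeline()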

Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

Support

For questions or support, please open an issue in this repository.

License

MIT License
