Kara Solutions ELT is a modular pipeline for extracting, loading, and transforming data from Telegram channels into a PostgreSQL database for analytics and reporting. The project leverages Docker for reproducible environments, dbt for analytics engineering, and Python for data collection and orchestration.
Features:
- Scrape messages and media from multiple Telegram channels
- Store raw and processed data in a structured data lake
- Load data into PostgreSQL for further analysis
- Use dbt for analytics, transformation, and reporting
- Containerized setup for easy deployment and reproducibility
Project Structure:
```
├── analytical_api/      # FastAPI app: CRUD, database, models, schemas
├── data/                # Raw and processed data storage
│   └── raw/             # Raw ingested data (images, messages)
├── ethio_medical_dbt/   # dbt analytics project (models, macros, seeds, snapshots, logs)
├── logs/                # Log files for scraping, enrichment, loading, dbt
├── notebooks/           # Jupyter notebooks for exploration and analysis (empty by default)
├── orchestration/       # Orchestration logic (e.g., jobs.py)
├── scripts/             # Python scripts for scraping, enrichment, loading
├── requirements.txt     # Python dependencies
├── Dockerfile           # Docker image definition
├── docker-compose.yml   # Multi-container orchestration
├── .env                 # Environment variables (API keys, DB credentials)
└── README.md            # Project documentation
```
Key Directories:
- `analytical_api/`: FastAPI backend for CRUD and data access.
- `ethio_medical_dbt/`: dbt project for analytics engineering (models, macros, seeds, etc.).
- `orchestration/`: Python orchestration logic (e.g., job scheduling, ETL coordination).
- `scripts/`: Standalone scripts for scraping, enrichment, and loading data.
- `data/raw/`: Stores raw ingested data (images, messages) from Telegram channels.
- `logs/`: All log files generated by scripts and dbt.
Prerequisites:
- Docker & Docker Compose (for the containerized workflow)
- Telegram API credentials (`API_ID`, `API_HASH`)
- Python 3.10+ (if running scripts outside Docker)
- PostgreSQL (runs in Docker by default)
Environment Variables:
Create a `.env` file in the project root with the following content:
```
# Telegram API Credentials
API_ID=your_telegram_api_id
API_HASH=your_telegram_api_hash

# Database Credentials
POSTGRES_USER=user
POSTGRES_PASSWORD=your_password
POSTGRES_DB=medical_db
DB_HOST=db
DB_PORT=5432
```
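For scripts run outside Docker, the same values can be read in Python. A minimal sketch, assuming the `python-dotenv` package is available (not stated in this README; check `requirements.txt`):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read .env from the project root

# Telegram credentials used by the scraper
API_ID = int(os.environ["API_ID"])
API_HASH = os.environ["API_HASH"]

# Database connection string assembled from the same variables
DB_URL = (
    f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['POSTGRES_DB']}"
)
```

Note that `DB_HOST=db` resolves only inside the Compose network; use `localhost` when connecting from the host machine.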
Setup:
- Clone the repository and navigate to the project folder:
  ```
  git clone https://github.com/b-isry/Ethio-medical.git
  cd kara_solutions_elt
  ```
- Add your `.env` file as described above.
- Build and start the containers:
  ```
  docker-compose up --build
  ```
- (Optional) Install Python dependencies locally if running scripts outside Docker:
  ```
  pip install -r requirements.txt
  ```
Running the Scraper:
- The main scraping script is at `scripts/scraper.py` (an illustrative sketch follows this list).
- It logs to `logs/scraping.log` and saves data under `data/raw/`.
- To run the scraper inside the container:
  ```
  docker-compose exec app python scripts/scraper.py  # Run the scraper in the app container
  ```
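The README doesn't name the Telegram client library, so as an illustration only, here is a minimal sketch of the kind of loop `scripts/scraper.py` runs, assuming Telethon and the `.env` credentials above (the per-channel JSON layout under `data/raw/` is also an assumption):

```python
import json
import os

from telethon.sync import TelegramClient  # assumes Telethon is the client library

API_ID = int(os.environ["API_ID"])
API_HASH = os.environ["API_HASH"]

CHANNELS = ["lobelia4cosmetics", "tikvahpharma"]

with TelegramClient("scraper_session", API_ID, API_HASH) as client:
    for channel in CHANNELS:
        messages = []
        # Walk the most recent messages in the channel
        for msg in client.iter_messages(channel, limit=100):
            messages.append({"id": msg.id, "date": str(msg.date), "text": msg.text})
        # Write one raw JSON file per channel under data/raw/
        os.makedirs(f"data/raw/{channel}", exist_ok=True)
        with open(f"data/raw/{channel}/messages.json", "w", encoding="utf-8") as f:
            json.dump(messages, f, ensure_ascii=False, indent=2)
```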
Adding Channels:
To scrape a new Telegram channel, add its handle to the `CHANNELS` list in `scripts/scraper.py`:
```python
CHANNELS = [
    'lobelia4cosmetics',  # Example channel 1
    'tikvahpharma',       # Example channel 2
    # Add more channels as needed
]
```

Database:
- PostgreSQL runs in the `db` service (see `docker-compose.yml`).
- Connection details are set via `.env` and used by dbt and scripts (a loading sketch follows this list).
- Data is persisted in a Docker volume (`postgres_data`).
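To make the loading step concrete, here is a minimal sketch that writes scraped messages into Postgres, assuming `psycopg2` and a hypothetical `raw.telegram_messages` landing table (neither is confirmed by this README):

```python
import json
import os

import psycopg2  # assumes psycopg2 is among the Python dependencies

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)

# Path follows the illustrative layout from the scraper sketch above
with open("data/raw/lobelia4cosmetics/messages.json", encoding="utf-8") as f:
    messages = json.load(f)

with conn, conn.cursor() as cur:
    for msg in messages:
        # raw.telegram_messages is a hypothetical landing table
        cur.execute(
            "INSERT INTO raw.telegram_messages (id, date, text) "
            "VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING",
            (msg["id"], msg["date"], msg["text"]),
        )
conn.close()
```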
Analytics with dbt:
The analytics layer is managed with dbt and lives in the `ethio_medical_dbt/` directory. It includes models, macros, seeds, and snapshots for transforming and analyzing data.
To use dbt inside the container:
```
docker-compose exec app bash
cd ethio_medical_dbt
dbt debug   # Test the dbt connection
dbt run     # Run dbt models
```
You can add or modify dbt models in `ethio_medical_dbt/models/` and use macros or seeds as needed. See the dbt documentation for advanced analytics workflows.
Troubleshooting:
- If you get a `FileNotFoundError` for logs, ensure the `logs/` directory exists or is created before running scripts (see the snippet after this list).
- For `pip install` timeouts, try increasing the timeout or using a different PyPI mirror in the Dockerfile.
- If dbt reports `git` missing, ensure your Dockerfile installs git before the Python dependencies.
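For the first issue, a small guard near the top of each script avoids the error (a sketch; adjust the filename to match each script's log):

```python
import logging
import os

# Create logs/ if it doesn't exist so opening the log file never fails
os.makedirs("logs", exist_ok=True)
logging.basicConfig(filename="logs/scraping.log", level=logging.INFO)
```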
Orchestration:
The `orchestration/` directory contains the logic for managing ETL jobs and workflow scheduling. You can extend it to automate scraping, enrichment, and loading tasks as needed, as sketched below.
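The README doesn't describe `jobs.py` itself, so as a sketch only, here is one way a job could chain the pipeline stages with plain `subprocess` calls (`run_pipeline` and `scripts/loader.py` are hypothetical names):

```python
import subprocess

def run_pipeline() -> None:
    """Hypothetical job: scrape -> load -> transform."""
    # 1. Scrape raw messages and media from Telegram into data/raw/
    subprocess.run(["python", "scripts/scraper.py"], check=True)
    # 2. Load the raw data into PostgreSQL (loader script name is illustrative)
    subprocess.run(["python", "scripts/loader.py"], check=True)
    # 3. Transform with dbt
    subprocess.run(["dbt", "run"], cwd="ethio_medical_dbt", check=True)

if __name__ == "__main__":
    run_pipeline()
```

A scheduler (cron, or a tool such as Dagster) could then invoke `run_pipeline` on an interval.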
Contributing:
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
For questions or support, please open an issue in this repository.
License:
MIT License