Kara Solutions ELT Project

Kara Solutions ELT is a modular pipeline for extracting, loading, and transforming data from Telegram channels into a PostgreSQL database for analytics and reporting. The project leverages Docker for reproducible environments, dbt for analytics engineering, and Python for data collection and orchestration.

Features

  • Scrape messages and media from multiple Telegram channels
  • Store raw and processed data in a structured data lake
  • Load data into PostgreSQL for further analysis
  • Use dbt for analytics, transformation, and reporting
  • Containerized setup for easy deployment and reproducibility

Folder Structure


├── analytical_api/      # FastAPI app: CRUD, database, models, schemas
├── data/                # Raw and processed data storage
│   └── raw/             # Raw ingested data (images, messages)
├── ethio_medical_dbt/   # dbt analytics project (models, macros, seeds, snapshots, logs)
├── logs/                # Log files for scraping, enrichment, loading, dbt
├── notebooks/           # Jupyter notebooks for exploration and analysis (empty by default)
├── orchestration/       # Orchestration logic (e.g., jobs.py)
├── scripts/             # Python scripts for scraping, enrichment, loading
├── requirements.txt     # Python dependencies
├── Dockerfile           # Docker image definition
├── docker-compose.yml   # Multi-container orchestration
├── .env                 # Environment variables (API keys, DB credentials)
└── README.md            # Project documentation

Key Directories:

  • analytical_api/: FastAPI backend for CRUD and data access.
  • ethio_medical_dbt/: dbt project for analytics engineering (models, macros, seeds, etc.).
  • orchestration/: Python orchestration logic (e.g., job scheduling, ETL coordination).
  • scripts/: Standalone scripts for scraping, enrichment, and loading data.
  • data/raw/: Stores raw ingested data (images, messages) from Telegram channels.
  • logs/: All log files generated by scripts and dbt.

Getting Started

Prerequisites

  • Docker & Docker Compose (for containerized workflow)
  • Telegram API credentials (API_ID, API_HASH)
  • Python 3.10+ (if running scripts outside Docker)
  • PostgreSQL (runs in Docker by default)

Environment Variables

Create a .env file in the project root with the following content:

# Telegram API Credentials
API_ID=your_telegram_api_id
API_HASH=your_telegram_api_hash

# Database Credentials
POSTGRES_USER=user
POSTGRES_PASSWORD=your_password
POSTGRES_DB=medical_db
DB_HOST=db
DB_PORT=5432
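
If you run the scripts outside Docker, the same variables can be loaded with python-dotenv. This is a minimal sketch, assuming python-dotenv is installed (e.g. via requirements.txt):

from dotenv import load_dotenv  # python-dotenv, assumed to be installed
import os

load_dotenv()  # reads .env from the project root

API_ID = os.environ["API_ID"]
API_HASH = os.environ["API_HASH"]
DB_HOST = os.getenv("DB_HOST", "db")
DB_PORT = int(os.getenv("DB_PORT", "5432"))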

Setup

  1. Clone the repository and navigate to the project folder:

    git clone https://github.com/b-isry/Ethio-medical.git
    cd Ethio-medical
  2. Add your .env file as described above.

  3. Build and start the containers:

    docker-compose up --build
  4. (Optional) Install Python dependencies locally if running scripts outside Docker:

    pip install -r requirements.txt

Running the Scraper

  • The main scraping script is at scripts/scraper.py.

  • It logs to logs/scraping.log and saves data under data/raw/.

  • To run the scraper inside the container:

    docker-compose exec app python scripts/scraper.py  # Run the scraper in the app container

Example: Adding a New Channel

To scrape a new Telegram channel, add its handle to the CHANNELS list in scripts/scraper.py:

CHANNELS = [
    'lobelia4cosmetics',  # Example channel 1
    'tikvahpharma',       # Example channel 2
    # Add more channels as needed
]
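
Scrapers of this kind are typically built on Telethon, which is what the API_ID/API_HASH credentials suggest. The snippet below is an illustrative sketch of that pattern, not the exact contents of scripts/scraper.py; the file layout under data/raw/ is an assumption:

import json
import os
from telethon import TelegramClient

API_ID = int(os.environ["API_ID"])
API_HASH = os.environ["API_HASH"]

CHANNELS = ["lobelia4cosmetics", "tikvahpharma"]

client = TelegramClient("scraper_session", API_ID, API_HASH)

async def main():
    os.makedirs("data/raw/messages", exist_ok=True)
    os.makedirs("data/raw/images", exist_ok=True)
    for channel in CHANNELS:
        # Iterate over the most recent messages in each channel
        async for message in client.iter_messages(channel, limit=100):
            record = {"id": message.id, "date": str(message.date), "text": message.text}
            path = f"data/raw/messages/{channel}_{message.id}.json"
            with open(path, "w", encoding="utf-8") as f:
                json.dump(record, f, ensure_ascii=False)
            if message.media:
                # Save attached media (e.g. images) under data/raw/images/
                await client.download_media(message, file="data/raw/images/")

with client:
    client.loop.run_until_complete(main())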

Database

  • PostgreSQL runs in the db service (see docker-compose.yml).
  • Connection details are set via .env and used by dbt and scripts.
  • Data is persisted in a Docker volume (postgres_data).
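
Loading scripts can connect with the same credentials, for example via psycopg2. A minimal sketch, assuming psycopg2 is installed; the raw_messages table name is hypothetical:

import os
import psycopg2

conn = psycopg2.connect(
    host=os.getenv("DB_HOST", "db"),
    port=os.getenv("DB_PORT", "5432"),
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM raw_messages;")  # hypothetical table
    print(cur.fetchone()[0])
conn.close()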

Analytics & dbt

The analytics layer is managed with dbt, located in the ethio_medical_dbt/ directory. This includes models, macros, seeds, and snapshots for transforming and analyzing data.

To use dbt inside the container:

docker-compose exec app bash
cd ethio_medical_dbt
dbt debug    # Test dbt connection
dbt run      # Run dbt models

You can add or modify dbt models in ethio_medical_dbt/models/ and use macros or seeds as needed. See the dbt documentation for advanced analytics workflows.

Troubleshooting

  • If you get a FileNotFoundError for logs, ensure the logs/ directory exists or is created before running scripts (see the snippet after this list).
  • For pip install timeouts, try increasing the timeout or using a different PyPI mirror in the Dockerfile.
  • If dbt reports git missing, ensure your Dockerfile installs git before Python dependencies.
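
A defensive way to avoid the missing logs/ directory in any script is to create it before configuring logging. A small illustrative snippet:

import logging
import os

os.makedirs("logs", exist_ok=True)  # create logs/ if it does not exist
logging.basicConfig(filename="logs/scraping.log", level=logging.INFO)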

Orchestration

The orchestration/ directory contains logic for managing ETL jobs and workflow scheduling. You can extend this to automate scraping, enrichment, and loading tasks as needed.
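
As an illustration of how orchestration/jobs.py might chain the stages, a minimal sequential runner could look like this. This is a sketch only; apart from scripts/scraper.py, the script names and run order are assumptions:

import subprocess
import sys

STAGES = [
    "scripts/scraper.py",
    "scripts/enrich.py",  # hypothetical name
    "scripts/load.py",    # hypothetical name
]

def run_pipeline():
    # Run each stage in order, stopping on the first failure
    for stage in STAGES:
        print(f"Running {stage}...")
        result = subprocess.run([sys.executable, stage])
        if result.returncode != 0:
            raise SystemExit(f"{stage} failed with exit code {result.returncode}")

if __name__ == "__main__":
    run_pipeline()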

Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

Support

For questions or support, please open an issue in this repository.

License

MIT License
