A production-ready Retrieval-Augmented Generation (RAG) system for healthcare document retrieval, built with FastAPI, ChromaDB, and OpenAI integration.
This project demonstrates a scalable RAG pipeline designed for healthcare applications. It combines semantic search using sentence transformers with LLM-based answer generation to provide accurate, context-aware responses to medical queries.
- Dual Vector Store Support: ChromaDB (primary) with FAISS fallback for high availability
- Production-Grade Architecture: FastAPI with async support, structured logging, health checks, and monitoring
- LLM Integration: OpenAI GPT-4o-mini for context-aware answer generation
- Containerized Deployment: Docker + Docker Compose with multi-stage builds
- Enterprise Configuration: Environment-based settings with Pydantic validation
- Performance Optimized: Response time tracking, efficient embeddings, automatic retries
- Framework: FastAPI 0.115.0
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Vector Stores: ChromaDB 0.4.24, FAISS 1.9.0
- LLM: OpenAI GPT-4o-mini
- Server: Uvicorn with uvloop
- Containerization: Docker, Docker Compose
```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────┐
│         FastAPI Server          │
│  ┌──────────────────────────┐   │
│  │   Middleware & Logging   │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │      RAG Endpoint        │   │
│  │  - Query validation      │   │
│  │  - Embedding generation  │   │
│  └──────────────────────────┘   │
└────────┬────────────────┬───────┘
         │                │
         ▼                ▼
   ┌──────────┐     ┌──────────┐
   │ ChromaDB │     │  FAISS   │
   │(Primary) │     │(Fallback)│
   └──────────┘     └──────────┘
         │                │
         └───────┬────────┘
                 ▼
        ┌────────────────┐
        │   OpenAI API   │
        │  (GPT-4o-mini) │
        └────────────────┘
```
- Docker & Docker Compose
- Python 3.10+ (for local development)
- OpenAI API key (optional, for answer generation)
- Clone the repository

```bash
git clone <repository-url>
cd RAG_healthcare
```

- Configure environment variables

```bash
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
```

- Launch services

```bash
docker-compose up --build
```

- Test the API

```bash
curl -X POST "http://localhost:8000/rag" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is metformin used for?", "k": 2}'
```

For local development without Docker:

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the application
uvicorn rag_app:app --reload --port 8000
```

`POST /rag`: Retrieve relevant documents and generate an answer.
Request:
```json
{
  "query": "What is metformin used for?",
  "k": 2
}
```

Response:
```json
{
  "query": "What is metformin used for?",
  "retrieved_contexts": [
    "Metformin is commonly prescribed as a first-line therapy for type 2 diabetes."
  ],
  "num_results": 1,
  "answer": "Metformin is commonly prescribed as a first-line therapy for type 2 diabetes. It helps control blood sugar levels in patients with this condition.",
  "processing_time_ms": 245.32,
  "timestamp": "2025-10-26T10:30:45.123456"
}
```
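Equivalently, from Python (a minimal client sketch using `requests`; assumes the service is reachable on `localhost:8000`):

```python
import requests

# Query the RAG endpoint and print the generated answer plus its sources.
resp = requests.post(
    "http://localhost:8000/rag",
    json={"query": "What is metformin used for?", "k": 2},
    timeout=30,
)
resp.raise_for_status()
body = resp.json()

print(body["answer"])
for context in body["retrieved_contexts"]:
    print("-", context)
```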
`GET /health`: Health check endpoint for monitoring.

Response:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2025-10-26T10:30:45.123456",
  "index_size": 4
}
```

`GET /docs`: Interactive API documentation (Swagger UI).
`GET /redoc`: Alternative API documentation (ReDoc).
Environment variables are managed via `.env` and validated through `config.py`:
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | - | OpenAI API key for answer generation |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Sentence transformer model |
| `OPENAI_MODEL` | `gpt-4o-mini` | OpenAI model for answers |
| `USE_CHROMADB` | `true` | Use ChromaDB (falls back to FAISS) |
| `CHROMA_HOST` | `chromadb` | ChromaDB service hostname |
| `CHROMA_PORT` | `8000` | ChromaDB service port |
| `DEBUG` | `false` | Enable debug mode |
| `LOG_LEVEL` | `INFO` | Logging level |
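A rough sketch of what `config.py` can look like with `pydantic-settings` (field names mirror the table above; the actual implementation may differ):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Settings loaded from the environment / .env and validated by Pydantic."""
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    openai_api_key: str | None = None          # OPENAI_API_KEY
    embedding_model: str = "all-MiniLM-L6-v2"  # EMBEDDING_MODEL
    openai_model: str = "gpt-4o-mini"          # OPENAI_MODEL
    use_chromadb: bool = True                  # USE_CHROMADB
    chroma_host: str = "chromadb"              # CHROMA_HOST
    chroma_port: int = 8000                    # CHROMA_PORT
    debug: bool = False                        # DEBUG
    log_level: str = "INFO"                    # LOG_LEVEL

settings = Settings()
```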
```
RAG_healthcare/
├── rag_app.py           # Main FastAPI application
├── config.py            # Configuration management
├── requirements.txt     # Python dependencies
├── Dockerfile           # Multi-stage production build
├── docker-compose.yml   # Service orchestration
├── .env                 # Environment variables
├── data/                # Document storage (optional)
└── README.md            # This file
```
- ChromaDB: Primary choice for persistent, production-grade vector storage with HTTP client support
- FAISS: In-memory fallback for development and high-availability scenarios
- Automatic failover with retry logic ensures continuous operation
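A minimal sketch of that failover path (illustrative only; `faiss_index` and `documents` stand in for the app's pre-built index and document store, not the repository's actual names):

```python
import chromadb
import numpy as np

def search(query_embedding: np.ndarray, k: int = 2) -> list[str]:
    """Query ChromaDB first; fall back to the in-memory FAISS index on failure."""
    try:
        client = chromadb.HttpClient(host="chromadb", port=8000)
        collection = client.get_collection("healthcare_docs")
        result = collection.query(
            query_embeddings=[query_embedding.tolist()], n_results=k
        )
        return result["documents"][0]
    except Exception:
        # Fallback: FAISS similarity search over the same embeddings.
        # faiss_index / documents: assumed pre-built index and document list.
        _, indices = faiss_index.search(
            query_embedding.reshape(1, -1).astype("float32"), k
        )
        return [documents[i] for i in indices[0]]
```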
- Sentence transformers (all-MiniLM-L6-v2) for efficient, high-quality embeddings
- 384-dimensional vectors balance performance and accuracy
- Batch processing support for scaling to larger document sets
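For reference, producing these embeddings with `sentence-transformers` takes only a few lines:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() accepts single strings or batches; batching amortizes model overhead.
embeddings = model.encode(
    ["What is metformin used for?", "Common side effects of statins"],
    batch_size=32,
    normalize_embeddings=True,  # unit vectors: cosine similarity == dot product
)
print(embeddings.shape)  # (2, 384)
```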
- Context-aware prompts with retrieved documents
- Temperature-controlled generation (0.7) for balanced creativity
- Graceful degradation: returns retrieval results even if LLM fails
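A hedged sketch of the generation step (prompt wording and the helper name are illustrative, not the repository's actual code):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(query: str, contexts: list[str]) -> str | None:
    """Build a context-aware prompt; return None so callers can degrade gracefully."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {query}"
    )
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        return response.choices[0].message.content
    except Exception:
        # Graceful degradation: the endpoint still returns retrieval results.
        return None
```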
- Request/response timing middleware with `X-Process-Time` headers (see the sketch after this list)
- Global exception handling with debug-aware error messages
- Structured logging with timestamp, severity, and context
- Health checks compatible with Kubernetes/Docker Swarm
- Non-root container user for security
- Multi-stage Docker builds for minimal image size
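The timing middleware can be as small as this FastAPI sketch (illustrative, not necessarily the repository's exact code):

```python
import time
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    """Time each request and expose the duration via an X-Process-Time header."""
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    response.headers["X-Process-Time"] = f"{elapsed_ms:.2f}"
    return response
```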
- Embedding Generation: ~50-100ms (sentence-transformers)
- Vector Search: ~5-20ms (ChromaDB/FAISS)
- LLM Generation: ~200-500ms (OpenAI API)
- Total Response Time: ~250-600ms end-to-end
- API keys loaded from environment (never committed)
- Non-root container execution
- CORS middleware with configurable origins
- Input validation with Pydantic (see the request-model sketch after this list)
- Request size limits and query length constraints
- Health check endpoints don't expose sensitive data
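A request model enforcing such constraints might look like this (the exact bounds are illustrative):

```python
from pydantic import BaseModel, Field

class RAGQuery(BaseModel):
    """Body for POST /rag; length and range limits reject malformed input early."""
    query: str = Field(..., min_length=3, max_length=512)
    k: int = Field(default=2, ge=1, le=10)
```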
- Stateless API design allows multiple replicas
- ChromaDB supports distributed deployment
- Load balancer (Nginx/Traefik) for traffic distribution
- Increase uvicorn workers (CPU-bound; see the sketch after this list)
- Optimize embedding model size vs. accuracy
- Batch processing for bulk queries
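Raising the worker count is a launch-time change; a minimal sketch, assuming the default `rag_app:app` module layout:

```python
import uvicorn

if __name__ == "__main__":
    # Multiple workers exploit several cores for the CPU-bound embedding step;
    # the stateless design means each worker serves requests independently.
    uvicorn.run("rag_app:app", host="0.0.0.0", port=8000, workers=4)
```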
- Add Pinecone/Milvus for cloud-native vector storage
- Implement caching layer (Redis) for frequent queries
- Add BM25 hybrid search for keyword matching
- Integrate reranking models for improved relevance
- Implement rate limiting and API key authentication
- Add unit tests (pytest + pytest-asyncio)
- Implement query caching
- Add document upload endpoint
- Support multiple document formats (PDF, DOCX)
- Implement user authentication
- Add metrics and monitoring (Prometheus)
- Create evaluation pipeline (ROUGE, BLEU)
MIT License - see LICENSE file for details
For questions or collaboration opportunities, please open an issue or reach out via GitHub.
Built with FastAPI • ChromaDB • OpenAI • Docker