A comprehensive MLOps solution for deploying, serving, and monitoring fine-tuned LLMs.
This project demonstrates an end-to-end LLM deployment with a focus on:
- Serving fine-tuned Llama 3.2 1B models with vLLM
- Backend API with LangChain for structured LLM interactions
- Frontend interface with Gradio
- Comprehensive monitoring with Prometheus and Grafana
- Log aggregation with Loki
- vLLM API: Serves the Llama 3.2 1B base model and custom LoRA adapters (see the request sketch after this list)
- Backend: FastAPI service that handles prompting, model selection, and response formatting
- Frontend: Gradio web interface for easy interaction with the models
- Monitoring: Prometheus, Grafana, and Loki for observability
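As a concrete illustration of the vLLM API component, the sketch below calls the server's OpenAI-compatible endpoint directly from Python. vLLM selects a registered LoRA adapter when its name is passed as the `model` field; the adapter name `sentiment-lora`, the API key handling, and the prompt are illustrative assumptions, not names defined by this repository.

```python
import os
from openai import OpenAI

# Point the OpenAI client at the local vLLM server's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),  # placeholder if no key is enforced
)

# A LoRA adapter registered with the server is selected by using its name as
# the model. "sentiment-lora" is an illustrative adapter name.
response = client.chat.completions.create(
    model="sentiment-lora",
    messages=[{"role": "user", "content": "Classify the sentiment: 'The product arrived late and broken.'"}],
    max_tokens=32,
    temperature=0.0,
)
print(response.choices[0].message.content)
```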
- Sentiment Analysis: Analyzes text sentiment using a fine-tuned model
- Medical QA: Answers medical multiple-choice questions with domain-specific tuning (an example request is sketched below)
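To illustrate the Medical QA use case, the sketch below posts a multiple-choice question to the backend API. The route `/medical-qa`, the payload, and the response shape are assumptions made for illustration; the actual contract is defined by the backend service.

```python
import requests

# Hypothetical route and schema; the real backend API may differ.
payload = {
    "question": "Which vitamin deficiency causes scurvy?",
    "choices": ["A. Vitamin A", "B. Vitamin B12", "C. Vitamin C", "D. Vitamin D"],
}
resp = requests.post("http://localhost:8001/medical-qa", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # e.g. {"answer": "C"} under the assumed schema
```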
Start the services in order, running each step from the repository root:
- Set up the network: `docker network create aio-network`
- Start the monitoring stack: `cd monitor && docker compose up -d`
- Launch the vLLM API server: `cd vllm_api && docker compose up -d`
- Start the backend API: `cd backend && docker compose up -d`
- Launch the frontend application: `cd frontend && docker compose up -d`
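Once the containers are up, a quick way to confirm the vLLM server is healthy is to list the models it serves through the OpenAI-compatible API. This is a minimal sketch; the API key is only needed if the server enforces one.

```python
import os
from openai import OpenAI

# List the models served by vLLM (the base model plus any registered LoRA adapters).
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),  # placeholder if no key is enforced
)
for model in client.models.list():
    print(model.id)
```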
Once everything is running, the services are available at:
- vLLM API: http://localhost:8000
- Backend API: http://localhost:8001
- Gradio UI: http://localhost:7861
- Open WebUI: http://localhost:8080
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
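To check that metrics are flowing end to end, you can query Prometheus's HTTP API for one of the gauges the vLLM server exports. The metric name `vllm:num_requests_running` assumes a recent vLLM version and that Prometheus is scraping the vLLM container; adjust it to whatever your deployment exposes.

```python
import requests

# Ask Prometheus for a vLLM gauge via its HTTP query API.
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "vllm:num_requests_running"},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```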
To benchmark serving performance, export the vLLM API key as an OpenAI-style key and run the make target:
- `export OPENAI_API_KEY=<your vllm api key>`
- `make bench_serving`
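For a quick sanity check before running the full benchmark, the sketch below times a handful of chat completions against the OpenAI-compatible endpoint. It is a rough, hand-rolled measurement under assumed defaults, not the project's `bench_serving` target.

```python
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
)

# Use the first served model and time a few short completions; this is a
# sanity check only, not a substitute for `make bench_serving`.
model_name = client.models.list().data[0].id
latencies = []
for _ in range(5):
    start = time.perf_counter()
    client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
        max_tokens=16,
    )
    latencies.append(time.perf_counter() - start)

print(f"mean latency over {len(latencies)} requests: {sum(latencies) / len(latencies):.2f}s")
```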