A FastAPI-based REST API service that integrates Mistral LLM via Ollama to answer user queries using a predefined knowledge base, with comprehensive query logging and retrieval functionality.
This is a production-ready FAQ assistant that implements a RAG (Retrieval-Augmented Generation) pipeline to answer customer questions using a local Mistral LLM model. The service combines semantic search with keyword matching to provide accurate, contextually relevant answers from a structured knowledge base.
Development Approach: This project demonstrates strategic AI tool utilization throughout the software development lifecycle, showcasing how different LLMs can be methodically employed for planning, implementation, quality assurance, and production enhancement.
- REST API with FastAPI framework
- Mistral LLM integration via Ollama for local, privacy-focused inference
- Advanced RAG Implementation with hybrid vector similarity and BM25 retrieval
- SQLite persistence with SQLAlchemy ORM for query/response logging
- Docker support for containerized deployment
- Comprehensive logging and error handling
- Environment-aware configuration (auto-detects Docker vs local development)
The service implements a sophisticated RAG pipeline that combines:
- Semantic Search: Vector embeddings using Ollama's nomic-embed-text model
- Keyword Search: BM25 retrieval for exact term matching
- Hybrid Ensemble: Weighted combination (60% semantic, 40% keyword) for optimal results
- Smart URL Detection: Automatically configures Ollama endpoints for Docker or local environments
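As an illustration of how such a hybrid retriever can be assembled with LangChain, here is a minimal sketch; the helper name build_hybrid_retriever is hypothetical, the weights and top-k values follow the description above, and the actual wiring lives in src/rag.py:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

def build_hybrid_retriever(documents, ollama_url: str = "http://localhost:11434") -> EnsembleRetriever:
    # Semantic side: Chroma vector store over nomic-embed-text embeddings served by Ollama
    embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url=ollama_url)
    vector_store = Chroma.from_documents(documents, embeddings, persist_directory="./chroma_db")
    semantic_retriever = vector_store.as_retriever(search_kwargs={"k": 3})

    # Keyword side: BM25 over the same documents
    keyword_retriever = BM25Retriever.from_documents(documents)
    keyword_retriever.k = 3

    # Weighted ensemble: 60% semantic, 40% keyword
    return EnsembleRetriever(
        retrievers=[semantic_retriever, keyword_retriever],
        weights=[0.6, 0.4],
    )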
- Framework: FastAPI with async/await support
- LLM: Mistral via Ollama (local inference)
- Vector Store: ChromaDB for embeddings persistence
- Database: SQLite with SQLAlchemy ORM
- Retrieval: LangChain with EnsembleRetriever
- Containerization: Docker image optimized with the uv package manager
This implementation meets all the specified requirements:
- POST /ask: Accepts queries and returns LLM-generated answers
- GET /history: Retrieves past queries with configurable limit (default: 10)
- Mistral LLM integration via Ollama for local inference
- RAG implementation with semantic and keyword search
- Context injection with structured prompts
- Factual assistant prompt designed for accurate FAQ responses
- Context-aware responses using retrieved knowledge base segments
- Fallback handling for out-of-scope queries
- SQLite database with SQLAlchemy ORM
- Query/response logging with automatic timestamps
- History retrieval with configurable limits
- Comprehensive logging across all modules
- Error handling with proper HTTP status codes
- Docker support with production-ready Dockerfile
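To illustrate how the endpoints listed above and their request/response schemas fit together, here is a minimal sketch; the class names AskRequest, AskResponse, and HistoryEntry are placeholders, and the actual definitions in src/schemas.py and src/routes.py may differ:
from datetime import datetime
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class AskRequest(BaseModel):
    question: str

class AskResponse(BaseModel):
    answer: str

class HistoryEntry(BaseModel):
    question: str
    answer: str
    timestamp: datetime

@router.post("/ask", response_model=AskResponse)
async def ask(payload: AskRequest) -> AskResponse:
    # 1) retrieve relevant knowledge base chunks, 2) query Mistral via Ollama,
    # 3) log the question/answer pair, 4) return the answer (implementation omitted)
    ...

@router.get("/history", response_model=list[HistoryEntry])
async def history(limit: int = 10) -> list[HistoryEntry]:
    # Return the most recent logged interactions, newest first (implementation omitted)
    ...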
- Install Ollama from ollama.com
- Pull required models:
ollama pull mistral
ollama pull nomic-embed-text
- Start Ollama server (important for Docker):
# PowerShell (Windows); on Linux/macOS use: OLLAMA_HOST="0.0.0.0:11434" ollama serve
$env:OLLAMA_HOST="0.0.0.0:11434"; ollama serve
- Clone the repository:
git clone <repository-url>
cd faq-rag-service
- Create and activate a virtual environment:
# Using venv
python -m venv .venv
# Activate on Windows
.venv\Scripts\activate
# Activate on Linux/Mac
source .venv/bin/activate
- Install dependencies:
# Using pip
pip install -r requirements.txt
# Or using uv (faster, recommended)
uv pip install -r requirements.txt
- Verify Ollama is running:
ollama list  # Should show mistral and nomic-embed-text models
- Run the service:
uvicorn src.main:app --reload --port 8000
- Access the endpoints:
  - API Documentation: http://localhost:8000/docs
  - Service Status: http://localhost:8000/
- Build the optimized image:
docker build -t faq-rag-service .
- Run the container:
docker run -d -p 8000:8000 --name faq-rag faq-rag-service
Important: Ensure Ollama is running with OLLAMA_HOST="0.0.0.0:11434" before starting the container.
POST /ask: Submit a question and receive an AI-generated answer.
Request:
{
"question": "What is your refund policy?"
}
Response:
{
"answer": "Month-to-month plans are non-refundable, but if we miss our SLA you'll receive automatic service credits. Annual upfront plans may be cancelled within 30 days for a prorated refund minus used-month charges."
}
GET /history: Retrieve previous questions and answers with timestamps.
Query Parameters:
limit: Integer (default: 10) - Number of recent interactions to return
Response:
[
{
"question": "What is your refund policy?",
"answer": "Month-to-month plans are non-refundable...",
"timestamp": "2025-07-22T01:30:00"
}
]
Alternative Parameter: The API also supports the n parameter for compatibility:
GET /history?n=35
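For quick programmatic checks, the same endpoints can also be called from Python, for example with the requests package (shown purely as an illustration; requests is not necessarily part of the project's dependencies):
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What is your refund policy?"},
    timeout=120,  # generous timeout, since LLM inference can take a while
)
print(resp.json()["answer"])

history = requests.get("http://localhost:8000/history", params={"limit": 5}, timeout=30)
for item in history.json():
    print(item["timestamp"], item["question"])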
GET / (root): Service status and available endpoints.
Response:
{
"message": "FAQ-RAG Service is running",
"endpoints": ["/ask", "/history"],
"status": "active"
}
The service uses a structured Q&A format stored in data/knowledge_base.txt:
Q: What is your refund policy?
A: Month-to-month plans are non-refundable, but if we miss our SLA you'll receive automatic service credits as outlined above. Annual upfront plans may be cancelled within 30 days for a prorated refund minus used-month charges.
---
Q: How can I contact support?
A: You can reach our support team via email at support@cloudsphere.com or call our hotline at +1 800 555-0199. Our technical support hours are Monday through Friday, 9 AM to 6 PM EST.
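A parser for this format can be sketched roughly as follows (illustrative only; the real implementation is parse_knowledge_base() in src/parser.py and may differ in its details):
from pathlib import Path
from langchain_core.documents import Document

def parse_knowledge_base(path: str = "data/knowledge_base.txt") -> list[Document]:
    text = Path(path).read_text(encoding="utf-8")
    documents = []
    # Entries are separated by "---"; each entry contains a "Q:" line followed by an "A:" block
    for block in text.split("---"):
        block = block.strip()
        if not block:
            continue
        question = block.split("\nA:", 1)[0].removeprefix("Q:").strip()
        documents.append(Document(page_content=block, metadata={"question": question}))
    return documents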
The CloudSphere knowledge base was created through a strategic multi-stage AI process:
Three different prompt strategies were developed and evaluated:
- Comprehensive coverage approach: Broad FAQ categories with balanced complexity
- Customer journey mapping: Questions organized by user lifecycle stages
- Support ticket analysis: FAQ based on common support scenarios
The selected prompt strategy was implemented to generate the final knowledge base:
You are creating a comprehensive FAQ knowledge base for a mid-sized technology company that provides cloud services and software solutions. Generate 25-30 FAQ pairs that cover:
1. Product information and features
2. Pricing and billing questions
3. Technical support and troubleshooting
4. Account management and security
5. Integration and API questions
6. Service availability and maintenance
7. Data privacy and compliance
8. Refund and cancellation policies
For each FAQ pair, use this exact format:
Q: [Question that a real customer might ask]
A: [Detailed, helpful answer that provides clear guidance and next steps where appropriate]
Make the questions varied in complexity - include both simple ("What are your business hours?") and more complex technical questions.
This multi-stage approach demonstrates how different AI tools can be strategically combined for optimal content quality - using Claude's analytical strength for prompt design and ChatGPT's content generation capabilities for the final output.
# Run unit tests
pytest
# Run API tests with verbose output
pytest tests/test_api.py -v
# Run specific test modules
pytest tests/test_parser.py
pytest tests/test_rag.py
Using curl (Linux/Mac):
# Test the ask endpoint
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"question": "What is your refund policy?"}'
# Test the history endpoint
curl http://localhost:8000/history?limit=5
Using PowerShell (Windows):
# Test the ask endpoint
Invoke-RestMethod -Uri http://localhost:8000/ask -Method Post -ContentType "application/json" -Body '{"question": "What is your refund policy?"}'
# Test the history endpoint
Invoke-RestMethod -Uri http://localhost:8000/history?limit=5
Create and activate virtual environment:
# Create virtual environment
python -m venv .venv
# Activate on Linux/macOS
source .venv/bin/activate
# Activate on Windows
.venv\Scripts\activate
Install dependencies with uv (recommended):
# Install project in editable mode with all dependencies
uv pip install -e .
# Or install from requirements.txt
uv pip install -r requirements.txt
Unit Testing:
# Run all tests
pytest tests
# Run with verbose output
pytest tests -v
# Run specific test file
pytest tests/test_api.py
# Run with coverage
pytest tests --cov=src
Code Quality:
# Linting (removing unnecessary imports, etc.)
ruff check --fix src tests
# Code formatting
ruff format src tests
# Both linting and formatting
ruff check --fix src tests && ruff format src tests
Development Server:
# Standard uvicorn
uvicorn src.main:app --reload --port 8000
# Using explicit Python path (if environment issues)
python -m uvicorn src.main:app --reload --port 8000
For a complete development setup from scratch:
# 1. Clone and navigate
git clone <repository-url>
cd faq-rag-service
# 2. Setup environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
# 3. Install dependencies
uv pip install -e .
# 4. Verify setup
pytest tests
ruff check src tests
# 5. Start development server
uvicorn src.main:app --reload --port 8000

faq-rag-service/
├── src/                          # Core application code
│   ├── __init__.py               # Package initialization
│   ├── main.py                   # FastAPI app initialization
│   ├── routes.py                 # API endpoint definitions
│   ├── models.py                 # SQLAlchemy database models
│   ├── schemas.py                # Pydantic request/response models
│   ├── database.py               # Database configuration
│   ├── crud.py                   # Database operations
│   ├── parser.py                 # Knowledge base parser
│   └── rag.py                    # RAG service implementation
├── tests/                        # Unit and integration tests
│   ├── __init__.py               # Package initialization
│   ├── test_api.py               # API endpoint testing
│   ├── test_crud.py              # CRUD operations testing
│   ├── test_parser.py            # Parser testing
│   └── test_rag.py               # RAG service testing
├── experiments/                  # Research and comparison scripts
├── data/                         # Knowledge base and configurations
│   ├── knowledge_base.txt        # FAQ knowledge base
│   └── knowledge_base_prompt.txt # Knowledge base prompts
├── docs/                         # Project documentation
│   └── assignment_sft.txt        # Assignment specifications
├── results/                      # Test results and analytics
│   └── README.md                 # Results documentation
├── requirements.txt              # Python dependencies
├── Dockerfile                    # Container configuration
├── pytest.ini                    # Pytest settings
├── .gitignore                    # Git ignore rules
└── README.md                     # This documentation
The service automatically detects its environment and configures Ollama endpoints accordingly:
- Local Development: Uses http://localhost:11434
- Docker Container: Uses http://host.docker.internal:11434
This is handled by the get_ollama_base_url() function in src/rag.py.
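Conceptually, the detection can look something like the sketch below; the OLLAMA_BASE_URL override and the /.dockerenv check are assumptions for illustration, and the actual function in src/rag.py may rely on different signals:
import os
from pathlib import Path

def get_ollama_base_url() -> str:
    # Explicit override wins (hypothetical convenience variable, not necessarily supported)
    if url := os.getenv("OLLAMA_BASE_URL"):
        return url
    # The presence of /.dockerenv is a common heuristic for "running inside Docker"
    if Path("/.dockerenv").exists():
        return "http://host.docker.internal:11434"
    return "http://localhost:11434"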
Database:
- SQLite database automatically created as ./test.db
- Vector store persisted in ./chroma_db/
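The logged interactions map naturally onto a single SQLAlchemy model along these lines (a sketch only; the real model in src/models.py, including its table name, may differ):
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class QueryLog(Base):
    __tablename__ = "query_log"  # hypothetical table name

    id = Column(Integer, primary_key=True, index=True)
    question = Column(Text, nullable=False)
    answer = Column(Text, nullable=False)
    timestamp = Column(DateTime, default=datetime.utcnow)  # set automatically on insert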
Retrieval Parameters:
- Semantic retriever: Top 3 results
- BM25 retriever: Top 3 results
- Ensemble weights: 60% semantic, 40% keyword
LLM Settings:
- Model: mistral
- Temperature: 0.3 (for consistent responses)
- Embedding model: nomic-embed-text
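Taken together, these settings suggest a chain roughly like the following sketch, which reuses the parse_knowledge_base and build_hybrid_retriever helpers sketched earlier; the prompt wording is illustrative and the actual chain in src/rag.py may be structured differently:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama

# Factual assistant prompt with the retrieved context injected
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a factual FAQ assistant. Answer using only the context below.\n"
        "If the answer is not in the context, say so instead of guessing.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

documents = parse_knowledge_base()  # helper sketched above
llm = ChatOllama(model="mistral", temperature=0.3, base_url=get_ollama_base_url())

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=build_hybrid_retriever(documents),  # helper sketched above
    chain_type_kwargs={"prompt": prompt},
)

answer = qa_chain.invoke({"query": "What is your refund policy?"})["result"]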
The Dockerfile includes several optimizations:
- uv package manager: 30% faster installations compared to pip
- Multi-layer caching: Optimizes rebuild times
- Minimal base image: Python 3.11-slim for reduced size
- With uv: ~38.5s build time, 710MB image
- With pip: ~53.6s build time, 862MB image
- Non-root user: Security best practice
- Health checks: Built-in endpoint monitoring
- Environment variable support: Flexible configuration
- Proper signal handling: Graceful shutdowns
This project demonstrates a methodical approach to leveraging multiple AI tools throughout the software development lifecycle, showcasing how different LLMs can be strategically employed for their unique strengths.
Initial consultation and education approach:
- Day-by-day development plan: Created a structured timeline breaking down RAG system components
- Interactive learning sessions: In-depth discussions about FastAPI endpoints, CRUD operations, and database design patterns
- RAG methodology analysis: Comprehensive evaluation of different retrieval approaches with practical examples
- Architecture decision making: Guided comparison of vector search vs. hybrid approaches
This phase was crucial for building deep understanding rather than just copying code - similar to working with a senior architect who explains the 'why' behind each decision.
Core development and function completion:
- Function implementation: Real-time assistance in completing class methods and API handlers
- Code logic refinement: Interactive debugging and optimization of algorithmic components
- Integration patterns: Seamless connection between FastAPI routes, database operations, and RAG pipeline
- Type hint completion: Ensuring production-ready code with proper typing throughout
ChatGPT o4-mini-high proved excellent for rapid iteration during active coding sessions, providing immediate context-aware suggestions.
Production readiness and comprehensive testing:
- Agent-based code review: Automated identification and resolution of compatibility issues between modules
- Test suite generation: Complete pytest implementation covering unit tests, integration tests, and API endpoints
- Documentation creation: Professional README structure with comprehensive API documentation
- Docker containerization: Production-ready Dockerfile with optimization best practices
- Configuration management: Environment detection and proper .gitignore setup
The agent-based approach was particularly powerful for systematic code quality improvements across the entire codebase.
Repository-wide optimization via GitHub integration:
- Codebase analysis: Direct repository import for comprehensive code review
- Logging enhancement: Strategic placement of logging statements throughout the application
- Error handling improvements: Robust exception handling with proper HTTP status codes
- Pull request workflow: Professional code review process with detailed improvement suggestions
First experience with Codex demonstrated the power of repository-level analysis for identifying patterns and improvements.
Multi-stage content generation process:
- Prompt engineering (Claude Opus 4): Created three different approaches for FAQ generation
- Content creation (ChatGPT o3): Selected optimal prompt strategy and generated comprehensive CloudSphere knowledge base
- Quality validation: Ensured realistic SaaS company scenarios with appropriate technical depth
| Development Phase | Primary Tool | Rationale |
|---|---|---|
| Planning & Learning | Claude Opus 4 | Superior explanatory capabilities and architectural thinking |
| Active Coding | ChatGPT o4-mini-high | Fast, context-aware code completion during development |
| Quality & Integration | GitHub Copilot + Sonnet 4 | Agent-based systematic improvements across entire codebase |
| Production Polish | ChatGPT Codex | Repository-level analysis and enterprise-grade enhancements |
- Tool specialization: Each AI excels in specific development phases
- Learning vs. doing: Use conversational AI for education, coding AI for implementation
- Quality at scale: Agent-based tools excel at systematic improvements across large codebases
- Repository integration: Modern AI tools can understand entire project context for better suggestions
This methodical approach to AI tool utilization resulted in faster development cycles while maintaining code quality and comprehensive understanding of the implemented solutions.
- Average response time: 8-95 seconds (depending on query complexity and CPU/GPU speed)
- First query may be slower due to model loading
- Subsequent queries benefit from warm model cache
- Stateless design: Horizontal scaling ready
- Database connection pooling: Efficient resource usage
- Persistent vector store: Faster startup after initial indexing
- Configurable retrieval: Tunable performance vs accuracy
- Model caching: Ollama handles efficient model memory management
- Vector store: ChromaDB provides optimized similarity search
- Database: SQLite suitable for moderate traffic loads
- Pydantic schemas: Automatic request validation
- SQL injection prevention: SQLAlchemy ORM protection
- Error sanitization: Safe error messages without internal details
- Structured logging: Configurable levels across all components
- Request tracking: Full request/response logging for debugging
- Health checks: Built-in service status monitoring
- Performance metrics: Response time and error rate tracking
- Environment variables: Externalized configuration
- Graceful shutdowns: Proper SIGTERM handling
- Resource limits: Container memory and CPU constraints
- Network security: Configurable CORS and allowed origins
1. "Connection refused" errors:
- Ensure Ollama is running: ollama list should show the models
- For Docker: start Ollama with $env:OLLAMA_HOST="0.0.0.0:11434"; ollama serve
- Check firewall settings if running in containers
2. "ModuleNotFoundError: No module named 'rank_bm25'":
- This dependency is included in requirements.txt
- Rebuild Docker image if using containers
3. Slow response times:
- First query loads the model (normal behavior)
- Check system resources during LLM inference
- Consider using faster embedding models for production
4. Database errors:
- Database is auto-created on first run
- Check write permissions in application directory
- Review logs for SQLAlchemy connection issues
# Check Ollama status
ollama list
curl http://localhost:11434/api/tags
# View service logs
docker logs faq-rag-service
# Test database connection
python -c "from src.database import engine; print(engine.url)"
# Verify knowledge base parsing
python -c "from src.parser import parse_knowledge_base; docs = parse_knowledge_base(); print(f'Loaded {len(docs)} documents')"