AI/ML Engineer Building production LLM systems | Optimizing for cost, latency, and reliability
Current Focus: Fine-tuning specialized models for agent architectures (QLoRA, Unsloth) and shipping hybrid ReAct/Reflexion systems that balance speed vs correctness.
Target Roles: AI/ML Engineer, LLM Engineer, and Agent Systems Engineer
Key Impact & Competencies:
- Production Engineering: Reduced LLM costs by 73% via prompt optimization and model routing strategies.
- System Design: Architected hybrid agent systems handling 2,000+ production runs with <5% parser failures.
- Cost-Performance Optimization: Extensive benchmarking of latency, cost, and reliability to drive architectural decisions.
- Full-Stack ML: Expertise in Fine-tuning (QLoRA), RAG (AST-aware chunking), and Infrastructure (Docker, FastAPI).
Goal: Replace GPT-4 with a specialized 8B model for agent tool routing. View Project
- Problem: AI agents were incurring high costs ($0.15/call) and latency using GPT-4 for simple tool selection tasks.
- Solution: Fine-tuned Llama-3.1-8B (QLoRA) on 5,000 synthetic examples with strict Pydantic schemas. Implemented anti-hallucination constraints and 20% token compression via null-stripping.
- Impact: Achieved 10x cost reduction and 5x faster inference compared to the baseline.
Goal: Create a "dual-brain" agent switching between fast execution and self-correction. View Project
- Innovation: Multi-model orchestration using Llama 3.3 for speed and Gemini 2.5 for evaluation.
- Production Wins: Developed a battle-hardened parser to handle markdown fences and malformed JSON, resulting in ** <5%** parser crashes** across 2,000+ runs.
- Optimization: Implemented strict tool output truncation (reducing 50k to 5k tokens) and exponential backoff for API resilience.
Goal: Improve RAG accuracy for code generation tasks. View Project
- Problem: Naive chunking split code at arbitrary boundaries, breaking function context and causing hallucinations.
- Solution: Implemented AST-based chunking to treat functions, classes, and methods as atomic units.
- Impact: Increased Q&A accuracy from 45% to 82%.
Goal: Group 10K+ customers by spending behavior for targeted marketing.
- Challenge: Initial clustering models yielded a low Silhouette score (0.32) due to outliers.
- Solution: Applied Winsorization and advanced feature engineering.
- Result: improved Silhouette score to 0.58; pipeline processes 10,000 records in 2.3 seconds.
- Machine Learning: PyTorch, Transformers (Hugging Face), QLoRA, Unsloth, LangChain, RAG
- LLMs: Llama 3 (Fine-tuning), GPT-4 (API), Gemini 2.5, Groq (Inference)
- Infrastructure: Docker, FastAPI, PostgreSQL, Redis (Semantic Caching)
- Tools: Git, Linux, VSCode, Jupyter, Pytest
I focus on production-ready systems, moving beyond notebook demos.
- Fail Gracefully: LLMs hallucinate. Systems must be designed with Pydantic schemas, hardened parsers, and fallback strategies.
- Data Quality > Model Size: A 7B model on clean data outperforms GPT-4 on poor data. I prioritize data pipelines and cleaning.
- Cost as a Metric: API tokens are treated as budget items. Every project includes a cost-performance benchmark.
- Build, Measure, Learn: I prioritize shipping v0.1 quickly to gather usage data and iterate based on production failures rather than theory.
Active Areas of Interest:
- Mentorship: Scaling ML systems (10M+ records), distributed training, and Kubernetes.
- Open Source: Contributing to LLM tooling (LangChain, Unsloth, vLLM).
- Learning: Distributed inference (vLLM, Ray), vector databases (Pinecone, Weaviate), and LLM observability (LangSmith, Weights & Biases).