I train small language models and study how far reinforcement learning and post-training can push them. Most of my work sits at the intersection of post-training methods (SFT, RLHF, GRPO, reward modeling) and sub-1B models, testing whether techniques developed for large models still hold up at roughly 70x smaller scale.

I'm currently an Analyst at Lowe's (production NLP & ML) and do post-training and RL research on my own time. Top 100 @ HuggingFace x Meta OpenEnv Hackathon (Bangalore).

| Project | Description |
|---|---|
| SHADE-GYM | OpenEnv-native RL gym for AI safety. Trained a 1.5B-parameter monitor with LoRA via GRPO, improving AUROC from 0.500 → 0.893, closing ~40% of the gap to a frontier model at <0.1% of the cost. |
| post-training-experiments | Domain-adapted SmolLM-135M with QLoRA, achieving −20.1% perplexity and +25.4% ROUGE-L, with a writeup analyzing which post-training interventions actually made a difference. |
| Diffusion | Built diffusion language models from scratch, including a ~150M ModernBERT-based model and a 45M first-principles implementation. |
| TenderIQ | Production document-evaluation system featuring multi-tier OCR, RAG, and human-in-the-loop review for scalable document analysis. |