Sooraj rsoorajs

🧠 About Me

I am an AI Engineer based in Dubai. I build LLM-powered products and take them from the first architecture decision through to a running production system.

🤖 I work as a Generative AI Engineer at Gama Security Systems, where I build RAG systems, agent workflows, and copilot features.
🏗️ I handle the full delivery of AI features. That covers retrieval architecture, fine-tuning, evaluation pipelines, and keeping cost and latency under control once things are live.
⚡ I am comfortable starting work before the requirements are fully clear. I pick the tools, make a call, and ship something that works.
🧬 I recently reimplemented DeepSeek V2 Lite, a 15.7B-parameter Mixture of Experts model, from scratch in PyTorch.
🌱 I follow new work in LLMs, AI agents, and inference optimization closely.

🛠️ Tech Stack

🤖 AI & Machine Learning

☁️ MLOps & Cloud

🗄️ Backend & Data

🎨 Frontend

🚀 Featured Projects

🧬 An LLM System Built for Production: Custom Transformer with MLA and MoE

I reimplemented DeepSeek V2 Lite from scratch in PyTorch. It is a 15.7B-parameter Mixture of Experts model that keeps 2.4B parameters active per token.

🔬 I implemented Multi-head Latent Attention, which cuts the KV cache by 86%. The model runs a 64-expert MoE with top-6 routing and uses RoPE positional encoding with YaRN scaling. I checked the output of every layer against the HuggingFace reference implementation to confirm the numbers matched.
🧮 I applied INT8 quantization to bring inference memory from 31 GB down to around 16 GB, so the model runs on consumer hardware. The critical layers stay in full precision to protect output quality.
🚀 I built an OpenAI-compatible streaming inference server with FastAPI and SSE. Redis Streams sits behind it as a token bus, so clients can reconnect and replay from their last position, and the server scales without sticky sessions.
☁️ I deployed it as three services on AWS. Separating the code containers from the model weights, then using SageMaker Async Inference with scale-to-zero, brought the monthly hosting cost from around $730 down to around $12.
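
For readers new to MoE, top-6 routing over 64 experts can be sketched in a few lines of plain Python. This is an illustration only, not the code from the project: the function name `route_top_k`, the toy gate logits, and the choice to renormalise the selected weights are all assumptions.

```python
import math

def route_top_k(gate_logits, k=6):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts; subtracting the max keeps exp() stable.
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k largest probabilities, best first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Assumption: renormalise so the selected weights sum to 1.
    # Some routers keep the raw softmax values instead.
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

# 64 made-up gate logits standing in for one token's router output.
logits = [math.sin(i * 0.37) for i in range(64)]
selected = route_top_k(logits, k=6)
```

Only the six selected experts run for that token, which is why the active parameter count stays far below the total.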

PyTorch FastAPI LangGraph AWS SageMaker Docker Terraform Redis Streams
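
The memory figures quoted for INT8 quantization line up with simple arithmetic on the parameter count. The sketch below assumes the 31 GB baseline corresponds to 16-bit weights (2 bytes per parameter) and counts weights only; the helper name `weight_memory_gb` is hypothetical.

```python
PARAMS = 15.7e9  # total parameter count quoted above

def weight_memory_gb(n_params, bytes_per_param):
    """Weight storage in decimal gigabytes, ignoring activations and KV cache."""
    return n_params * bytes_per_param / 1e9

bf16_gb = weight_memory_gb(PARAMS, 2)  # assumed 16-bit baseline, about 31.4 GB
int8_gb = weight_memory_gb(PARAMS, 1)  # all weights in INT8, about 15.7 GB
```

Keeping a few critical layers in full precision adds a little on top of the INT8 figure, which is consistent with the "around 16 GB" total.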

🤝 A RAG Personal Assistant with Agentic Reasoning

A personal assistant that uses hybrid retrieval and the Model Context Protocol, and handles multi-step reasoning on its own.

🎯 I built a hybrid RAG pipeline that combines dense embeddings, sparse BM25, and a cross-encoder for reranking. Retrieval accuracy went from 0% to 100% on a curated evaluation set.
🧠 The assistant runs a ReAct agent on LangGraph. It picks from more than 16 tools through the Model Context Protocol and reasons across several steps.
🔌 Every swappable part sits behind an abstract interface. I can A/B test different RAG setups by changing environment variables, with no code changes.
🛠️ It runs on AWS with EC2 and ECR, Docker for containers, Nginx for SSL, and a GitHub Actions pipeline that tests and deploys on every merge.
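
One common way to merge a dense ranking and a BM25 ranking before the reranking step is reciprocal rank fusion. The sketch below assumes that technique; the fusion method used in this project is not stated here, and the document IDs and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["d3", "d1", "d2"]   # hypothetical dense-embedding ranking
sparse = ["d1", "d4", "d3"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
# fused == ["d1", "d3", "d4", "d2"]
```

Documents that appear high in both lists rise to the top, and a cross-encoder would then rescore only this short fused list.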

LangGraph OpenAI Qdrant MCP Docker AWS Nginx
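
The idea of swapping RAG components through environment variables can be sketched as an abstract interface plus a small registry. Everything named here is hypothetical: the `Retriever` base class, the two toy implementations, and the `RAG_RETRIEVER` variable are stand-ins, not the project's actual code.

```python
import os
from abc import ABC, abstractmethod
from typing import List

class Retriever(ABC):
    """Interface that every swappable retriever implements."""
    @abstractmethod
    def retrieve(self, query: str, k: int) -> List[str]:
        ...

class KeywordRetriever(Retriever):
    """Toy retriever: ranks docs by how many query words they share."""
    def __init__(self, docs):
        self.docs = docs

    def retrieve(self, query, k):
        terms = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: len(terms & set(d.lower().split())),
            reverse=True,
        )
        return ranked[:k]

class FirstKRetriever(Retriever):
    """Toy baseline: returns the first k docs regardless of the query."""
    def __init__(self, docs):
        self.docs = docs

    def retrieve(self, query, k):
        return self.docs[:k]

REGISTRY = {"keyword": KeywordRetriever, "first_k": FirstKRetriever}

def build_retriever(docs, env=os.environ):
    # RAG_RETRIEVER is a hypothetical variable name; "keyword" is the default.
    name = env.get("RAG_RETRIEVER", "keyword")
    return REGISTRY[name](docs)

docs = ["redis streams replay", "moe routing notes", "bm25 and dense retrieval"]
retriever = build_retriever(docs, env={"RAG_RETRIEVER": "keyword"})
top = retriever.retrieve("dense retrieval", k=1)
```

Because callers only see the `Retriever` interface, an A/B test comes down to starting two deployments with different values of the variable.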

💼 Experience

Role	Company	Period
🤖 Generative AI Engineer	Gama Security Systems, Dubai 🇦🇪	May 2023 to Present
💻 Full Stack Developer	Allianz Technology, India 🇮🇳	Apr 2021 to Apr 2023
🛠️ Junior Software Engineer	Infinite Open Source Solutions, India 🇮🇳	Jun 2019 to Apr 2021

🎓 Education and Certifications

🎓 M.Sc. in Computer Science, Chandigarh University, India. 2023 to 2025
📜 Supervised Machine Learning: Regression and Classification, from Stanford University.
📜 DevOps Beginners to Advanced, from Udemy.
📜 Microsoft .NET Fundamentals, from Microsoft.
📜 Programming with Python, from Harvard University.

📊 GitHub Analytics

🐍 Contribution Snake

🤝 Let's Connect

I am open to interesting AI and ML problems, and I am happy to talk about building with LLMs.

If you are hiring for AI engineering work, or you just want to compare notes, get in touch.
