Shreeda Bhat
Site Reliability Engineer & Platform Engineer · Kubernetes · AWS · GCP
./experience
- Moved 10 microservices to EKS with Karpenter for spot capacity. Set up blue-green deployments with Argo Rollouts. Nothing broke.
- Built out Datadog from scratch: APM, distributed traces, custom metrics, the works. Before this we were flying blind.
- Defined SLOs and error budgets. Ran k6 load tests at 200 req/s to actually validate them, not just write them down.
- Pushed GitOps with ArgoCD hard. CI/CD failures dropped 45%.
- Moved Kafka off managed and onto self-hosted Strimzi (TLS+SCRAM). Messaging costs dropped 30%.
- Kept 200+ microservices at 99.99% uptime with 5M+ users online simultaneously. There were some close calls.
- Moved 200+ microservices from AWS to multi-region GCP live, no maintenance window. 20% cheaper, 30% faster for India and Brazil.
- Wrote a Python controller that fed active session counts to KEDA so the autoscaler stopped evicting pods mid-game. Cloud spend dropped 25-40%.
- Ran an Elasticsearch cluster (10 master + 30 data nodes) eating 100TB of logs a day. Keeping that healthy was a full-time job.
- Set up ArgoCD app-of-apps with ApplicationSets, OPA/Gatekeeper, and RBAC across 3 regions. Deployments went from painful to boring, which was the goal.
- Built the game server provisioning platform from scratch: 32+ Terraform modules, Ansible, Packer. Spinning up a server now takes 15 minutes instead of a few hours.
- Built a BGP Anycast edge across 9 data centres. TTFB under 50ms for 1M+ storefronts. No GeoDNS tricks, just routing.
- Moved the whole platform off AWS to bare metal while it was live. Used BGP dual-announcement with BIRD so traffic shifted gradually. Bill went from $80K to $5K a month.
./skills
./projects
AI-powered QA automation platform — web dashboard, Chrome extension, Electron desktop app, VS Code extension and MCP server. Django + Celery backend, Remix frontend, fully Dockerised with CI/CD via Cloudflare tunnel.
Testing framework for Terraform infrastructure code — write unit tests for modules and catch drift before it hits prod.
Full ARR media stack running on my homelab — containerised, self-hosted, zero cloud spend.
Fully automated Mac setup via Ansible playbooks — new machine to fully configured in one command.
Full-featured recipe API built with Django REST Framework — auth, filtering, image uploads, fully Dockerised.
Git made simple for lazy developers — friendly CLI wrappers around common git workflows so you stop googling the same commands.
Exploring Kubernetes orchestration via Julia's kuber.jl — managing cluster resources programmatically from a Julia runtime.
Auth API built with Julia, Bukdu.jl and HTTP.jl — proving Julia isn't just for scientific computing.
VPC with public/private subnets across two AZs, security groups scoped between tiers — reference AWS network layout.
./testimonials
./on_the_web
Finally done with messy wires and pfsense setup pic.twitter.com/Mkl2pwqlFl
— Shreeda Bhat (@bhat_shreeda) April 28, 2025
Sunday is a funday!! Trying to learn OS level virtualization in lxd, and it's so fun pic.twitter.com/oce7gRglvX
— Shreeda Bhat (@bhat_shreeda) January 23, 2022
./systems
Python controller that watches active session counts and feeds them to KEDA as custom metrics. Stops pods from getting killed while users are mid-game. Cloud spend dropped 25-40%. Before this, the autoscaler was just guessing.
→ Read writeup
On-demand game server provisioning across GCP regions using 32+ Terraform modules, Ansible, and Packer. Used to take a few hours per server, now takes 15 minutes. Not glamorous work, but the team stopped waiting on infra.
→ Read writeup
Anycast edge across 9 data centres using BGP (BIRD). Same /24 announced from every PoP. Users hit their nearest server automatically. Migrated off AWS while traffic was running. $80K/month down to $5K. TTFB under 50ms everywhere.
→ Read writeup
./blog
How I moved a 1M+ storefront platform off AWS using BGP Anycast, BIRD, k3s, and MetalLB while traffic was live. The biggest cost cut I've ever shipped, by a lot.
Moving 200+ production services between clouds without dropping a request. The strategy, the tooling, a Terraform state nightmare, and what I'd do differently.
The default autoscaler was evicting pods mid-game. We wrote a Python controller that fed session counts to KEDA so it knew which pods were actually safe to kill. Cloud spend dropped 25-40%.
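A rough sketch of the session-aware scaling idea, not the production controller. It assumes the session count is exposed as an average-value metric, so the autoscaler wants roughly ceil(total sessions / target per pod) replicas, and that only pods with zero active sessions are safe to remove. The names `PodSessions`, `desired_replicas`, `safe_to_remove`, and the pod names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSessions:
    """Hypothetical record: one game server pod and its live session count."""
    name: str
    active_sessions: int


def desired_replicas(pods, target_per_pod, min_replicas=1):
    """Replica count implied by total sessions (assumed average-value metric)."""
    total = sum(p.active_sessions for p in pods)
    return max(min_replicas, math.ceil(total / target_per_pod))


def safe_to_remove(pods, target_per_pod, min_replicas=1):
    """Names of idle pods that can go without ending anyone's game.

    Never returns a pod that still has active sessions, even when the
    metric says the fleet is larger than it needs to be.
    """
    surplus = len(pods) - desired_replicas(pods, target_per_pod, min_replicas)
    idle = [p.name for p in pods if p.active_sessions == 0]
    return idle[:max(surplus, 0)]


if __name__ == "__main__":
    fleet = [
        PodSessions("lobby-a", 12),
        PodSessions("lobby-b", 0),
        PodSessions("lobby-c", 3),
        PodSessions("lobby-d", 0),
    ]
    # 15 sessions at 10 per pod -> 2 replicas wanted, so 2 idle pods can go.
    print(desired_replicas(fleet, target_per_pod=10))  # 2
    print(safe_to_remove(fleet, target_per_pod=10))    # ['lobby-b', 'lobby-d']
```

The point of splitting it this way: the metric decides how many pods to run, and the session counts decide which ones are allowed to disappear.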
./contact
Response time: usually within 24 hours