Senior Cloud Platform Engineer
Building GPU/AI infrastructure at scale · CNCF Golden Kubestronaut · Open Source · Researcher
Senior Cloud Platform Engineer at W.W. Grainger, Inc. with deep expertise in cloud-native GPU/AI infrastructure, Kubernetes ecosystems, and platform engineering. I build open-source tools for GPU workload autoscaling, observability, and topology-aware incident response.
Actively contributing to CNCF projects with 31+ merged PRs across 17+ projects. Published researcher with 20+ articles and papers on AI/ML infrastructure, Kubernetes, and platform engineering.
Recognized as a CNCF Golden Kubestronaut, a designation for professionals who have earned all CNCF Kubernetes and cloud-native certifications, not only the five Kubernetes ones. Community member of the Dragonfly project and active contributor to Volcano, KEDA, OpenTelemetry, and more.
Volcano GPU NUMA-aware scheduler (3-repo PR), KEDA GPU Scaler, Kube Topology Agent, Dragonfly Community Member, IEEE peer reviewer, HPSF & InfoQ speaker
OTel GPU Receiver, OpenTelemetry docs contributions, Kubernetes website docs PRs, published 6 peer-reviewed papers on AI/ML infrastructure
OpenColorIO release signing & Vulkan tests, OpenCue subscription recalculation, OpenImageIO bug fix, RAWtoACES docs, xSTUDIO links fix
Achieved all 16 CNCF certifications including CKS, CKA, CKAD, KCNA, KCSA plus 11 Golden-tier certs. Published first peer-reviewed papers on Kubernetes and zero-trust infrastructure
One of the elite professionals who have earned all CNCF Kubernetes and Cloud Native certifications — demonstrating comprehensive expertise across the entire cloud-native ecosystem. A highly selective professional designation held by fewer than 400 practitioners globally.
Certified Kubernetes Security Specialist
Certified Kubernetes Administrator
Certified Kubernetes Application Developer
Kubernetes & Cloud Native Associate
Kubernetes & Cloud Native Security Associate
Prometheus Certified Associate
Certified GitOps Associate
Certified Cilium Associate
Certified Argo Project Associate
Istio Certified Associate
Kyverno Certified Associate
OpenTelemetry Certified Associate
Cloud Native Platform Associate
Cloud Native Platform Engineer
Certified Backstage Associate
Linux Foundation Certified System Administrator
Elected via community governance vote — contributing to AI/ML model distribution, Helm charts, and dragonfly-injector
Active contributor across Volcano, Dragonfly, KEDA, Kubernetes, OpenTelemetry, and more
Contributing to HAMi (Heterogeneous AI Computing Virtualization Middleware) — GPU sharing and virtualization for Kubernetes
Recognized by Oracle for strong technical expertise and community contribution in cloud infrastructure and Kubernetes
Active contributor to CNCF foundation projects — 31+ PRs across 17+ repos
Cloud-native batch scheduling for AI/HPC
P2P file distribution & image acceleration
Production-grade container orchestration
Distributed transactional key-value database
Kubernetes event-driven autoscaling
Observability framework
Bare metal host provisioning for K8s
K8s-native packaging & resource management
Heterogeneous AI Computing Virtualization Middleware
20+ published articles and research papers on Cloud-Native, Kubernetes, AI/ML Operations, and Platform Engineering
Open source tools for GPU autoscaling, observability, and topology-aware infrastructure
Independent project that implements an event-driven GPU autoscaler on KEDA’s External gRPC Scaler interface. It reads GPU metrics natively through NVML, deploys as a DaemonSet, and ships pre-built scaling profiles for vLLM, Triton, and training workloads. Not yet merged into the KEDA core repository.
Referenced in KEDA #7538
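As a rough illustration of the scaling decision behind a GPU autoscaler like the one described above, the sketch below applies the proportional replica formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)) to average GPU utilization. It is not code from the project: the function name, parameters, and default replica bounds are hypothetical, and a real external scaler would report the metric to KEDA over gRPC rather than compute replicas itself.

```python
import math


def desired_replicas(
    current_replicas: int,
    gpu_util_percent: list[float],
    target_util_percent: float,
    min_replicas: int = 0,   # hypothetical default; 0 allows scale-to-zero
    max_replicas: int = 10,  # hypothetical default ceiling
) -> int:
    """HPA-style proportional scaling on average GPU utilization."""
    if target_util_percent <= 0:
        raise ValueError("target_util_percent must be positive")
    if current_replicas == 0 or not gpu_util_percent:
        # Nothing is running or no samples arrived: stay at the floor.
        return min_replicas
    average = sum(gpu_util_percent) / len(gpu_util_percent)
    raw = math.ceil(current_replicas * average / target_util_percent)
    # Clamp the proportional result to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Four replicas at 90% average utilization against a 60% target.
print(desired_replicas(4, [90, 90, 90, 90], 60))  # prints 6
```

The clamp at the end is where a scaling profile would differ per workload, for example a higher floor for latency-sensitive inference than for batch training.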
OpenTelemetry Collector receiver for NVIDIA GPU metrics. GPU utilization, memory, temperature via NVML. Standard OTLP export with built-in Prometheus exporter.
Kubernetes knowledge graph & automated root-cause analysis. Real-time resource topology, graph-based incident investigation, AlertManager webhook integration.
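To make the idea of graph-based incident investigation concrete, here is a minimal sketch that walks a resource dependency graph from an alerting resource and reports the unhealthy resources whose own dependencies are all healthy, which makes them the most likely origin of the failure. It is an illustration under assumptions, not the tool's implementation: the graph shape, resource names, and the root_cause_candidates helper are all hypothetical.

```python
from collections import deque


def root_cause_candidates(depends_on, unhealthy, alerting):
    """Breadth-first walk over dependency edges starting at `alerting`.

    Returns unhealthy resources that have no unhealthy dependencies,
    in the order they are discovered.
    """
    seen = {alerting}
    queue = deque([alerting])
    candidates = []
    while queue:
        node = queue.popleft()
        deps = depends_on.get(node, [])
        # An unhealthy node with only healthy dependencies is a likely origin.
        if node in unhealthy and not any(d in unhealthy for d in deps):
            candidates.append(node)
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates


# Hypothetical topology: each resource maps to the resources it depends on.
topology = {
    "ingress/web": ["svc/web"],
    "svc/web": ["pod/web-1"],
    "pod/web-1": ["node/gpu-a", "pvc/models"],
}
failing = {"ingress/web", "svc/web", "pod/web-1", "node/gpu-a"}

candidates = root_cause_candidates(topology, failing, "ingress/web")
print(candidates)  # prints ['node/gpu-a']
```

In this example the alert fires on the ingress, but the walk attributes it to the GPU node, because every other failing resource depends on something that is also failing.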
Kubernetes-native autoscaler for AI inference workloads. Custom scaling algorithms, GPU-focused policies, latency SLA enforcement, Prometheus metrics.
Comprehensive Kubernetes certification study guides covering all CNCF certifications. Interactive quizzes, flashcards, lab exercises, and PDF generation.
Industry publications, foundation blogs, and personal technical writing
How to eliminate GPU budget waste with KEDA external scalers — native NVML metrics, DaemonSet architecture, and scaling profiles for vLLM, Triton, and training workloads.
How P2P mesh architecture eliminates registry bottlenecks in enterprise CI/CD — Dragonfly's distributed caching, bandwidth optimization, and multi-datacenter image distribution at scale.
Why standard HPA fails for LLM inference — token-aware autoscaling, KV cache pressure, GPU memory headroom, and building KEDA scalers for production serving.
Why autonomous infrastructure needs execution guardrails — policy-as-code with Kyverno, blast-radius containment, and building trust boundaries for AI agents in production.
Multi-cluster Argo CD on Oracle Kubernetes Engine — ApplicationSets, cluster secrets, RBAC delegation, and progressive delivery patterns for enterprise GitOps.
Running Docker-based AI agents on Oracle Cloud — Docker Model Runner, GPU shapes, container orchestration, and agent sandboxing on OKE.
How to restore the Golden Path for ML engineers by pushing GPU scaling complexity down the stack — edge-native NVML telemetry, KEDA External Scaler architecture, and eliminating the Prometheus latency trap.
The models are ready but the pipes aren’t — how CI/CD pipelines, GPU scheduling, model distribution, and governance are killing enterprise AI deployments.
Implementing zero-trust security on Oracle Kubernetes Engine with Terraform — IAM policies, network security groups, workload identity, and confidential computing.
Formal verification of ArgoCD manifests — resource invariants, temporal logic, and rollback safety for mission-critical deployments.
Dynamic resource allocation, in-place vertical scaling, and immutability improvements in Kubernetes v1.35 for AI/ML workloads and FinOps.
How AI agents, eBPF, and LLMs are transforming SRE from reactive incident management to autonomous self-healing infrastructure.
Why internal developer platforms need to prioritize the Java ecosystem — bridging enterprise reality with platform engineering ideals.
How ArgoCD v3 evolves from a sync tool to the backbone of modern platform engineering — multi-tenancy, scalability, and GitOps at enterprise scale.
Journey from Java developer to earning all CNCF certifications — what it takes to unlearn and re-learn in the cloud-native world.
How Dragonfly’s P2P architecture accelerates large AI model downloads — HuggingFace and ModelScope integration.
Challenges of scaling agentic AI in telecom infrastructure — cost implications and architectural considerations for autonomous networks.
Conference presentations on cloud-native infrastructure, GitOps, and HPC
Pavan Madduri, W.W. Grainger
Applying ArgoCD, Kubernetes, and GitOps workflows to HPC environments — bridging the gap between cloud-native DevOps and scientific computing.
Pavan Madduri, W.W. Grainger
CI/CD pipelines, testing strategies, and automation for scientific and research software development — making open source science reproducible and maintainable.
Pavan Madduri (Grainger), Rohit Dhawan (Amazon), Alina Astapovich (Storytel), Goutham Rao (NeuBird) · Moderated by Renato Losio (InfoQ)
How AI agents and generative models are being used for incident detection, root cause analysis, and automated remediation — reducing MTTR and operational load at scale.
Quoted as a subject-matter expert across 11+ publications on enterprise AI, GPU infrastructure, cloud security, and platform engineering
“The real dependency risk comes from the orchestration, workflow and data integration layers built around them… Relying on third-party orchestration is where real lock-ins happen.”
“The architecture that works is a routing layer: simple tasks go to a lightweight SLM, complex reasoning escalates to the frontier model. You stop paying frontier prices for envelope-delivery workloads.”
“GPU capacity is genuinely hard to get right now… You can’t buy that institutional knowledge with a convertible note and a rebrand.”
“If an AI agent is trained purely by observing the official workflow in the ticketing platform, it’s learning a fantasy… You have to fence the AI in.”
“We enforce this with Policy-as-Code at the admission layer, so the agent’s available responses are constrained by the infrastructure itself, not by a governance doc that someone wrote once and nobody checks.”
“We are building autonomous agents without implementing Zero Trust security… Regulators must urgently pivot to regulating Agentic Privileges.”
CNCF GPU autoscaling blog post featured in a newsletter with 3M+ subscribers, one of the largest daily tech newsletters globally.
Russian-language adaptation of the CNCF blog post by VKTech (VK/Mail.ru Group), with 4,500+ views in the first 13 hours. International reach beyond English-speaking audiences.
Primary author — GPU autoscaling architecture, keda-gpu-scaler, and scale-to-zero for AI inference on Kubernetes.
Quoted on enterprise AI agent deployment challenges and the gap between documented processes and operational reality.
“Pavan Madduri breaks down how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly — cutting metric latency from 15–30s to 2–4s.”
204+ likes · 28 reposts · 3 comments
“See how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly, with scaling profiles for vLLM, Triton, and training workloads.”
2,122 views · 24 likes · 7 bookmarks
Featured across all 3 CNCF social platforms — LinkedIn, Twitter/X, and Bluesky.
“From public static void main to Golden Kubestronaut: The Art of Unlearning — Pavan Madduri shares his journey through all five Kubernetes certifications.”
26+ likes · 1 repost
Providing architectural feedback and early platform contributions for enterprise AI agents (e.g., Future AGI). Coordinating technical documentation and letters of support with key open-source project maintainers across CNCF foundations.
Serving as a technical peer reviewer and judge for international IEEE conferences and journals
Peer reviewing submissions on IoT architectures, edge computing, and distributed systems for one of IEEE’s highest-impact journals.
Reviewing research papers on AI systems, electrical engineering, and their intersection with cloud-native infrastructure.
Evaluating papers on cloud computing architectures, container orchestration, and scalable infrastructure design.
Reviewing research on network communications, distributed systems, and telecom infrastructure.
Evaluating submissions on computer networking, cloud infrastructure, and distributed computing systems.
Always open to connecting with fellow engineers in the cloud-native and AI/ML space