Pavan Madduri

Senior Cloud Platform Engineer

Building GPU/AI infrastructure at scale · CNCF Golden Kubestronaut · Open Source · Researcher

Golden Kubestronaut · CNCF Contributor · 20+ Publications

About Me

Senior Cloud Platform Engineer at W.W. Grainger, Inc. with deep expertise in cloud-native GPU/AI infrastructure, Kubernetes ecosystems, and platform engineering. I build open-source tools for GPU workload autoscaling, observability, and topology-aware incident response.

Active CNCF contributor with 31+ merged PRs across 17+ projects. Published researcher with 20+ articles and papers on AI/ML infrastructure, Kubernetes, and platform engineering.

Recognized as a CNCF Golden Kubestronaut, holding the five core Kubernetes certifications (CKA, CKAD, CKS, KCNA, KCSA) plus 11 additional Golden-tier certifications. Community member of the Dragonfly project and active contributor to Volcano, KEDA, OpenTelemetry, and more.

31+
Open Source PRs
17+
Projects Contributed
20+
Publications
16
CNCF Certifications

Open Source Journey

2026 — Present

GPU NUMA Topology & AI Infrastructure

Volcano GPU NUMA-aware scheduler (3-repo PR), KEDA GPU Scaler, Kube Topology Agent, Dragonfly Community Member, IEEE peer reviewer, HPSF & InfoQ speaker

Volcano · KEDA · Dragonfly · IEEE · HPSF

2025

Cloud-Native Observability & Platform Engineering

OTel GPU Receiver, OpenTelemetry docs contributions, Kubernetes website docs PRs, published 6 peer-reviewed papers on AI/ML infrastructure

OpenTelemetry · Kubernetes · Research

2024

GPU/AI Infrastructure Contributions

OpenColorIO release signing & Vulkan tests, OpenCue subscription recalculation, OpenImageIO bug fix, RAWtoACES docs, xSTUDIO links fix

OpenColorIO · OpenCue · OpenImageIO · RAWtoACES · xSTUDIO

2023

Golden Kubestronaut & Certification Journey

Achieved all 16 CNCF certifications including CKS, CKA, CKAD, KCNA, KCSA plus 11 Golden-tier certs. Published first peer-reviewed papers on Kubernetes and zero-trust infrastructure

CNCF · Kubestronaut · Certifications

Achievements

CNCF Golden Kubestronaut

One of the elite professionals who have earned all CNCF Kubernetes and Cloud Native certifications — demonstrating comprehensive expertise across the entire cloud-native ecosystem. A highly selective professional designation held by fewer than 400 practitioners globally.

Kubestronaut Core Certifications

CKS

Certified Kubernetes Security Specialist

CKA

Certified Kubernetes Administrator

CKAD

Certified Kubernetes Application Developer

KCNA

Kubernetes & Cloud Native Associate

KCSA

Kubernetes & Cloud Native Security Associate

Golden Kubestronaut Certifications

PCA

Prometheus Certified Associate

CGOA

Certified GitOps Associate

CCA

Certified Cilium Associate

CAPA

Certified Argo Project Associate

ICA

Istio Certified Associate

KCA

Kyverno Certified Associate

OTCA

OpenTelemetry Certified Associate

CNPA

Certified Cloud Native Platform Engineering Associate

CNPE

Certified Cloud Native Platform Engineer

CBA

Certified Backstage Associate

LFCS

Linux Foundation Certified System Administrator

Community Recognition

Dragonfly Community Member

Elected via community governance vote — contributing to AI/ML model distribution, Helm charts, and dragonfly-injector

CNCF Contributor

Active contributor across Volcano, Dragonfly, KEDA, Kubernetes, OpenTelemetry, and more

HAMi Contributor

Contributing to HAMi (Heterogeneous AI Computing Virtualization Middleware) — GPU sharing and virtualization for Kubernetes

Oracle ACE Associate

Recognized by Oracle for strong technical expertise and community contribution in cloud infrastructure and Kubernetes

Original Open Source Contributions

Active contributor to CNCF projects: 31+ PRs across 17+ repos

CNCF (Cloud Native Computing Foundation)

Volcano

Cloud-native batch scheduling for AI/HPC

Dragonfly

P2P file distribution & image acceleration

Community Member

Kubernetes

Production-grade container orchestration

  • #53891 Document deployment.kubernetes.io/* annotations
  • #53892 kubectl apply view-last-applied docs

TiKV

Distributed transactional key-value database

  • #19225 Add AGENTS.md for AI agent guidance

KEDA

Kubernetes event-driven autoscaling

  • keda-docs#1658 Remove deprecated metricName from docs
  • keda-docs#1769 Fix datadog scaler typos across all versions
  • #7538 GPU/AI inference scaler architectural analysis

OpenTelemetry

Observability framework

  • #8632 Add .NET troubleshooting page

Metal³

Bare metal host provisioning for K8s

  • #624 Fix redirect links in tryit.md

kpt

K8s-native packaging & resource management

  • #4278 Fix kpt fn doc for KRM functions

HAMi

Heterogeneous AI Computing Virtualization Middleware

  • #1893 Add unit tests for nvinternal info, mig, and watch packages

traceAI

Open-source LLM observability SDK

  • #165 Fix exporter shutdown and thread safety in Python SDK
  • #166 Add Go SDK with OpenAI instrumentor

Scholarly Articles & Publications

20+ published articles and research papers on Cloud-Native, Kubernetes, AI/ML Operations, and Platform Engineering

AI & Agentic Systems

1

AI Security: Preemptive Cybersecurity — Using AI Agents for Proactive Threat Hunting in Cloud-Native Environments

2

Agentic AI Introduction: Model Context Protocol (MCP) — Bridging LLMs and Real-Time Kubernetes Observability

3

Scale & LLM-Ops: Architecting LLM-as-a-Service — Infrastructure for High-Concurrency Agentic Workloads

SRE & Self-Healing Infrastructure

4

Agentic SRE Teams: Human-Agent Collaboration — A New Operational Model for Autonomous Incident Response

5

Autonomous Remediation: Reinforcement Learning for Self-Healing Infrastructure and Human-Agent Collaboration

6

From PagerDuty to ‘Agentic Ops’: The Rise of Self-Healing Kubernetes

Platform Engineering & GitOps

7

Platform Engineering Foundations: The IDP — Reducing Cognitive Load for Java Developers

8

GitOps & Stability: Formal Verification of ArgoCD Manifests — Preventing Deployment Drift

9

Beyond Basic Sync: Why ArgoCD v3 is the Backbone of Modern Platform Engineering

Kubernetes & Cloud Infrastructure

10

The Efficiency Era: How Kubernetes v1.35 Finally Solves the “Restart” Headache

11

FinOps: Predictive Autoscaling Using Time-Series Analysis to Reduce Cloud Waste in EKS Clusters

12

Zero-Trust Infrastructure: Automated Identity Governance in Kubernetes — Framework for Zero-Trust Microservices

13

Multi-Cluster Orchestration: Cross-Cluster Service Meshes in High-Traffic Retail Environments

Featured Projects

Open source tools for GPU autoscaling, observability, and topology-aware infrastructure

KEDA GPU Scaler · Independent Repository

Independent repository developing an event-driven GPU autoscaler using KEDA’s External gRPC Scaler interface. Native NVML metrics, DaemonSet deployment, pre-built scaling profiles for vLLM, Triton, and training workloads. Not yet merged into the KEDA core repository.

Go · gRPC · NVML · Kubernetes · Helm

Referenced in KEDA #7538
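
As a rough illustration of the scaling decision that a GPU metric like this feeds, here is a minimal Python sketch of the standard Kubernetes HPA replica formula, ceil(currentReplicas × currentMetric / targetMetric), applied to average GPU utilization. The profile names, target values, and the `desired_replicas` helper are hypothetical and are not taken from the scaler's repository, which is written in Go.

```python
import math

# Hypothetical scaling profiles (illustrative assumptions, not project defaults):
# target average GPU utilization in percent, plus replica bounds.
PROFILES = {
    "vllm": {"target_util": 70.0, "min_replicas": 0, "max_replicas": 8},
    "triton": {"target_util": 60.0, "min_replicas": 1, "max_replicas": 16},
    "training": {"target_util": 90.0, "min_replicas": 1, "max_replicas": 4},
}


def desired_replicas(profile: str, current_replicas: int, avg_gpu_util: float) -> int:
    """Apply ceil(current * metric / target), then clamp to the profile bounds."""
    p = PROFILES[profile]
    raw = math.ceil(current_replicas * avg_gpu_util / p["target_util"])
    return max(p["min_replicas"], min(p["max_replicas"], raw))
```

With these assumed numbers, two vLLM replicas averaging 95% utilization against a 70% target would scale to three, and an idle vLLM deployment would be allowed to scale to zero because its assumed minimum is 0.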

OpenTelemetry GPU Receiver

OpenTelemetry Collector receiver for NVIDIA GPU metrics. GPU utilization, memory, temperature via NVML. Standard OTLP export with built-in Prometheus exporter.

Go · OpenTelemetry · NVML · Prometheus

Kube Topology Agent

Kubernetes knowledge graph & automated root-cause analysis. Real-time resource topology, graph-based incident investigation, AlertManager webhook integration.

Go · Kubernetes API · Knowledge Graph · Helm
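
One way to picture graph-based incident investigation is a breadth-first walk over a resource dependency graph, starting from the resource that raised the alert. The Python sketch below is an assumption-laden illustration, not the agent's actual algorithm: the `DEPENDS_ON` topology, the resource names, and the `root_cause_candidates` helper are all hypothetical.

```python
from collections import deque

# Hypothetical topology: each resource maps to the resources it depends on.
DEPENDS_ON = {
    "Service/checkout": ["Deployment/checkout"],
    "Deployment/checkout": ["Pod/checkout-abc"],
    "Pod/checkout-abc": ["Node/gpu-node-1", "ConfigMap/checkout-config"],
    "Node/gpu-node-1": [],
    "ConfigMap/checkout-config": [],
}


def root_cause_candidates(alerting, depends_on, unhealthy):
    """BFS from the alerting resource.

    A resource is a candidate when it is unhealthy and none of its own
    dependencies are unhealthy, i.e. the failure does not propagate further down.
    """
    seen = {alerting}
    queue = deque([alerting])
    candidates = []
    while queue:
        resource = queue.popleft()
        deps = depends_on.get(resource, [])
        if resource in unhealthy and not any(d in unhealthy for d in deps):
            candidates.append(resource)
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates
```

In this toy graph, if the Service, Deployment, Pod, and Node are all marked unhealthy, the walk reports only the Node, since every other unhealthy resource has an unhealthy dependency beneath it.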

KubeAI Autoscaler

Kubernetes-native autoscaler for AI inference workloads. Custom scaling algorithms, GPU-focused policies, latency SLA enforcement, Prometheus metrics.

Go · Kubernetes · CRD · Helm

Golden Kubestronaut Learning

Comprehensive Kubernetes certification study guides covering all CNCF certifications. Interactive quizzes, flashcards, lab exercises, and PDF generation.

MkDocs · Python · Kubernetes · Education

Ingress2Gateway

Convert Kubernetes Ingress resources to Gateway API. Supports ALB, GCE, Nginx annotations with automated migration and validation.

Python · Kubernetes · Gateway API · Helm
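
The core of such a conversion can be sketched in a few lines of Python: each Ingress rule becomes an HTTPRoute whose path matches and backend references mirror the Ingress paths. This is a simplified sketch under stated assumptions, not the tool's implementation. The `ingress_to_httproutes` helper and the gateway name are hypothetical, controller annotations are ignored, only numeric service ports are handled, and `ImplementationSpecific` paths are assumed to fall back to `PathPrefix`.

```python
# Ingress pathType -> HTTPRoute path match type.
# Assumption: anything else (e.g. ImplementationSpecific) falls back to PathPrefix.
PATH_TYPE_MAP = {"Prefix": "PathPrefix", "Exact": "Exact"}


def ingress_to_httproutes(ingress: dict, gateway_name: str) -> list:
    """Build one HTTPRoute manifest (as a dict) per Ingress rule."""
    meta = ingress["metadata"]
    routes = []
    for index, rule in enumerate(ingress["spec"].get("rules", [])):
        http_rules = []
        for path in rule.get("http", {}).get("paths", []):
            service = path["backend"]["service"]
            http_rules.append({
                "matches": [{"path": {
                    "type": PATH_TYPE_MAP.get(path.get("pathType"), "PathPrefix"),
                    "value": path.get("path", "/"),
                }}],
                # Assumes a numeric port; named ports would need a Service lookup.
                "backendRefs": [{"name": service["name"],
                                 "port": service["port"]["number"]}],
            })
        route = {
            "apiVersion": "gateway.networking.k8s.io/v1",
            "kind": "HTTPRoute",
            "metadata": {"name": f"{meta['name']}-{index}",
                         "namespace": meta.get("namespace", "default")},
            "spec": {"parentRefs": [{"name": gateway_name}], "rules": http_rules},
        }
        if "host" in rule:
            route["spec"]["hostnames"] = [rule["host"]]
        routes.append(route)
    return routes
```

Emitting one route per rule keeps hostnames separate, which is a design choice made for this sketch; a real migration would also need to translate TLS settings and controller-specific annotations.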

Technical Expertise

Container Orchestration & GitOps

Kubernetes · ArgoCD · Docker · Crossplane · Helm · Flux

Cloud Platforms

AWS · Azure · EKS · EC2 · S3 · IAM

Observability

Prometheus · Grafana · OpenTelemetry · Splunk · Datadog

Policy & Security

Kyverno · OPA · Zero-Trust · RBAC · Network Policies

CI/CD

GitHub Actions · Jenkins · Flux · UrbanCode Deploy

Languages & Tools

Go · Python · Rust · Terraform · gRPC · Bash

GPU / AI Infrastructure

NVIDIA NVML · CUDA · vLLM · Triton · KEDA · Volcano

Big Data

PrestoDB · Trino · Apache Superset · Alluxio · Jupyter

Technical Writing

Industry publications, foundation blogs, and personal technical writing

Conferences & Speaking

Conference presentations on cloud-native infrastructure, GitOps, and HPC

HPSF Conference 2026 · Productivity, Performance & the HPC Pipeline

GitOps for HPC: Bringing Cloud-Native DevOps Practices to High Performance Computing Environments

Pavan Madduri, W.W. Grainger

Applying ArgoCD, Kubernetes, and GitOps workflows to HPC environments — bridging the gap between cloud-native DevOps and scientific computing.

Chicago River Ballroom A-D · Intermediate

HPSF Conference 2026 · Building & Sustaining Community

DevOps for Scientific Software: Tools, Practices, and Automation Strategies

Pavan Madduri, W.W. Grainger

CI/CD pipelines, testing strategies, and automation for scientific and research software development — making open source science reproducible and maintainable.

Chicago River Ballroom A-D · Beginner

InfoQ Live · Apr 21, 2026 · Roundtable Panel

AI-Powered SRE for Autonomous Incident Response

Pavan Madduri (Grainger), Rohit Dhawan (Amazon), Alina Astapovich (Storytel), Goutham Rao (NeuBird) · Moderated by Renato Losio (InfoQ)

How AI agents and generative models are being used for incident detection, root cause analysis, and automated remediation — reducing MTTR and operational load at scale.

Online · Panel Discussion

Media Mentions & Expert Commentary

Quoted as a subject-matter expert across 11+ publications on enterprise AI, GPU infrastructure, cloud security, and platform engineering

AI Business · Informa PLC

OpenAI vs. Anthropic vs. Google: But the Model Isn’t the Point

“The real dependency risk comes from the orchestration, workflow and data integration layers built around them… Relying on third-party orchestration is where real lock-ins happen.”

VKTR · Simpler Media Group

Enterprise AI Costs Climb as GPU Demand Outpaces Supply

“The architecture that works is a routing layer: simple tasks go to a lightweight SLM, complex reasoning escalates to the frontier model. You stop paying frontier prices for envelope-delivery workloads.”

Techopedia · 10M+ monthly visitors

AI Experts Call for a Reality Check on Allbirds’ Pivot

“GPU capacity is genuinely hard to get right now… You can’t buy that institutional knowledge with a convertible note and a rebrand.”

Reworked · Simpler Media Group

AI Agents and the Process Documentation Fallacy

“If an AI agent is trained purely by observing the official workflow in the ticketing platform, it’s learning a fantasy… You have to fence the AI in.”

InfoSec Relations · Cybersecurity

Agentic AI is Exposing the Accountability Gap in Cloud Security

“We enforce this with Policy-as-Code at the admission layer, so the agent’s available responses are constrained by the infrastructure itself, not by a governance doc that someone wrote once and nobody checks.”

Tech Round UK International

Meta Acquires Moltbook: What Responsibility Do Regulators Have?

“We are building autonomous agents without implementing Zero Trust security… Regulators must urgently pivot to regulating Agentic Privileges.”

TLDR Newsletter · 3M+ subscribers

Featured Mention

CNCF GPU autoscaling blog featured to 3M+ subscribers — one of the largest daily tech newsletters globally.

Habr (VKTech) · 10M+ visitors

GPU Auto-Scaling on Kubernetes with KEDA

Russian-language adaptation of CNCF blog by VKTech (VK/Mail.ru Group) — 4,500+ views in first 13 hours. International reach beyond English-speaking audience.

Cloud Native Now · Techstrong Group

Stop Wasting GPU Budget: Autoscaling AI Inference with KEDA

Primary author — GPU autoscaling architecture, keda-gpu-scaler, and scale-to-zero for AI inference on Kubernetes.

Y Square Technology · Tech Analysis

AI Agent Documentation Reality Gap

Quoted on enterprise AI agent deployment challenges and the gap between documented processes and operational reality.

CNCF Official Recognition

CNCF LinkedIn · 500K+ followers

GPU Autoscaling with KEDA

“Pavan Madduri breaks down how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly — cutting metric latency from 15–30s to 2–4s.”
204+ likes · 28 reposts · 3 comments

CNCF Twitter/X · @CloudNativeFdn

GPU Autoscaling with KEDA

“See how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly, with scaling profiles for vLLM, Triton, and training workloads.”
2,122 views · 24 likes · 7 bookmarks

CNCF Bluesky · cncf.io

GPU Autoscaling with KEDA

Featured across all 3 CNCF social platforms — LinkedIn, Twitter/X, and Bluesky.

CNCF LinkedIn · 500K+ followers

Golden Kubestronaut Journey

“From public static void main to Golden Kubestronaut: The Art of Unlearning — Pavan Madduri shares his journey through all five Kubernetes certifications.”
26+ likes · 1 repost

Industry Advisory & Collaboration

Providing architectural feedback and early platform contributions for enterprise AI agents (e.g., Future AGI). Coordinating technical documentation and letters of support with key open-source maintainers across CNCF projects.

Judging & Peer Review

Serving as a technical peer reviewer and judge for international IEEE conferences and journals

IEEE · Technical Peer Reviewer & Judge

IEEE Internet of Things Journal (IoT)

Peer reviewing submissions on IoT architectures, edge computing, and distributed systems for one of IEEE’s highest-impact journals.

IEEE AIEEE 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on AI & Electrical Engineering (AIEEE 2026)

Reviewing research papers on AI systems, electrical engineering, and their intersection with cloud-native infrastructure.

IEEE CloudCOM 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Cloud Computing Technology & Science (CloudCOM 2026)

Evaluating papers on cloud computing architectures, container orchestration, and scalable infrastructure design.

IEEE COMM 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Communications (COMM 2026)

Reviewing research on network communications, distributed systems, and telecom infrastructure.

IEEE ICCCN 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Computer Communications & Networks (ICCCN 2026)

Evaluating submissions on computer networking, cloud infrastructure, and distributed computing systems.

Let's Connect

Always open to connecting with fellow engineers in the cloud-native and AI/ML space