Kubernetes manifests and test scripts for trying new open-weight LLMs on GPU clusters.
Each folder is one model and contains everything you need to pull weights, deploy an inference server, and smoke-test reasoning and tool calling via curl. The current inference backend is vLLM; SGLang and TensorRT-LLM variants may land in the same folder structure where useful.
```
modelyard/
└── <model-name>/
    ├── README               # run instructions, flag reference, test results, caveats
    ├── pvc.yaml             # PersistentVolumeClaim for model weights
    ├── model-download.yaml  # one-shot Job that pulls weights from HF
    ├── <model>.yaml         # Deployment + Service (recommended aggregated config)
    ├── <model>-disagg.yaml  # optional disaggregated prefill/decode variant
    └── script.sh            # curl-based smoke tests (reasoning + tool calling)
```
```bash
cd <model-name>

# 1. Set your storage class in pvc.yaml, then create the PVC and download weights:
kubectl apply -f pvc.yaml
kubectl apply -f model-download.yaml

# 2. Deploy the server once weights are downloaded:
kubectl apply -f <model>.yaml

# 3. Port-forward and run the smoke tests:
kubectl port-forward pod/<pod-name> 8000:8000
./script.sh
```

See each folder's own README for model-specific flags, expected resources, and any quirks worth knowing about.
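For step 1 above, a minimal sketch of the `pvc.yaml` / `model-download.yaml` pair, assuming a Hugging Face download via `huggingface-cli`; the PVC name, storage class, size, and image below are illustrative placeholders, not values from any model folder:

```yaml
# pvc.yaml -- illustrative; set storageClassName and size for your cluster
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights              # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-nvme      # placeholder storage class
  resources:
    requests:
      storage: 1Ti                 # sized to fit the model's weights
---
# model-download.yaml -- illustrative one-shot download Job
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: download
          image: python:3.11-slim  # placeholder image
          command: ["sh", "-c"]
          args:
            - pip install -q huggingface_hub &&
              huggingface-cli download <hf-org>/<model> --local-dir /models
          volumeMounts:
            - name: weights
              mountPath: /models
      volumes:
        - name: weights
          persistentVolumeClaim:
            claimName: model-weights   # must match the PVC above
```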
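For step 2, a hedged sketch of the Deployment + Service shape the `<model>.yaml` files follow; the image tag, flags, and GPU count are assumptions for illustration, so read the real manifest in each folder rather than this one:

```yaml
# <model>.yaml -- illustrative shape only; real manifests carry model-specific flags
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  strategy:
    type: Recreate                   # see the notes below on single-replica GPU workloads
  selector:
    matchLabels: {app: vllm-server}
  template:
    metadata:
      labels: {app: vllm-server}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # placeholder tag; pin a real version
          args: ["--model", "/models", "--tensor-parallel-size", "4"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4      # assumes the NVIDIA device plugin
          volumeMounts:
            - name: weights
              mountPath: /models
      volumes:
        - name: weights
          persistentVolumeClaim:
            claimName: model-weights
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm-server
  ports:
    - port: 8000
      targetPort: 8000
```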
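For step 3, the kind of request `script.sh` exercises: an OpenAI-compatible chat completion with a toy tool definition. The model name and tool schema are placeholders, and tool calling may need extra vLLM server flags (e.g. a tool-call parser), so treat each folder's script as authoritative:

```bash
# Illustrative smoke test against the port-forwarded server.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<hf-org>/<model>",
    "messages": [{"role": "user", "content": "What is the weather in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```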
| Folder | Model | Backend | Notes |
|---|---|---|---|
| deepseek-v4-flash | deepseek-ai/DeepSeek-V4-Flash (284B MoE) | vLLM | Aggregated (4× H100/H200) or disaggregated prefill/decode |
- Not production-ready. Expect to tune resource requests, probes, and strategies for your cluster.
- GPU assignment assumes the NVIDIA k8s device plugin. Time-slicing / MIG configurations may need extra tweaks.
- Recommended deployment strategy for single-replica GPU-heavy workloads is `Recreate` (see the sketch below): a rolling update can leave two pods fighting over the same GPUs.
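A minimal sketch of the relevant stanza, assuming a standard apps/v1 Deployment like the one in each `<model>.yaml`:

```yaml
# Recreate tears the old pod down (freeing its GPUs) before the new one schedules.
spec:
  replicas: 1
  strategy:
    type: Recreate
```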