Skills to guide Claude Code, Codex, and other coding agents on using the Weights & Biases AI developer platform to train models and build agents.
- Log metrics and rich media during model training and fine-tuning
- Track model training experiments
- Analyze runs and experiment results to understand how the model is learning
- Tune hyperparameters
- Trace agentic AI applications
- Analyze traces and classify them into failure modes
- Evaluate models with labeled datasets
- Run online evaluations for production monitoring
npx skills add wandb/skills
Then set your W&B API key:
export WANDB_API_KEY=<your-key>
`npx skills` is a utility for installing skills into major coding agent CLIs. Use `--global` to install for all projects, or `--agent <name>` to target a specific agent. See the npx skills docs for more details.
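If the install needs to be scripted, for example in a project setup step, the command line can be assembled programmatically. The following is a minimal sketch and not part of this repository: the helper names `build_install_command` and `api_key_is_set` are hypothetical, and the sketch uses only the `--global` and `--agent <name>` flags described above. Whether the two flags can be combined is not stated here, so treat that as an assumption. The code only builds and prints the command; it does not run `npx`.

```python
import os
import shlex


def build_install_command(package="wandb/skills", global_install=False, agent=None):
    """Return the argv list for `npx skills add` (hypothetical helper)."""
    argv = ["npx", "skills", "add", package]
    if global_install:
        argv.append("--global")  # install for all projects
    if agent is not None:
        argv.extend(["--agent", agent])  # target one specific agent
    return argv


def api_key_is_set(environ=None):
    """Check that WANDB_API_KEY is present and non-empty."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("WANDB_API_KEY"))


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(shlex.join(build_install_command(global_install=True)))
    if not api_key_is_set():
        print("WANDB_API_KEY is not set; export it before using the skills.")
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.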
| Skill | Description | Status |
|---|---|---|
| `wandb-primary` | Primary W&B skill for broad, mixed-surface W&B project analysis and workflows across runs, Weave, Reports, Signal Builder, and Launch. | experimental |
We maintain Skill Bench in this repository to evaluate public skill changes across coding agents and task categories. Skill Bench uses W&B Agent Factory as the eval runtime for task definitions, agent profiles, sandbox execution, and structured bench rows.
Pull requests run package validation by default. A maintainer can trigger live Skill Bench runs for larger changes.
Plan a local benchmark without model calls:
python3 -m skillbench.cli plan \
--wbaf-root ../WandBAgentFactory \
--candidate-ref HEAD \
--skill wandb-primary
| Category | Tasks | Claude Code (sonnet4.6) | Codex (gpt-5.3-codex) |
|---|---|---|---|
| Weave analysis | 26 | 97%* | 63%* |
| Weave tooling | 11 | 95%* | 83%* |
| Model training | 8 | 90%* | 85%* |
| LLM finetuning & RL analysis | 14 | 72%* | 86%* |
| Failure & outlier detection | 8 | 86%* | 63%* |
*Pass rates are +/- 3%. Many tasks span multiple categories.
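One way to read the table with the stated margin in mind is to treat each pass rate as an interval of plus or minus 3 points and flag the categories where the two agents' intervals overlap. The sketch below is an illustration and not part of Skill Bench: the interval reading of the margin and the helper name `compare` are assumptions, and the numbers are copied from the table above.

```python
MARGIN = 3  # percentage points, from the table footnote

# (category, Claude Code pass rate %, Codex pass rate %), copied from the table
RESULTS = [
    ("Weave analysis", 97, 63),
    ("Weave tooling", 95, 83),
    ("Model training", 90, 85),
    ("LLM finetuning & RL analysis", 72, 86),
    ("Failure & outlier detection", 86, 63),
]


def compare(claude, codex, margin=MARGIN):
    """Name the leading agent, or 'within margin' if the intervals overlap.

    Assumption: each rate is read as [rate - margin, rate + margin], so two
    rates count as separated only when they differ by more than 2 * margin.
    """
    if abs(claude - codex) <= 2 * margin:
        return "within margin"
    return "Claude Code" if claude > codex else "Codex"


if __name__ == "__main__":
    for category, claude, codex in RESULTS:
        print(f"{category}: {compare(claude, codex)}")
```

Under this reading, only the Model training row (90% vs. 85%) falls within the margin. Because many tasks span multiple categories, the rows are not independent, so the sketch deliberately avoids computing a task-weighted overall average.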
See CONTRIBUTING.md.