google · copybara-service · Jun 4, 2026
diff --git a/skills/cloud/agent-platform-eval-flywheel/SKILL.md b/skills/cloud/agent-platform-eval-flywheel/SKILL.md
@@ -1,14 +1,14 @@
 ---
 name: agent-platform-eval-flywheel
 description: >-
-  Measure and improve the quality of AI models and agents on Google Cloud
+  Measures and improves the quality of AI models and agents on Google Cloud
   using the Eval Quality Flywheel methodology. Use when evaluating an agent or
   model, building an eval dataset, picking or writing evaluation metrics,
   analyzing failures, comparing results before and after a fix, or when
   guidance is needed on Agent Platform eval methodology — including
   dataset schema, LLM-as-judge scoring, and common failure causes. For
-  fine-tuning, use agent-platform-tuning. For deployment, use
-  agent-platform-deploy.
+  fine-tuning, use agent-platform-tuning. For general production deployment,
+  use agent-platform-deploy.
 ---
 
 # Agent Platform Eval Flywheel Skill
@@ -25,6 +25,11 @@ the Agent Platform GenAI Evaluation SDK (`google.genai` / `agentplatform`).
 -   Selecting, configuring, or writing custom evaluation metrics.
 -   Analyzing rubric verdicts, loss patterns, and clustering failures.
 -   Suggesting concrete code/prompt improvements based on eval results.
+-   Evaluating a model served on an Agent Platform **endpoint** (BYOM) or a
+    **Model-as-a-Service (MaaS)** model by ID — including deploying the model
+    first if needed. For this case, follow
+    [references/deployment.md](references/deployment.md) and use the
+    `endpoint_evaluation.py` / `maas_evaluation.py` scripts.
 
 ## Setup
 
@@ -348,10 +353,12 @@ patterns: synthetic data generation, pairwise comparison,
 
 ## Bundled scripts
 
-| Script                  | When to use                                                                          |
-| ----------------------- | ------------------------------------------------------------------------------------ |
-| `validate_dataset.py`   | Before Stage 3 — catch malformed `EvaluationDataset` JSON.                           |
-| `parse_adk_traces.py`   | Stage 1 — convert ADK session dumps to the canonical dataset shape.                  |
-| `inspect_results.py`    | Stages 3/4 — render summary + per-case scores. `--save-html` for a browsable report. |
-| `compare_results.py`    | Stage 5 — diff baseline vs. candidate, detect regressions.                           |
-| `render_html_report.py` | Render HTML from a saved result JSON or loss-clusters JSON.                          |
+Script                   | When to use
+------------------------ | -----------
+`validate_dataset.py`    | Before Stage 3 — catch malformed `EvaluationDataset` JSON.
+`parse_adk_traces.py`    | Stage 1 — convert ADK session dumps to the canonical dataset shape.
+`inspect_results.py`     | Stages 3/4 — render summary + per-case scores. `--save-html` for a browsable report.
+`compare_results.py`     | Stage 5 — diff baseline vs. candidate, detect regressions.
+`render_html_report.py`  | Render HTML from a saved result JSON or loss-clusters JSON.
+`endpoint_evaluation.py` | Stages 2/3 against a deployed Agent Platform endpoint (BYOM). See [references/deployment.md](references/deployment.md).
+`maas_evaluation.py`     | Stages 2/3 against a Model-as-a-Service model by ID. See [references/deployment.md](references/deployment.md).
diff --git a/skills/cloud/agent-platform-eval-flywheel/references/deployment.md b/skills/cloud/agent-platform-eval-flywheel/references/deployment.md
@@ -0,0 +1,201 @@
+# Deploying and Evaluating Models on Agent Platform Endpoints
+
+Reference for the deployment-and-endpoint-evaluation subset of the Quality
+Flywheel. Use this guide when the user needs to evaluate a model that is **not
+yet served** (must be deployed first) or is served on an Agent Platform
+**endpoint** rather than called through the `agentplatform` Python client
+directly.
+
+The two scripts referenced here — `scripts/endpoint_evaluation.py` and
+`scripts/maas_evaluation.py` — wrap Stage 2 (inference) and Stage 3 (grading) of
+the Flywheel against deployed endpoints. They produce the same
+`EvaluationResult` shape consumed by `scripts/inspect_results.py` and
+`scripts/compare_results.py`, so the rest of the Flywheel (Stages 4 and 5)
+applies unchanged.
+
+For the public-docs version of this workflow, see the
+[Agent Platform model evaluation docs](https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/eval-python-sdk/run-evaluation).
+
+## When to use this guide
+
+-   The user wants to evaluate a **custom-weights / Bring-Your-Own-Model
+    (BYOM)** model deployed to an Agent Platform endpoint.
+-   The user wants to evaluate a **Model-as-a-Service (MaaS)** model (e.g.
+    `meta/llama3-8b`, `gemini-1.5-pro`) by model ID.
+-   The user is at the deploy-then-eval stage: weights exist in GCS or in Model
+    Garden but no endpoint has been provisioned yet.
+
+For agent (multi-turn) evaluation, dataset preparation, metric selection,
+failure clustering, or iteration, stay in the main SKILL.md flow. This guide
+only covers the deployment + endpoint-inference subset.
+
+## Setup
+
+Install the deployment-eval dependencies (in addition to the base SDK install in
+the main SKILL.md):
+
+```bash
+python3 -m venv .model-eval
+source .model-eval/bin/activate
+pip install google-cloud-aiplatform[evaluation]>=1.154.0 google-genai>=1.0.0 requests
+```
+
+The scripts also shell out to `gcloud` and `gsutil`, so the user must have the
+Google Cloud SDK installed and `gcloud auth application-default login`
+completed.
+
+Confirm project / region / quota project:
+
+```bash
+gcloud config get project
+gcloud config get compute/region
+gcloud config get billing/quota_project
+```
+
+## 1. Deploy a Model
+
+> [!TIP]
+>
+> If the `agent-platform-deploy` skill is available, prefer it — it carries the
+> full deployment surface. This section is the minimum needed to get an endpoint
+> to evaluate against.
+
+```bash
+gcloud ai model-garden models deploy \
+    --project=$PROJECT_ID \
+    --region=$LOCATION_ID \
+    --model=$MODEL \
+    --machine-type=$MACHINE_TYPE \
+    --accelerator-type=$ACCELERATOR_TYPE \
+    --accelerator-count=$ACCELERATOR_COUNT \
+    --endpoint-display-name=$ENDPOINT_NAME \
+    --asynchronous
+```
+
+`MODEL` is either a Model Garden model ID (`meta/llama3-8b`) or a GCS URL to
+custom weights. If any of `MACHINE_TYPE`, `ACCELERATOR_TYPE`,
+`ACCELERATOR_COUNT` are missing, **stop and ask the user** — do not pick
+defaults silently. Confirm the full command with the user before running it.
+
+### Determining hardware requirements
+
+For custom weights stored in GCS, check for a tuning marker first:
+
+```bash
+gcloud storage cat $GCS_URL/managed_oss_fine_tuning_marker.json
+gcloud storage cat $GCS_URL/config.json
+```
+
+If neither is present, ask the user for the base model ID, then query
+recommended configs:
+
+```bash
+gcloud ai model-garden models list-deployment-config --model=$BASE_MODEL_ID
+```
+
+Summarize the recommended machine + accelerator combinations and confirm with
+the user before deploying.
+
+## 2. Check Deploy Status
+
+Deploys are asynchronous. Poll the operation:
+
+```bash
+gcloud ai operations describe $OPERATION_ID --region=$LOCATION_ID
+```
+
+Report the status to the user — including the "not found" case — without asking
+for confirmation first.
+
+## 3. Run Inference and Evaluation
+
+> [!WARNING]
+>
+> Both scripts use LLM-as-a-judge metrics. Tell the user up-front: **this can
+> take ~30 minutes per 50 samples**.
+
+If the scripts' imports (`google.cloud.aiplatform.agentplatform`) fail in the
+local environment, do **not** loop on `pip install` — fail fast and tell the
+user to install the deps from the Setup section above.
+
+Dataset format: JSONL in GCS with a `prompt` field on each line. See
+[dataset_schema.md](dataset_schema.md) for the per-metric column requirements.
+If the user's dataset doesn't have a `prompt` column, reformat it before
+running.
+
+Recommended starter metrics (confirm with the user first):
+
+-   `coherence` — does the response hold together logically?
+-   `fluency` — grammatical correctness and natural flow.
+-   `text_quality` — overall text quality.
+
+For the full metric catalog, see [metric_registry.md](metric_registry.md).
+
+Suggest an experiment name based on `${MODEL_ID}_${DATASET_NAME}` if the user
+doesn't provide one.
+
+### 3.1. Bring-Your-Own-Model (BYOM) Endpoint
+
+Inspect the endpoint first to confirm it exists and to capture the
+`dedicatedEndpointDns` (if dedicated) for the `--dedicated_endpoint_dns` flag:
+
+```bash
+gcloud ai endpoints describe $ENDPOINT_ID --region=$LOCATION_ID --project=$PROJECT_ID
+```
+
+If the endpoint ID or the dataset bucket doesn't exist, **stop and ask the
+user** for a valid value — don't propose a confirmation prompt with broken
+inputs.
+
+```bash
+python3 scripts/endpoint_evaluation.py \
+    --project_id=$PROJECT_ID \
+    --location=$LOCATION_ID \
+    --endpoint_id=$ENDPOINT_ID \
+    --dataset=gs://your-bucket/eval.jsonl \
+    --metrics "GENERAL_QUALITY"
+```
+
+With a dedicated DNS:
+
+```bash
+python3 scripts/endpoint_evaluation.py \
+    --project_id=$PROJECT_ID \
+    --location=$LOCATION_ID \
+    --endpoint_id=$ENDPOINT_ID \
+    --dataset=gs://your-bucket/eval.jsonl \
+    --dedicated_endpoint_dns=$DEDICATED_ENDPOINT_DNS \
+    --metrics "GENERAL_QUALITY"
+```
+
+### 3.2. Model-as-a-Service (MaaS)
+
+For MaaS models, no endpoint deploy is needed — call the model by ID:
+
+```bash
+python3 scripts/maas_evaluation.py \
+    --project_id=$PROJECT_ID \
+    --location=$LOCATION_ID \
+    --model_id=$MAAS_MODEL_ID \
+    --dataset=gs://your-bucket/eval.jsonl \
+    --metrics "GENERAL_QUALITY"
+```
+
+Same input-validation rule as 3.1: if the bucket doesn't resolve, stop and ask
+the user.
+
+## 4. Feed Results Back Into the Flywheel
+
+Both scripts print `summary_metrics` and the first 5 rows of `metrics_table`. To
+persist the full result for Stage 4/5 analysis, capture the returned
+`eval_result` and serialize it the same way as the main SKILL.md "Always persist
+the result" block. Then:
+
+-   `scripts/inspect_results.py --failing-only` for per-case triage.
+-   `scripts/compare_results.py --baseline <prev> --candidate <new>` after a fix
+    to confirm the target metric improved and nothing regressed.
+
+The deployment workflow is only Stages 1–3 of the Flywheel. Stages 4 (Analyze
+Failures) and 5 (Optimize & Iterate) work identically whether the result came
+from a deployed endpoint, a MaaS model, or a direct `client.evals.evaluate(...)`
+call.