Skip to content

Latest commit

 

History

History
127 lines (92 loc) · 2.91 KB

File metadata and controls

127 lines (92 loc) · 2.91 KB

Model Case Repros

Small reproducible cases for an OpenAI-compatible chat completions endpoint.

Set BASE_URL and MODEL in your environment before running. If the endpoint requires auth, also set API_KEY. Do not commit endpoint URLs, model IDs, or credentials if they are not meant to be public.

Quick Repro

Run all cases:

BASE_URL="https://example.com/v1" \
MODEL="your-model-id" \
python3 run_cases.py

Run with thinking enabled and a shared token budget:

BASE_URL="https://example.com/v1" \
MODEL="your-model-id" \
ENABLE_THINKING=true \
MAX_TOKENS=512 \
python3 run_cases.py

See observed_failures.json for structured examples of failure responses captured during local testing.

Notable Failures Observed

Chinese Short Math: Empty Output

Prompt:

你好,2+3等于几?只回答数字。

Expected:

5

Observed in one run:

{
  "content": "",
  "prompt_tokens": 17,
  "completion_tokens": 1,
  "total_tokens": 18,
  "finish_reason": "stop"
}

Chinese Short Math: Garbled Output

Same prompt and request settings also produced a garbled completion in another run:

!5.83和1:h3+83 are4-01, Campbell.02.t2 More2 but7 .. ...0;0.531678838 or5确切6659???? ))

7e5-at-O5434059 +5 user

The response used all max_tokens=64 and ended with finish_reason=length.

Thinking Enabled: Small Budget Produces No Final Content

With:

{
  "temperature": 0,
  "max_tokens": 64,
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}

Both short math prompts repeatedly spent the full completion budget on the reasoning field and returned no final content:

{
  "content": null,
  "finish_reason": "length",
  "completion_tokens": 64
}

The reasoning text often contained the correct computation, but the final answer was never emitted before the token budget ended.

Thinking Enabled: Larger Budgets Still Intermittently Fail

For the Chinese short math prompt, increasing max_tokens helped but did not fully eliminate failures until the largest tested budget:

enable_thinking=true, temperature=0

case           max_tokens  correct  null_content  length_finish
zh_math_short  512         3/5      1             1
en_math_short  512         5/5      0             0
zh_math_short  1024        3/5      1             0
en_math_short  1024        5/5      0             0
zh_math_short  2048        5/5      0             0
en_math_short  2048        5/5      0             0

Observed failure modes at 512/1024 included:

  • content=null while the reasoning field contained the answer.
  • garbled content despite temperature=0.
  • completions running to finish_reason=length.

Throughput Observation

Longer generation cases were mostly stable and measured roughly 39 completion tokens/sec in local tests. Short completions should not be used for throughput because fixed request latency dominates.