Model Case Repros

Small reproducible cases for an OpenAI-compatible chat completions endpoint.

Set BASE_URL and MODEL in your environment before running. If the endpoint requires auth, also set API_KEY. Do not commit endpoint URLs, model IDs, or credentials if they are not meant to be public.

Quick Repro

Run all cases:

BASE_URL="https://example.com/v1" \
MODEL="your-model-id" \
python3 run_cases.py

Run with thinking enabled and a shared token budget:

BASE_URL="https://example.com/v1" \
MODEL="your-model-id" \
ENABLE_THINKING=true \
MAX_TOKENS=512 \
python3 run_cases.py

See observed_failures.json for structured examples of failure responses captured during local testing.

Notable Failures Observed

Chinese Short Math: Empty Output

Prompt:

你好，2+3等于几？只回答数字。

Expected:

Observed in one run:

{
  "content": "",
  "prompt_tokens": 17,
  "completion_tokens": 1,
  "total_tokens": 18,
  "finish_reason": "stop"
}

Chinese Short Math: Garbled Output

Same prompt and request settings also produced a garbled completion in another run:

!5.83和1:h3+83 are4-01, Campbell.02.t2 More2 but7 .. ...0;0.531678838 or5确切6659???? ))

7e5-at-O5434059 +5 user

The response used all max_tokens=64 and ended with finish_reason=length.

Thinking Enabled: Small Budget Produces No Final Content

With:

{
  "temperature": 0,
  "max_tokens": 64,
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}

Both short math prompts repeatedly spent the full completion budget on the reasoning field and returned no final content:

{
  "content": null,
  "finish_reason": "length",
  "completion_tokens": 64
}

The reasoning text often contained the correct computation, but the final answer was never emitted before the token budget ended.

Thinking Enabled: Larger Budgets Still Intermittently Fail

For the Chinese short math prompt, increasing max_tokens helped but did not fully eliminate failures until the largest tested budget:

enable_thinking=true, temperature=0

case           max_tokens  correct  null_content  length_finish
zh_math_short  512         3/5      1             1
en_math_short  512         5/5      0             0
zh_math_short  1024        3/5      1             0
en_math_short  1024        5/5      0             0
zh_math_short  2048        5/5      0             0
en_math_short  2048        5/5      0             0

Observed failure modes at 512/1024 included:

content=null while the reasoning field contained the answer.
garbled content despite temperature=0.
completions running to finish_reason=length.

Throughput Observation

Longer generation cases were mostly stable and measured roughly 39 completion tokens/sec in local tests. Short completions should not be used for throughput because fixed request latency dominates.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
cases.json		cases.json
observed_failures.json		observed_failures.json
run_cases.py		run_cases.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Model Case Repros

Quick Repro

Notable Failures Observed

Chinese Short Math: Empty Output

Chinese Short Math: Garbled Output

Thinking Enabled: Small Budget Produces No Final Content

Thinking Enabled: Larger Budgets Still Intermittently Fail

Throughput Observation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Model Case Repros

Quick Repro

Notable Failures Observed

Chinese Short Math: Empty Output

Chinese Short Math: Garbled Output

Thinking Enabled: Small Budget Produces No Final Content

Thinking Enabled: Larger Budgets Still Intermittently Fail

Throughput Observation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages