Small reproducible cases for an OpenAI-compatible chat completions endpoint.
Set BASE_URL and MODEL in your environment before running. If the endpoint
requires auth, also set API_KEY. Do not commit endpoint URLs, model IDs, or
credentials if they are not meant to be public.
Run all cases:
BASE_URL="https://example.com/v1" \
MODEL="your-model-id" \
python3 run_cases.pyRun with thinking enabled and a shared token budget:
BASE_URL="https://example.com/v1" \
MODEL="your-model-id" \
ENABLE_THINKING=true \
MAX_TOKENS=512 \
python3 run_cases.pySee observed_failures.json for structured examples of failure responses captured during local testing.
Prompt:
你好,2+3等于几?只回答数字。
Expected:
5
Observed in one run:
{
"content": "",
"prompt_tokens": 17,
"completion_tokens": 1,
"total_tokens": 18,
"finish_reason": "stop"
}Same prompt and request settings also produced a garbled completion in another run:
!5.83和1:h3+83 are4-01, Campbell.02.t2 More2 but7 .. ...0;0.531678838 or5确切6659???? ))
7e5-at-O5434059 +5 user
The response used all max_tokens=64 and ended with finish_reason=length.
With:
{
"temperature": 0,
"max_tokens": 64,
"chat_template_kwargs": {
"enable_thinking": true
}
}Both short math prompts repeatedly spent the full completion budget on the reasoning field and returned no final content:
{
"content": null,
"finish_reason": "length",
"completion_tokens": 64
}The reasoning text often contained the correct computation, but the final answer was never emitted before the token budget ended.
For the Chinese short math prompt, increasing max_tokens helped but did not
fully eliminate failures until the largest tested budget:
enable_thinking=true, temperature=0
case max_tokens correct null_content length_finish
zh_math_short 512 3/5 1 1
en_math_short 512 5/5 0 0
zh_math_short 1024 3/5 1 0
en_math_short 1024 5/5 0 0
zh_math_short 2048 5/5 0 0
en_math_short 2048 5/5 0 0
Observed failure modes at 512/1024 included:
content=nullwhile the reasoning field contained the answer.- garbled content despite
temperature=0. - completions running to
finish_reason=length.
Longer generation cases were mostly stable and measured roughly 39 completion tokens/sec in local tests. Short completions should not be used for throughput because fixed request latency dominates.