Fix/omniparser predict refactor #529
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I generated this summary from my session during coding. but all the relevant documentation is linked below. I wanted to stress that the main issue was the input/ output mismatch.
And to test this:
and for anthropic:
OmniParser: Migration from
aresponsestoacompletionProblem
OmniParser loop was using
litellm.aresponses()which is buggy and poorly supported across providers, causing errors like:Root Cause
Per LiteLLM documentation:
aresponses(): Uses non-standardinputparameter, provider-specific format, minimal documentationacompletion(): Uses standardmessagesparameter, universal provider support, extensive documentationThe moondream3 loop already uses
acompletion()successfully.Changes Applied
1. API Call Migration (lines 344, 330, 377)
Why: Standard completion API is stable across all providers (OpenAI, Anthropic, Gemini, etc.)
2. Tool Schema Format (lines 23-85)
Why: Completion API requires schema wrapped in
functionkey per Anthropic tool calling docs.3. Message Format Conversion (lines 370-372, 407-411)
Why: Following moondream3 pattern (lines 398-441). Uses proven helpers from
responses.py.4. Coordinate Normalization (lines 305-333)
Why: OmniParser returns normalized coordinates but computer handler expects pixels. Pattern from anthropic.py:72-77 and openai.py:20-25.
5. Annotated Image Injection (lines 335-337)
Why: LLM must see numbered overlays to choose valid element IDs (1-58) instead of hallucinating IDs like 535.
6. Element ID Conversion (lines 422-445)
Why: Mirrors moondream3's
convert_computer_calls_desc2xy()pattern (responses.py:305-351).7. Schema Required Fields (line 82)
Why: Anthropic strictly follows JSON schema. Without
element_idinrequired, Claude treats it as optional per Anthropic tool schema spec.Results
acompletionReferences