VLM Context size for local models (e.g. Ollama) #2784
Replies: 2 comments
Hi @98h398hrpohpoai! I'm Dosu and I'm helping the docling team.

Yes, you can adjust the context size for local VLM models like Ollama in Docling, but it isn't exposed as a dedicated option in the UI or CLI. Instead, you need to set the `max_tokens` parameter manually in the `params` dictionary when creating your `ApiVlmOptions`. For example:

```python
from docling.datamodel.pipeline_options import ApiVlmOptions, ResponseFormat

options = ApiVlmOptions(
    url="http://localhost:11434/v1/chat/completions",
    params={
        "model": "your-ollama-model",
        "max_tokens": 2048,  # set your desired context size here
    },
    prompt="Your prompt here",
    timeout=90,
    response_format=ResponseFormat.MARKDOWN,
)
```

This `params` dictionary is passed directly to the Ollama API, so you can include any supported Ollama parameters there. The default context size of 4096 only applies to inline models unless you override it for Ollama this way. Just make sure the model you're using with Ollama supports the context size you specify. Ollama's API sometimes has quirks with OpenAI compatibility, but those don't affect context size configuration itself.

To reply, just mention @dosu.
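For reference, here is a sketch of how such `ApiVlmOptions` typically plug into a converter, modeled on docling's VLM pipeline example; exact import paths and class names can differ between docling versions, so verify against the release you have installed:

```python
# Sketch: wiring the ApiVlmOptions defined above into a DocumentConverter.
# Names follow docling's VLM pipeline example and may vary by version.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # needed for API-backed (non-inline) models
)
pipeline_options.vlm_options = options  # the ApiVlmOptions from the snippet above

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("document.pdf")  # hypothetical input file
print(result.document.export_to_markdown())
```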
Hi, and thanks! I've tried that; however, the context remains at the default 4096 (confirmed with `ollama ps`).

Edit: The workaround for now is to set the Ollama context size globally (e.g. `Environment="OLLAMA_CONTEXT_LENGTH=98304"` in the Ollama systemd unit), but this isn't a great long-term solution, and it doesn't scale well across multiple tools/agents/etc.
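A likely explanation for why `max_tokens` has no effect: in OpenAI-style APIs it caps the *response* length, while Ollama's context window is controlled by its `num_ctx` option. `num_ctx` is accepted per request by Ollama's native `/api/chat` endpoint, though the OpenAI-compatible `/v1/chat/completions` endpoint that docling calls may silently ignore it. A minimal sketch to check whether your Ollama instance honors a per-request context size (the model name is a placeholder):

```python
# Hedged sketch: ask Ollama's *native* chat API for a larger context
# window via the documented "options.num_ctx" field.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",  # native endpoint, not /v1
    json={
        "model": "your-ollama-model",  # placeholder: substitute your model
        "messages": [{"role": "user", "content": "Hello"}],
        "options": {"num_ctx": 98304},  # per-request context window
        "stream": False,
    },
    timeout=90,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
# Afterwards, `ollama ps` should show the model loaded with the larger context.
```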
Is it possible to set the context size when using a VLM with local options like Ollama?
The default is 4096, which seems low, and I haven't seen a way to adjust that specifically in the docs.
I tried the options for the remote models, but those didn't apply.