Qwen3-VL #1063

@ridcl

Is your feature request related to a problem? Please describe.

Tunix currently supports text-only Qwen3 but not the multimodal Qwen3-VL, which makes it harder to compare the performance of different VLMs on vision-language tasks.

Describe the solution you'd like

According to the technical report, Qwen3-VL adopts a three-module architecture comprising a vision encoder, an MLP-based vision–language merger, and a large language model (LLM). The vision encoder is SigLIP2, which we already have as WIP in #511.

It's also worth mentioning:

  • Interleaved MRoPE, which we don't seem to support yet
  • DeepStack for the vision-language merger, which is a bit more complicated than what we have for Gemma 3
  • Video timestamp
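
To make the interleaved MRoPE point concrete, here is a rough sketch of how I understand it differs from the chunked MRoPE layout: instead of giving the temporal, height, and width axes contiguous blocks of rotary frequencies, the axes alternate across the frequency indices, so each axis sees both high and low frequencies. The function names and the toy section sizes are made up for illustration and are not Tunix or Qwen APIs. The sketch also assumes the `t` section is at least as large as the `h` and `w` sections.

```python
def chunked_layout(sections):
    """Contiguous layout: all t frequencies first, then h, then w."""
    t, h, w = sections
    return ["t"] * t + ["h"] * h + ["w"] * w


def interleaved_layout(sections):
    """Interleaved layout: t, h, w alternate; leftover slots stay with t.

    Assumption for this sketch: t >= h and t >= w, so every index is in range.
    """
    t, h, w = sections
    layout = ["t"] * (t + h + w)
    for axis, count, offset in (("h", h, 1), ("w", w, 2)):
        # Every third slot, starting at `offset`, belongs to this axis.
        for i in range(offset, count * 3, 3):
            layout[i] = axis
    return layout


def rotary_angles(position, layout, base=10000.0):
    """Rotary angle per frequency index for a {"t", "h", "w"} position."""
    n = len(layout)  # number of frequencies, i.e. head_dim // 2
    return [position[axis] * base ** (-i / n) for i, axis in enumerate(layout)]


# Toy sizes (hypothetical): 4 temporal, 2 height, 2 width frequencies.
sections = (4, 2, 2)
print(chunked_layout(sections))
print(interleaved_layout(sections))
```

For text tokens, where the three position ids are equal, both layouts produce the same angles, so only tokens with distinct `t`/`h`/`w` positions (images and video) are affected by the layout change.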

Additional context

A couple of design questions:

  1. Should we wait for #511 ("Uiuc vlm pr compressed fixed") to be merged, or start work on Qwen3-VL in parallel? @abheesht17, what do you think?
  2. Should we extend the (text-only) Qwen3 model or create an entirely new one? I'm not sure DeepStack can be integrated easily without changing how the text-only version works.
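
On question 2, one possible shape for the "extend Qwen3" option, sketched under my understanding that DeepStack adds visual features from several vision-encoder levels to the hidden states at the visual token positions after the first few decoder layers. Everything here (`run_decoder`, `add_visual`, the argument names) is hypothetical and only meant to show that the hook could be optional, leaving the text-only path untouched when no visual features are passed.

```python
def add_visual(hidden, visual_positions, feats):
    """Return a copy of `hidden` with `feats` added at the visual token positions."""
    out = [list(vec) for vec in hidden]
    for pos, feat in zip(visual_positions, feats):
        out[pos] = [a + b for a, b in zip(out[pos], feat)]
    return out


def run_decoder(hidden, layers, visual_positions=None, deepstack_feats=None):
    """Run decoder layers, optionally injecting DeepStack-style features.

    hidden:           list of per-token vectors (lists of floats)
    layers:           list of callables mapping hidden -> hidden
    visual_positions: indices of visual tokens in the sequence
    deepstack_feats:  one list of per-visual-token vectors for each of the
                      first len(deepstack_feats) layers (assumed layout)
    """
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        # Text-only calls skip this branch entirely.
        if deepstack_feats is not None and idx < len(deepstack_feats):
            hidden = add_visual(hidden, visual_positions, deepstack_feats[idx])
    return hidden


# Toy usage with identity layers: token 1 is the only visual token.
identity = lambda h: h
tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
print(run_decoder(tokens, [identity, identity]))
print(run_decoder(tokens, [identity, identity],
                  visual_positions=[1], deepstack_feats=[[[10.0, 10.0]]]))
```

If something like this works, the text-only model only gains two optional arguments. If the injection needs deeper changes to the layer internals, a separate model class is probably cleaner.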

Checklist

  • I have searched the existing issues for similar feature requests.
  • This is not a support question (please use the "bug template" for that).
