
Qwen3-VL NPU VIDEO

(demo sequence animation)

User: Describe the video.

Answer: The video shows a man performing a martial arts move in a large, open room with a light blue wall and green mat flooring.

  • The man is wearing dark clothing.
  • He begins by standing on his right foot, then performs a high kick to his left leg while simultaneously bending forward at the waist.
  • As he executes this movement, he appears to be performing a martial arts technique that involves a combination of kicks and body control.
  • After completing the move, he falls backwards onto the mat, landing on his back with his legs extended.

The video captures the man's dynamic motion as he performs the martial arts move.


Qwen3 VLM VIDEO for RK3588 NPU (Rock 5, Orange Pi 5).


Paper: Qwen3 Technical Report

Hugging Face: https://huggingface.co/collections/Qwen/qwen3-vl


Introduction

LLMs (Large Language Models) are neural networks trained on large text datasets to understand and generate language.
VLMs (Vision-Language Models) add a visual encoder so the model can process images and text together.
A combined VLM+LLM system is often referred to as a multimodal model.

These models can be large—hundreds of millions to billions of parameters—which impacts accuracy, memory use, and runtime speed.
On edge devices like the RK3588, available RAM and compute are limited, and even the NPU has strict constraints on supported operations.
Because of this, models typically need to be quantized or simplified to fit.

Performance is usually expressed in tokens (roughly words) per second.
Once converted to RKNN format, parts of the model can run on the NPU, improving speed.

(Qwen3 VLM pipeline chart)

❗Showstopper❗

To process video input, individual frames are first extracted. The VLM converts each frame into embeddings, which are then transformed into vision tokens — as illustrated in the schema above.
Even on a desktop PC, this process places a heavy load on memory and CUDA resources. It’s therefore no surprise that the Rock 5C, with its more limited hardware, struggles even more.
The RKLLM library supports a maximum of 4092 tokens in total. Each frame produces about 200 vision tokens, which limits processing to roughly 20 frames per video. To stay within this constraint, the video is subsampled: evenly spaced frames are extracted for processing by Qwen3.

For reference, each processed frame occupies around 20 MB of RAM, a detail worth keeping in mind when working on systems with limited memory.


Model performance benchmark

All models below can handle multiple frames. Qwen3 gives the best answer quality, being the 'parent' model of InternVL3.5 and SmolVLM2.

All LLM models are quantized to w8a8, while the VLM vision encoders use fp16.

| Model | RAM (GB)¹ | LLM cold (s)² | LLM warm (s)³ | VLM cold (s)² | VLM warm (s)³ | Resolution | Tokens/s |
|---|---|---|---|---|---|---|---|
| Qwen3-2B | 3.1 | 21.9 | 2.6 | 10.0 | 0.9 | 448 x 448 | 11.5 |
| Qwen3-4B | 8.7 | 49.6 | 5.6 | 10.6 | 1.1 | 448 x 448 | 5.7 |
| InternVL3.5-1B | 1.9 | 8.3 | 8.0 | 1.5 | 0.8 | 448 x 448 | 24 |
| InternVL3.5-2B | 3.0 | 22 | 8.0 | 2.7 | 0.8 | 448 x 448 | 11.2 |
| InternVL3.5-4B | 5.4 | 50 | 8.0 | 5.9 | 0.8 | 448 x 448 | 5 |
| InternVL3.5-8B | 8.8 | 92 | 8.0 | 50.5 | 5.8 | 448 x 448 | 3.5 |
| SmolVLM2-2.2B | 3.4 | 21.2 | 2.6 | 10.5 | 0.9 | 384 x 384 | 11 |
| SmolVLM2-500M | 0.8 | 4.8 | 0.7 | 2.5 | 0.25 | 384 x 384 | 31 |
| SmolVLM2-256M | 0.5 | 1.1 | 0.4 | 2.5 | 0.25 | 384 x 384 | 54 |

¹ The total used memory: LLM plus VLM.
² Cold start: the first time an LLM/VLM model is loaded from disk into RAM or the NPU. The duration depends on your OS, I/O transfer rate, and memory mapping.
³ Warm start: subsequent loads take advantage of the data already mapped in RAM. Mostly, only a few pointers need to be restored.

(benchmark plots of the table above)


Dependencies.

To run the application, you need:

  • OpenCV 64-bit installed.
  • The rkllm library.
  • The rknn library.
  • Optional: Code::Blocks ($ sudo apt-get install codeblocks).

Installing the dependencies.

Start with the usual

$ sudo apt-get update 
$ sudo apt-get upgrade
$ sudo apt-get install cmake wget curl

OpenCV

To install OpenCV on your SBC, follow the Raspberry Pi 4 guide.

Or, if you don't intend to write code yourself:

$ sudo apt-get install libopencv-dev 

Installing the app.

$ git clone https://github.com/Qengineering/Qwen3-VL-NPU-VIDEO

RKLLM, RKNN

To run Qwen3-VL, you need the rkllm-runtime library version 1.2.3 (or higher) installed, as well as the rknpu driver version 0.9.8.
If you don't have these on your machine, or if your versions are lower, you need to install them.
We have provided the correct versions in the repo.

$ cd ./Qwen3-VL-NPU-VIDEO/aarch64/library
$ sudo cp ./*.so /usr/local/lib
$ sudo ldconfig
$ cd ../include
$ sudo cp ./*.h /usr/local/include

Download the LLM and VLM models.

The next step is downloading the models.
All models are on our Sync.com server. Please look at the repository of your choice for the appropriate links.
For instance, Qwen3-VL-2B:
qwen3-vl-2b-instruct_w8a8_rk3588.rkllm and qwen3-vl-2b-vision_rk3588.rknn
Copy both into the repository folder.

Building the app.

Once you have the two models, it is time to build your application.
You can use Code::Blocks.

  • Load the project file *.cbp in Code::Blocks.
  • Select Release, not Debug.
  • Compile and run with F9.
  • You can alter command line arguments with Project -> Set programs arguments...

Or use CMake:

$ mkdir build
$ cd build
$ cmake ..
$ make -j4

Running the app.

The app supports two modes:

Video.

./VLM_VIDEO_NPU RKNN_model RKLLM_model Video Frames

| Argument | Comment |
|---|---|
| RKNN_model | The visual encoder model (VLM) |
| RKLLM_model | The large language model (LLM) |
| Video | The video; can be mp4, avi, mov, or mkv |
| Frames | Optional; default 8 |

Each frame requires about 2.24 seconds of VLM loading time.
Increasing the number of frames also extends the overall thinking time, since all corresponding vision tokens must be processed.

Files.

./VLM_VIDEO_NPU RKNN_model RKLLM_model File1 File2 File3 ... FileX

| Argument | Comment |
|---|---|
| RKNN_model | The visual encoder model (VLM) |
| RKLLM_model | The large language model (LLM) |
| FileX | An individual image file |

Each file requires about 2.24 seconds of VLM loading time.
Increasing the number of files also extends the overall thinking time, since all corresponding vision tokens must be processed.

Using the app.

Using the application is simple. Once you provide the video and the models, you can ask anything you want.
Remember, we are on a bare Rock5C, so don't expect the same quality answers as ChatGPT can provide.
If you want to talk about the video, you need to include the token <image> in your prompt once.
The app remembers the dialogue until you give the token <clear>.
With <exit>, you leave the application.


To get a taste, try our professional Qwen3 AI-chatbot running on a Rock 5C: https://rock5gpt.qengineering.eu


