User: Describe the video.
Answer: The video shows a man performing a martial arts move in a large, open room with a light blue wall and green mat flooring.
- The man is wearing dark clothing.
- He begins by standing on his right foot, then performs a high kick with his left leg while simultaneously bending forward at the waist.
- As he executes this movement, he appears to be performing a martial arts technique that involves a combination of kicks and body control.
- After completing the move, he falls backwards onto the mat, landing on his back with his legs extended.
The video captures the man's dynamic motion as he performs the martial arts move.
Paper: Qwen3 Technical Report
Hugging Face: https://huggingface.co/collections/Qwen/qwen3-vl
LLMs (Large Language Models) are neural networks trained on large text datasets to understand and generate language.
VLMs (Vision-Language Models) add a visual encoder so the model can process images and text together.
A combined VLM+LLM system is often referred to as a multimodal model.
These models can be large—hundreds of millions to billions of parameters—which impacts accuracy, memory use, and runtime speed.
On edge devices like the RK3588, available RAM and compute are limited, and even the NPU has strict constraints on supported operations.
Because of this, models typically need to be quantised or simplified to fit.
Performance is usually expressed in tokens per second, where a token is roughly a word or word fragment.
Once converted to RKNN, parts of the model can run on the NPU, improving speed.
To process video input, individual frames are first extracted. The VLM converts each frame into embeddings, which are then transformed into vision tokens — as illustrated in the schema above.
Even on a desktop PC, this process places a heavy load on memory and CUDA resources. It’s therefore no surprise that the Rock 5C, with its more limited hardware, struggles even more.
The RKLLM library supports a maximum of 4092 tokens in total. Each frame produces about 200 vision tokens, which limits processing to roughly 20 frames per video. To stay within this constraint, the video is subsampled and evenly spaced frames are extracted for processing by Qwen3.
For reference, the vision tokens of each frame occupy around 20 MB of RAM, a detail worth keeping in mind when working on systems with limited memory.
All models below can handle multiple frames. The best performer is Qwen3, which also serves as the 'parent' of InternVL3.5 and SmolVLM2.
All LLM models are quantized to w8a8, while the VLM vision encoders use fp16.
| model | RAM (GB)¹ | LLM cold (s)² | LLM warm (s)³ | VLM cold (s)² | VLM warm (s)³ | Resolution | Tokens/s |
|---|---|---|---|---|---|---|---|
| Qwen3-2B | 3.1 | 21.9 | 2.6 | 10.0 | 0.9 | 448 x 448 | 11.5 |
| Qwen3-4B | 8.7 | 49.6 | 5.6 | 10.6 | 1.1 | 448 x 448 | 5.7 |
| InternVL3.5-1B | 1.9 | 8.3 | 8.0 | 1.5 | 0.8 | 448 x 448 | 24 |
| InternVL3.5-2B | 3.0 | 22 | 8.0 | 2.7 | 0.8 | 448 x 448 | 11.2 |
| InternVL3.5-4B | 5.4 | 50 | 8.0 | 5.9 | 0.8 | 448 x 448 | 5 |
| InternVL3.5-8B | 8.8 | 92 | 8.0 | 50.5 | 5.8 | 448 x 448 | 3.5 |
| SmolVLM2-2.2B | 3.4 | 21.2 | 2.6 | 10.5 | 0.9 | 384 x 384 | 11 |
| SmolVLM2-500M | 0.8 | 4.8 | 0.7 | 2.5 | 0.25 | 384 x 384 | 31 |
| SmolVLM2-256M | 0.5 | 1.1 | 0.4 | 2.5 | 0.25 | 384 x 384 | 54 |
¹ Total memory used: the LLM plus the VLM.
² When an LLM/VLM model is loaded for the first time from disk into RAM or onto the NPU, it is called a cold start.
The duration depends on your OS, I/O transfer rate, and memory mapping.
³ Subsequent loads (warm starts) take advantage of the data already mapped in RAM; mostly, only a few pointers need to be restored.
To run the application, you need:
- OpenCV (64-bit) installed.
- The rkllm library.
- The rknn library.
- Optional: Code::Blocks.
$ sudo apt-get install codeblocks
Start with the usual
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install cmake wget curl
To install OpenCV on your SBC, follow the Raspberry Pi 4 guide.
Or, if you don't intend to write OpenCV code yourself:
$ sudo apt-get install libopencv-dev
$ git clone https://github.com/Qengineering/Qwen3-VL-NPU-VIDEO
To run Qwen3-VL, you need to have the rkllm-runtime library version 1.2.3 (or higher) installed, as well as the rknpu driver version 0.9.8.
If you don't have these on your machine, or if you have a lower version, you need to install them.
We have provided the correct versions in the repo.
$ cd ./Qwen3-VL-2B-NPU/aarch64/library
$ sudo cp ./*.so /usr/local/lib
$ cd ../include
$ sudo cp ./*.h /usr/local/include
The next step is downloading the models.
All models are on our Sync.com server. Please look at the repository of your choice for the appropriate links.
For instance, Qwen3-VL-2B:
qwen3-vl-2b-instruct_w8a8_rk3588.rkllm and qwen3-vl-2b-vision_rk3588.rknn
Copy both into this folder.
Once you have the two models, it is time to build your application.
You can use Code::Blocks.
- Load the project file *.cbp in Code::Blocks.
- Select Release, not Debug.
- Compile and run with F9.
- You can alter command line arguments with Project -> Set programs arguments...
Or use CMake.
$ mkdir build
$ cd build
$ cmake ..
$ make -j4
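The repository ships its own build files; as a rough reference, a minimal CMakeLists.txt for an app combining OpenCV with the rkllm and rknn runtimes could look like the sketch below. The source file name and the runtime library names (`rkllmrt`, `rknnrt`) are assumptions; check the repo's own CMakeLists.txt for the actual values.

```cmake
cmake_minimum_required(VERSION 3.10)
project(VLM_VIDEO_NPU)

set(CMAKE_CXX_STANDARD 17)

# OpenCV for video decoding and image preprocessing
find_package(OpenCV REQUIRED)

add_executable(VLM_VIDEO_NPU main.cpp)   # source file name assumed

target_include_directories(VLM_VIDEO_NPU PRIVATE
    ${OpenCV_INCLUDE_DIRS}
    /usr/local/include)                  # rkllm/rknn headers copied earlier

target_link_libraries(VLM_VIDEO_NPU
    ${OpenCV_LIBS}
    rkllmrt                              # rkllm runtime (name assumed)
    rknnrt                               # rknn runtime (name assumed)
    pthread)
```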
The app supports two modes:
./VLM_VIDEO_NPU RKNN_model RKLLM_model Video Frames

| Argument | Comment |
|---|---|
| RKNN_model | The visual encoder model (vlm) |
| RKLLM_model | The large language model (llm) |
| Video | The video. Can be mp4, avi, mov, or mkv |
| Frames | Optional, default 8 |
Each frame adds about 2.24 seconds of VLM processing time.
Increasing the number of frames also extends the overall thinking time, since all corresponding vision tokens must be processed.
./VLM_VIDEO_NPU RKNN_model RKLLM_model File1 File2 File3 ... FileX

| Argument | Comment |
|---|---|
| RKNN_model | The visual encoder model (vlm) |
| RKLLM_model | The large language model (llm) |
| File | An individual image file. |
Each file adds about 2.24 seconds of VLM processing time.
Increasing the number of frames also extends the overall thinking time, since all corresponding vision tokens must be processed.
Using the application is simple. Once you provide the video and the models, you can ask anything you want.
Remember, we are on a bare Rock5C, so don't expect the same quality answers as ChatGPT can provide.
If you want to talk about the video, you need to include the token <image> in your prompt once.
The app remembers the dialogue until you give the token <clear>.
With <exit>, you leave the application.
To get a taste, try our professional Qwen3 AI-chatbot running on a Rock 5C: https://rock5gpt.qengineering.eu