User: Describe the video.
Answer: The video shows a man performing a martial arts move in a large, open room with a light blue wall and green mat flooring.
- The man is wearing dark clothing.
- He begins by standing on his right foot, then performs a high kick with his left leg while simultaneously bending forward at the waist.
- As he executes this movement, he appears to be performing a martial arts technique that involves a combination of kicks and body control.
- After completing the move, he falls backwards onto the mat, landing on his back with his legs extended.
The video captures the man's dynamic motion as he performs the martial arts move.
Paper: Qwen3 Technical Report
Hugging Face: https://huggingface.co/collections/Qwen/qwen3-vl
LLMs (Large Language Models) are neural networks trained on large text datasets to understand and generate language.
VLMs (Vision-Language Models) add a visual encoder so the model can process images and text together.
A combined VLM+LLM system is often referred to as a multimodal model.
These models can be large—hundreds of millions to billions of parameters—which impacts accuracy, memory use, and runtime speed.
On edge devices like the RK3588, available RAM and compute are limited, and even the NPU has strict constraints on supported operations.
Because of this, models typically need to be quantised or simplified to fit.
Performance is usually expressed in tokens per second, where a token is roughly a word or word fragment.
Once converted to RKNN, parts of the model can run on the NPU, improving speed.
To process video input, individual frames are first extracted. The VLM converts each frame into embeddings, which are then transformed into vision tokens — as illustrated in the schema above.
Even on a desktop PC, this process places a heavy load on memory and CUDA resources. It’s therefore no surprise that the Rock 5C, with its more limited hardware, struggles even more.
The RKLLM library supports a maximum of 4092 tokens in total. Each frame produces about 200 vision tokens, which limits processing to roughly 20 frames per video. To stay within this constraint, the video is subsampled and evenly spaced frames are extracted for processing by Qwen3.
For reference, the vision tokens of each frame occupy around 20 MB of RAM, a detail worth keeping in mind when working on systems with limited memory.
All models below can handle multiple frames. The best performer is Qwen3, which also serves as the 'parent' of InternVL3.5 and SmolVLM2.
All LLM models are quantized to w8a8, while the VLM vision encoders use fp16.
| model | RAM (GB)¹ | LLM cold (s)² | LLM warm (s)³ | VLM cold (s)² | VLM warm (s)³ | Resolution | Tokens/s |
|---|---|---|---|---|---|---|---|
| Qwen3-2B | 3.1 | 21.9 | 2.6 | 10.0 | 0.9 | 448 x 448 | 11.5 |
| Qwen3-4B | 8.7 | 49.6 | 5.6 | 10.6 | 1.1 | 448 x 448 | 5.7 |
| InternVL3.5-1B | 1.9 | 8.3 | 8.0 | 1.5 | 0.8 | 448 x 448 | 24 |
| InternVL3.5-2B | 3.0 | 22 | 8.0 | 2.7 | 0.8 | 448 x 448 | 11.2 |
| InternVL3.5-4B | 5.4 | 50 | 8.0 | 5.9 | 0.8 | 448 x 448 | 5 |
| InternVL3.5-8B | 8.8 | 92 | 8.0 | 50.5 | 5.8 | 448 x 448 | 3.5 |
| SmolVLM2-2.2B | 3.4 | 21.2 | 2.6 | 10.5 | 0.9 | 384 x 384 | 11 |
| SmolVLM2-500M | 0.8 | 4.8 | 0.7 | 2.5 | 0.25 | 384 x 384 | 31 |
| SmolVLM2-256M | 0.5 | 1.1 | 0.4 | 2.5 | 0.25 | 384 x 384 | 54 |
¹ Total memory used: the LLM plus the VLM.
² When an LLM/VLM model is loaded for the first time from disk into RAM or onto the NPU, it is called a cold start.
The duration depends on your OS, I/O transfer rate, and memory mapping.
³ Subsequent loads (warm starts) take advantage of the data already mapped in RAM; mostly, only a few pointers need to be restored.
To run the application, you need:
- OpenCV (64-bit) installed.
- The rkllm library.
- The rknn library.
- Optional: Code::Blocks.
$ sudo apt-get install codeblocks
Start with the usual
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install cmake wget curl
To install OpenCV on your SBC, follow the Raspberry Pi 4 guide.
Or, if you don't intend to write OpenCV code yourself:
$ sudo apt-get install libopencv-dev
$ git clone https://github.com/Qengineering/Qwen3-VL-NPU-VIDEO
To run Qwen3-VL, you need to have the rkllm-runtime library version 1.2.3 (or higher) installed, as well as the rknpu driver version 0.9.8.
If you don't have these on your machine, or if you have a lower version, you need to install them.
We have provided the correct versions in the repo.
$ cd ./Qwen3-VL-2B-NPU/aarch64/library
$ sudo cp ./*.so /usr/local/lib
$ cd ../include
$ sudo cp ./*.h /usr/local/include
The next step is downloading the models.
All models are on our Sync.com server. Please look at the repository of your choice for the appropriate links.
For instance, Qwen3-VL-2B:
qwen3-vl-2b-instruct_w8a8_rk3588.rkllm and qwen3-vl-2b-vision_rk3588.rknn
Copy both into this folder.
Once you have the two models, it is time to build your application.
You can use Code::Blocks.
- Load the project file *.cbp in Code::Blocks.
- Select Release, not Debug.
- Compile and run with F9.
- You can alter command line arguments with Project -> Set programs arguments...
Or use CMake.
$ mkdir build
$ cd build
$ cmake ..
$ make -j4
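The repository ships its own build files; as a rough reference, a minimal CMakeLists.txt for an app combining OpenCV with the rkllm and rknn runtimes could look like the sketch below. The source file name and the runtime library names (`rkllmrt`, `rknnrt`) are assumptions; check the repo's own CMakeLists.txt for the actual values.

```cmake
cmake_minimum_required(VERSION 3.10)
project(VLM_VIDEO_NPU)

set(CMAKE_CXX_STANDARD 17)

# OpenCV for video decoding and image preprocessing
find_package(OpenCV REQUIRED)

add_executable(VLM_VIDEO_NPU main.cpp)   # source file name assumed

target_include_directories(VLM_VIDEO_NPU PRIVATE
    ${OpenCV_INCLUDE_DIRS}
    /usr/local/include)                  # rkllm/rknn headers copied earlier

target_link_libraries(VLM_VIDEO_NPU
    ${OpenCV_LIBS}
    rkllmrt                              # rkllm runtime (name assumed)
    rknnrt                               # rknn runtime (name assumed)
    pthread)
```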
The app supports two modes:
./VLM_VIDEO_NPU RKNN_model RKLLM_model Video Frames

| Argument | Comment |
|---|---|
| RKNN_model | The visual encoder model (vlm) |
| RKLLM_model | The large language model (llm) |
| Video | The video. Can be mp4, avi, mov, or mkv |
| Frames | Optional, default 8 |
Each frame adds about 2.24 seconds of VLM processing time.
Increasing the number of frames also extends the overall thinking time, since all corresponding vision tokens must be processed.
./VLM_VIDEO_NPU RKNN_model RKLLM_model File1 File2 File3 ... FileX

| Argument | Comment |
|---|---|
| RKNN_model | The visual encoder model (vlm) |
| RKLLM_model | The large language model (llm) |
| File | An individual image file. |
Each file adds about 2.24 seconds of VLM processing time.
Increasing the number of frames also extends the overall thinking time, since all corresponding vision tokens must be processed.
Using the application is simple. Once you provide the video and the models, you can ask anything you want.
Remember, we are on a bare Rock5C, so don't expect the same quality answers as ChatGPT can provide.
If you want to talk about the video, you need to include the token <image> in your prompt once.
The app remembers the dialogue until you give the token <clear>.
With <exit>, you leave the application.
To get a taste, try our professional Qwen3 AI-chatbot running on a Rock 5C: https://rock5gpt.qengineering.eu