- 2025-12-15: Our paper is available on arXiv.
- 2025-12-15: We update the data source.
- 2025-12-15: We release the video reality test repo.
We introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio–visual coupling, featuring the following dimensions:
(i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action–object interactions with diversity across objects, actions, and backgrounds.
(ii) Peer-Review evaluation. An adversarial creator–reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness.
- Public paper
- Public real & AI-generated ASMR dataset (hard)
- Public real & AI-generated ASMR dataset (easy)
- Public video understanding evaluation code
- Publish video generation code
- Adapt the dataset download formats following @NielsRogge's issue
- We release the real ASMR corpus with a total of 149 videos (100 hard level + 49 easy level):
  - real videos (`Real_ASMR/videos`),
  - extracted images (`Real_ASMR/pictures`),
  - and prompts for the hard level (`Real_ASMR_Prompt.csv`: `ref` is the image path, `text` is the prompt; see the loading sketch after this list).
- We release the AI-generated hard level ASMR videos from 13 different video-generation settings with a total of 100 x 13:
- OpenSoraV2 (i2v, t2v, it2v),
- Wan2.2 (A14B-i2v, A14B-t2v, 5B-it2v),
- Sora2 variants (i2v, t2v) (w/o, w/ watermark),
- Veo3.1-fast (i2v),
- Diffsynth-Studio Hunyuan (i2v, t2v) / StepFun (t2v),
- We therefore provide `1 + k` clips per prompt (with `k = 13` fakery families), enabling fine-grained studies of how creators vary while sharing identical textual grounding.
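As a quick sanity check on the released prompt sheet, it can be read with pandas. This is a minimal sketch, not code shipped with the repo; it assumes the archive is unpacked at `Video_Reality_Test/` and relies on the `ref`/`text` columns described above.

```python
# Minimal sketch: inspect the hard-level prompt sheet.
# Assumes the archive is unpacked at Video_Reality_Test/; adjust the path to your setup.
import pandas as pd

df = pd.read_csv("Video_Reality_Test/Real_ASMR_Prompt.csv")
print(df.columns.tolist())            # expected: ['ref', 'text'] per the description above
for _, row in df.head(3).iterrows():
    print(row["ref"], "->", row["text"][:80])
```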
We host the dataset folders on HuggingFace, and both the folders and the compressed archive Video_Reality_Test.tar.gz on ModelScope.
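For reference, a HuggingFace download could look like the following minimal sketch; the `repo_id` below is a placeholder (substitute the actual dataset repository linked above), and `snapshot_download` pulls the folder layout directly.

```python
# Minimal sketch: fetch the dataset folders from HuggingFace.
# The repo_id below is a placeholder -- replace it with the actual dataset repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/Video-Reality-Test",   # hypothetical repo id
    repo_type="dataset",
    local_dir="data/Video_Reality_Test",
)
print("Dataset downloaded to:", local_dir)
```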
The layout below shows how the data is organized once Video_Reality_Test.tar.gz is unpacked.
- `Video_Reality_Test.tar.gz`: a monolithic archive containing every real video, generated video, and metadata file. Use `tar -xzf Video_Reality_Test.tar.gz` to recreate the folder layout described below.
- The folder layout (already unpacked on the ModelScope repo) mirrors the archive, so you can rsync individual generators without downloading the full tarball.
Video_Reality_Test/
├── HunyuanVideo/              # Diffsynth-Studio → Hunyuan generations
├── OpensoraV2/                # OpenSora V2 baselines
├── Real_ASMR/                 # real ASMR hard-level reference videos (+ optional keyframes)
│   ├── videos/
│   └── pictures/
├── Real_ASMR_Prompt.csv       # prompt sheet for the hard level; ref = video filename, text = description
├── Real_ASMR_easy/            # real ASMR easy-level reference videos (+ optional keyframes)
│   ├── videos/
│   ├── pictures/
│   └── prompt.csv             # prompt sheet for the easy level; ref = video filename, text = description
├── Fake_ASMR_easy/            # fake ASMR easy-level videos
│   ├── opensora/              # OpenSora image-to-video outputs
│   ├── opensora_woprompt/     # OpenSora image-to-video outputs without prompt
│   ├── wan/                   # Wan image-to-video outputs
│   ├── wan_woprompt/          # Wan image-to-video outputs without prompt
│   └── prompt.json            # prompt sheet
├── Sora2-it2v/                # Sora2 image-to-video outputs
├── Sora2-it2v-wo-watermark/   # watermark-free variant of the above
├── Sora2-t2v/                 # Sora2 text-to-video runs
├── StepVideo-t2v/             # Diffsynth-Studio → StepFun generations
├── Veo3.1-fast/               # Veo 3.1 fast generations
├── Wan2.2/                    # Wan 2.2 outputs
└── ...
Every generator-specific directory contains clips named after their prompt IDs, so you can align them with Real_ASMR_Prompt.csv for the hard level and prompt.csv for the easy level.
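As an illustration of that alignment, the sketch below pairs the real clip with each generator's clip for one prompt ID. This is a minimal sketch: the `.mp4` extension, the exact filename pattern, and the generator subset are assumptions, not guarantees of the release.

```python
# Minimal sketch: collect the real clip plus each generator's clip for one prompt ID.
# Assumes clips are named "<prompt_id>.mp4"; adjust the globbing to the actual naming scheme.
import csv
from pathlib import Path

ROOT = Path("Video_Reality_Test")

# prompt_id -> prompt text (hard level)
prompts = {}
with open(ROOT / "Real_ASMR_Prompt.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        prompts[Path(row["ref"]).stem] = row["text"]

GENERATORS = ["OpensoraV2", "Wan2.2", "Sora2-t2v", "Veo3.1-fast"]  # illustrative subset

def clips_for(prompt_id: str) -> dict:
    """Return the real clip and any matching generated clips for one prompt ID."""
    clips = {"real": ROOT / "Real_ASMR" / "videos" / f"{prompt_id}.mp4"}
    for gen in GENERATORS:
        matches = sorted((ROOT / gen).rglob(f"{prompt_id}*.mp4"))
        if matches:
            clips[gen] = matches[0]
    return clips

some_id = next(iter(prompts))
print(prompts[some_id])
print(clips_for(some_id))
```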
- `OpenSoraV2` (https://github.com/hpcaitech/Open-Sora) provided most baseline trajectories.
- `Wan2.2` (https://github.com/Wan-Video/Wan2.2) complemented cinematic scenes needing richer lighting.
- `Diffsynth-Studio` generated both `Hunyuan` and `StepFun` variants from identical prompts to compare vendor-specific biases.
- `Sora 2` clips were authored via the official portal at https://openai.com/sora. We use https://magiceraser.org/zh/sora-watermark-remover/ to remove the watermarks on `Sora 2`-generated videos.
- `Veo 3.1 fast` generations came from Google's preview interface at https://deepmind.google/technologies/veo/.
Unless otherwise noted, we kept the native sampler settings of each platform so downstream evaluators see the exact outputs human raters inspected.
- Clone only the evaluation code:

```bash
git clone https://github.com/video-reality-test/video-reality-test.git
```
- Clone the evaluation code and video generation submodules:

```bash
git clone --recurse-submodules https://github.com/video-reality-test/video-reality-test.git
```
Note: If you have already cloned only the evaluation code, additionally run `git submodule update --init --recursive` to fetch the submodules.
- Install dependencies:

```bash
conda create -n vrt python=3.10 -y
conda activate vrt
pip install -r requirements.txt
```
- Download a dataset split (choose one link at the top) and extract it under `data/`. Update the `data_path` in `eval_judgement.py` and `eval_judgement_audio.py` so the scripts can locate the unpacked files.
- Open `eval_judgement.py` and `eval_judgement_audio.py`, set the required API key/token variables and `MODEL_NAME` placeholders at the top of each file to match the provider you are evaluating. Without this step the scripts will exit immediately:

```python
api_key = "your_api_key_here"
model_name = "gemini-2.5-flash"
```
Additionally, set your evaluation dataset path `{/path/to/judgement/dataset/}/xxx.mp4` and your results save path as follows:

```python
# save results path
save_path_root = f"save/path/root/{model_name}/"
# test data path
data_path = "/path/to/judgement/dataset/"
```
- Launch the evaluators:

```bash
# video reality test for visual only
python eval_judgement.py

# video reality test for visual+audio
# NOTE: multi-modal (image+text+audio) inputs currently only work with Gemini 2.5 Pro or Gemini 2.5 Flash APIs.
python eval_judgement_audio.py
```
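For orientation, a single reviewer call might look like the sketch below. This is a minimal illustration using the `google-generativeai` client, not the repository's `eval_judgement.py`; the prompt wording, model id, and file path are assumptions.

```python
# Minimal sketch: ask a Gemini model whether one clip looks real or AI-generated.
# Prompt wording, model id, and file path are placeholders.
import time
import google.generativeai as genai

genai.configure(api_key="your_api_key_here")
model = genai.GenerativeModel("gemini-2.5-flash")

def judge_clip(video_path: str) -> str:
    # Upload the clip and wait until server-side processing finishes.
    video_file = genai.upload_file(video_path)
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)

    question = (
        "You are a reviewer in a video reality test. "
        "Is this ASMR clip a real recording or AI-generated? "
        "Answer 'real' or 'fake' with a one-sentence justification."
    )
    response = model.generate_content([video_file, question])
    return response.text

print(judge_clip("data/Real_ASMR/videos/0001.mp4"))
```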
The video understanding peer-review results are as follows, where Gemini-3-Preview is the best model (for details, refer to our paper):
Please cite the Video Reality Test paper when using this benchmark:
@misc{wang2025videorealitytestaigenerated,
title={Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?},
author={Jiaqi Wang and Weijia Wu and Yi Zhan and Rui Zhao and Ming Hu and James Cheng and Wei Liu and Philip Torr and Kevin Qinghong Lin},
year={2025},
eprint={2512.13281},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.13281},
}