
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Updates

  • 2025-12-15: Our paper is available on arXiv.
  • 2025-12-15: We updated the data source.
  • 2025-12-15: We released the Video Reality Test repo.

1. Brief Introduction

We introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio–visual coupling, featuring the following dimensions:

(i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action–object interactions with diversity across objects, actions, and backgrounds.

(ii) Peer-Review evaluation. An adversarial creator–reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness.
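
To make the protocol concrete, the sketch below is a minimal, hypothetical illustration of the creator–reviewer loop (the creator and reviewer callables are placeholders, not this repository's API): each creator renders a clip for a shared prompt, each reviewer labels the clip real or fake, and a creator is scored by how often it fools the reviewers.

    # Minimal sketch of the adversarial creator–reviewer protocol.
    # `creators` and `reviewers` are hypothetical callables, not this repo's API.
    from typing import Callable, Dict, List

    def peer_review(
        prompts: List[str],
        creators: Dict[str, Callable[[str], str]],   # name -> (prompt -> path of generated clip)
        reviewers: Dict[str, Callable[[str], str]],  # name -> (clip path -> "real" or "fake")
    ) -> Dict[str, float]:
        """Per creator, the fraction of clips that reviewers judged 'real' (fooling rate)."""
        fooled = {name: 0 for name in creators}
        total = len(prompts) * len(reviewers)
        for prompt in prompts:
            for creator_name, generate_clip in creators.items():
                clip_path = generate_clip(prompt)        # creator tries to look real
                for judge_clip in reviewers.values():
                    if judge_clip(clip_path) == "real":  # reviewer failed to spot the fake
                        fooled[creator_name] += 1
        return {name: count / total for name, count in fooled.items()}

With 100 hard-level prompts and 13 generation settings acting as creators, this yields one fooling rate per setting.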

2. Todo List

  • Publish paper
  • Publish real & AI-generated ASMR dataset (hard)
  • Publish real & AI-generated ASMR dataset (easy)
  • Publish video understanding evaluation code
  • Publish video generation code
  • Adapt the dataset download format following @NielsRogge's issue

3. Dataset Introduction

  1. We release the real ASMR corpus, 149 videos in total (100 hard-level + 49 easy-level):
    • real videos (Real_ASMR/videos),
    • extracted images (Real_ASMR/pictures),
    • and prompts for the hard level (Real_ASMR_Prompt.csv: ref is the image path, text is the prompt).
  2. We release the AI-generated hard-level ASMR videos from 13 different video-generation settings, 100 × 13 clips in total:
    • OpenSoraV2 (i2v, t2v, it2v),
    • Wan2.2 (A14B-i2v, A14B-t2v, 5B-it2v),
    • Sora2 variants (i2v, t2v) (w/o, w/ watermark),
    • Veo3.1-fast (i2v),
    • Diffsynth-Studio Hunyuan (i2v, t2v) / StepFun (t2v).
  3. For each prompt, we therefore provide 1 + k clips (with k = 13 fakery families), enabling fine-grained studies of how creators vary while sharing identical textual grounding.

We host the dataset folders on Hugging Face, and both the folders and the compressed archive Video_Reality_Test.tar.gz on ModelScope. The layout below shows how the data is organized once Video_Reality_Test.tar.gz is unpacked.
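
If you prefer to fetch the Hugging Face folders programmatically rather than through the web UI, a snapshot download along the following lines should work; the repo_id below is a placeholder, so substitute the actual dataset ID from the Hugging Face link.

    # Sketch: pull the dataset folders from Hugging Face (repo_id is a placeholder).
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="video-reality-test/Video_Reality_Test",  # placeholder dataset ID
        repo_type="dataset",
        local_dir="data/Video_Reality_Test",
        allow_patterns=None,  # e.g. ["Wan2.2/*"] to fetch a single generator folder
    )
    print(f"Dataset downloaded to {local_dir}")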

Layout

  • Video_Reality_Test.tar.gz — monolithic archive containing every real video, generated video, and metadata file. Use tar -xzf Video_Reality_Test.tar.gz to recreate the folder layout described below.
  • Folder layout (already unpacked on the ModelScope repo) mirrors the archive so you can rsync individual generators without downloading the full tarball.
Video_Reality_Test/
├── HunyuanVideo/            # Diffsynth-Studio → Hunyuan generations
├── OpensoraV2/              # OpenSora V2 baselines
├── Real_ASMR/               # real ASMR hard-level reference videos (+optional keyframes)
│   ├── videos/
│   └── pictures/
├── Real_ASMR_Prompt.csv     # prompt sheet for hard level; ref=video filename, text=description
├── Real_ASMR_easy/          # real ASMR easy-level reference videos (+optional keyframes)
│   ├── videos/
│   ├── pictures/
│   └── prompt.csv           # prompt sheet for easy level; ref=video filename, text=description
├── Fake_ASMR_easy/          # AI-generated ASMR easy-level videos
│   ├── opensora/            # OpenSora image-to-video outputs
│   ├── opensora_woprompt/   # OpenSora image-to-video outputs without prompt
│   ├── wan/                 # Wan image-to-video outputs
│   ├── wan_woprompt/        # Wan image-to-video outputs without prompt
│   └── prompt.json          # prompt sheet
├── Sora2-it2v/              # Sora2 image-to-video outputs
├── Sora2-it2v-wo-watermark/ # watermark-free variant of the above
├── Sora2-t2v/               # Sora2 text-to-video runs
├── StepVideo-t2v/           # Diffsynth-Studio → StepFun generations
├── Veo3.1-fast/             # Veo 3.1 fast generations
├── Wan2.2/                  # Wan 2.2 outputs
└── ...

Every generator-specific directory contains clips named after their prompt IDs, so you can align them with Real_ASMR_Prompt.csv for the hard level and prompt.csv for the easy level.
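
As an illustration, the sketch below pairs each hard-level prompt with a clip from one generator folder; it assumes the clip filename shares the stem of the ref entry in Real_ASMR_Prompt.csv (check this against your unpacked data), and Wan2.2 is used only as an example folder.

    # Sketch: align hard-level prompts with generated clips by prompt ID.
    # Assumes clip filenames share the stem of the `ref` column (an assumption, verify locally).
    import csv
    from pathlib import Path

    root = Path("Video_Reality_Test")
    generator_dir = root / "Wan2.2"              # any generator folder works the same way

    with open(root / "Real_ASMR_Prompt.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prompt_id = Path(row["ref"]).stem    # prompt ID shared across folders
            clips = sorted(generator_dir.glob(f"{prompt_id}*.mp4"))
            if clips:
                print(prompt_id, "->", clips[0].name, "|", row["text"][:60])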

4. Generation Setup

Unless otherwise noted, we kept the native sampler settings of each platform so downstream evaluators see the exact outputs human raters inspected.

5. Run the Evaluation Code

  1. Clone only the evaluation code:

    git clone https://github.com/video-reality-test/video-reality-test.git

    Or clone the evaluation code together with the video-generation submodules:

    git clone --recurse-submodules https://github.com/video-reality-test/video-reality-test.git

    Note: If you have already cloned the evaluation code without submodules, run git submodule update --init --recursive to fetch them.

  2. Install dependencies:

    conda create -n vrt python=3.10 -y
    conda activate vrt
    pip install -r requirements.txt
  3. Download a dataset split (choose one link at the top) and extract it under data/. Update the data_path in eval_judgement.py and eval_judgement_audio.py so the scripts can locate the unpacked files.

  4. Open eval_judgement.py and eval_judgement_audio.py and set the required API key/token and model_name placeholders at the top of each file to match the provider you are evaluating. Without this step the scripts will exit immediately:

    api_key = "your_api_key_here"
    model_name = "gemini-2.5-flash"

    Additionally, set your evaluation dataset path ({/path/to/judgement/dataset/}/xxx.mp4) and your results save path as follows:

    # save results path
    save_path_root = f"save/path/root/{model_name}/"
    # test data path
    data_path = "/path/to/judgement/dataset/"
  5. Launch the evaluators:

    # video reality test for visual only
    python eval_judgement.py 
    
    # video reality test for visual+audio
    # NOTE: multi-modal (image+text+audio) inputs currently only work with Gemini 2.5 Pro or Gemini 2.5 Flash APIs.
    python eval_judgement_audio.py 
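
    For reference, a single visual-only judgment call with the Gemini API might look roughly like the sketch below. This is not the repository's eval_judgement.py, just a minimal illustration with the google-generativeai client; the prompt wording and file path are placeholders.

    # Minimal sketch of one reality-judgment call (illustrative, not the repo's script).
    import time
    import google.generativeai as genai

    genai.configure(api_key="your_api_key_here")
    model = genai.GenerativeModel("gemini-2.5-flash")

    # Upload the clip and wait until the File API has finished processing it.
    video_file = genai.upload_file(path="/path/to/judgement/dataset/xxx.mp4")  # placeholder path
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)

    response = model.generate_content(
        [video_file, "Is this ASMR video real or AI-generated? Answer 'real' or 'fake'."]
    )
    print(response.text)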

In the video understanding peer-review results, gemini-3-preview is the best reviewer model (for details, refer to our paper).

6. Citation

Please cite the Video Reality Test paper when using this benchmark:

@misc{wang2025videorealitytestaigenerated,
      title={Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?}, 
      author={Jiaqi Wang and Weijia Wu and Yi Zhan and Rui Zhao and Ming Hu and James Cheng and Wei Liu and Philip Torr and Kevin Qinghong Lin},
      year={2025},
      eprint={2512.13281},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.13281}, 
}
