CRAFT: Video Diffusion for Bimanual Robot Data Generation

International Conference on Intelligent Robots and Systems (IROS)

Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita

University of Southern California

arXiv

Paper · Code (Coming Soon)

Method Videos

A quick tour of CRAFT generations across six augmentation axes.

Object pose · Stack bowls

Lighting · Stack bowls

Cross-embodiment · Stack bowls

Object pose · Lift roller

Object pose · Place cans

Canny-edge conditioning

Beyond Franka · Background

Background · Stack bowls

Object color · Stack bowls

Wrist + 3rd · Stack bowls

Object color · Lift roller

Background · Place cans

Object distractors

Different reference image

TL;DR

CRAFT turns just a couple of real robot demonstrations into a large, visually diverse training set by generating photorealistic robot videos with a video diffusion model.

The problem

Real robot data is expensive to collect and lacks visual variation, so policies struggle under new lighting, backgrounds, object positions, or camera views.

The idea

Guide a video diffusion model with Canny edges from simulation to synthesize new, action-labeled demonstrations across six augmentation axes.

Method

Figure 2: CRAFT pipeline. (1) Trajectory Expansion via Real2Sim digital twin. (2) Video Generation with Canny-edge conditioning. (3) Augmented Dataset Construction across six axes. (4) Generated Dataset for policy training.

Trajectory Expansion

Real-world teleoperation data is collected, and a digital twin pipeline transfers the objects and the robot into simulation (Real2Sim) for large-scale data generation.

Video Generation

Simulation trajectories are rendered into source videos and converted into Canny-edge controls, then combined with a reference image + language instruction to condition a video diffusion model.
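As a rough illustration of the edge-control step, the sketch below turns a grayscale frame into a binary edge map. It is a simplified stand-in (Sobel gradient magnitude plus a single threshold), not the full Canny detector, which also applies smoothing, non-maximum suppression, and hysteresis thresholding. The function name `edge_map`, the list-of-rows frame format, and the threshold value are assumptions made for illustration and are not taken from CRAFT.

```python
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def edge_map(frame: Frame, threshold: float = 100.0) -> List[List[int]]:
    """Return a binary edge map (1 = edge) from Sobel gradient magnitude.

    Simplified stand-in for Canny: no smoothing, non-maximum suppression,
    or hysteresis. Border pixels are left as 0.
    """
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal and vertical Sobel responses.
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


# Toy frame: dark left half, bright right half, so one vertical edge.
frame = [[0.0] * 4 + [255.0] * 4 for _ in range(6)]
edges = edge_map(frame)
```

Applied to every frame of a rendered simulation video, a function like this would yield the structure-only control signal; a production pipeline would more likely call an existing Canny implementation such as OpenCV's `cv2.Canny`.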

Augmented Dataset Construction

Generated videos cover six variation axes: object pose, lighting, object color, background, cross-embodiment, and wrist + third-person views.

Generated Dataset

Synthesized videos are paired with action labels from simulation trajectories to produce action-consistent demonstrations \(\mathcal{D}^{\text{gen}}\) for downstream ACT policy training.
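The pairing step can be pictured with the minimal sketch below. It assumes that each generated video keeps the timing of its source simulation rollout, so that frame t lines up with action t; that alignment, along with the `Demo` class, the `pair_video_with_actions` helper, and the placeholder file names, is an illustrative assumption and not CRAFT's actual data format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demo:
    """One generated demonstration: frames plus per-step actions."""
    frames: List[str]            # placeholder frame identifiers
    actions: List[List[float]]   # one action vector per frame
    instruction: str


def pair_video_with_actions(frames: Sequence[str],
                            sim_actions: Sequence[Sequence[float]],
                            instruction: str) -> Demo:
    """Attach simulation actions to a generated video, frame by frame.

    Assumes the generated video keeps the timing of the source rollout,
    so frame t corresponds to action t.
    """
    if len(frames) != len(sim_actions):
        raise ValueError("video and trajectory must have the same length")
    return Demo(list(frames), [list(a) for a in sim_actions], instruction)


# Hypothetical 3-step rollout with 2-D actions.
demo = pair_video_with_actions(
    ["f0.png", "f1.png", "f2.png"],
    [[0.0, 0.1], [0.0, 0.2], [0.1, 0.2]],
    "stack the two bowls",
)
```

The length check is the important part of the sketch: a generated clip whose frame count differs from its trajectory cannot be labeled consistently and should be rejected or resampled.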

Why Canny-Edges?

Raw Simulation Images

Retain too much low-level detail, causing the diffusion model to struggle with salient structural features such as gripper-object contact.

Canny-Edge Representations

Discard irrelevant details while preserving robot arm and object structure, giving clear guidance and allowing free variation of backgrounds, object colors, and lighting through prompting.

Stack Two Bowls: Canny-edge conditioned generation

The video above was produced by the video generation model using the language instruction below (together with Canny-edge control).

Experiment Setup

Video Generation

Generate photorealistic, action-consistent robot videos from simulator rollouts using a pre-trained diffusion model.

Wan2.1-Fun-Control 1.3B · Canny-edge control · 512×512 input

Policy Training & Evaluation

Train ACT on real + generated demonstrations and evaluate robustness under controlled distribution shifts.

ACT Policy · RoboTwin Benchmark · Success rate (%)

Core setup

Model: Wan2.1-Fun-Control 1.3B
Input: 512×512
Policy: ACT
Benchmark: RoboTwin

Tasks

Lift Roller

Coordinated bimanual task where both arms simultaneously grasp and lift.

Place Cans in Plasticbox

Parallel task where both arms independently pick up cans and place them into a container.

Stack Two Bowls

Sequential task where two bowls must be stacked on top of each other in order.

Real-World Augmentation Results

Success rates (%). Each method is evaluated under test conditions that vary only along the named dimension. CRAFT (Ours) uses 1000 generated demos plus the demos collected in the real world. Cross-Embodiment: xArm7 → Franka Panda transfer.

Result panels: Lighting · Background · Camera View · Object Color · Wrist + 3rd Person · Cross-Embodiment

Real-World Video Rollouts

Policy rollouts on physical hardware for each augmentation type. Policies are trained with CRAFT-generated data and evaluated under the corresponding test condition.

Stack Two Bowls

Lift Roller

Place Cans In Plasticbox

Stack Two Bowls

Lift Roller

Place Cans In Plasticbox

Stack Two Bowls

Lift Roller

Place Cans In Plasticbox

Cross-Embodiment: xArm7 demos transferred to Franka

Stack Two Bowls

Lift Roller

Place Cans In Plasticbox

Wrist + 3rd Person: Multi-view policy execution

Stack Two Bowls

Lift Roller

Place Cans In Plasticbox

Camera View: Alternate camera viewpoints

Stack Two Bowls

Lift Roller

Place Cans In Plasticbox

Rollouts shown for three tasks (Stack Two Bowls, Lift Roller, Place Cans In Plasticbox) under each real-world augmentation setting.

Object Pose Generation

For each trajectory, the simulator applies random translations and rotations to the target object's pose, sampled from a uniform distribution over the physically feasible workspace.
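A minimal sketch of this kind of sampling is shown below, assuming a planar tabletop pose (x, y, yaw) and axis-aligned bounds standing in for the physically feasible workspace. The `PlanarPose` type, the `sample_object_pose` helper, and the numeric bounds are hypothetical; a real setup would also check reachability and collisions, which this sketch does not model.

```python
import math
import random
from typing import NamedTuple, Tuple


class PlanarPose(NamedTuple):
    x: float    # metres
    y: float    # metres
    yaw: float  # radians


def sample_object_pose(rng: random.Random,
                       x_range: Tuple[float, float],
                       y_range: Tuple[float, float],
                       yaw_range: Tuple[float, float] = (-math.pi, math.pi)) -> PlanarPose:
    """Draw a tabletop pose uniformly from axis-aligned bounds."""
    return PlanarPose(rng.uniform(*x_range),
                      rng.uniform(*y_range),
                      rng.uniform(*yaw_range))


rng = random.Random(0)  # seeded for reproducibility
# Hypothetical workspace bounds; real limits depend on the robot and task.
poses = [sample_object_pose(rng, (0.3, 0.6), (-0.2, 0.2)) for _ in range(5)]
```

Each sampled pose would seed one simulated trajectory, so the number of distinct object placements grows with the number of rollouts and not with the number of real demonstrations.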

Sample 1

Sample 2

Sample 1

Sample 2

Sample 1

Sample 2

Lighting Generation

We generate diverse lighting conditions by prompting Veo3 to synthesize variants of the reference image under different ambient illumination (e.g., blue or green lighting). Unlike simple color jitter, this preserves scene properties like shadows and surface reflections.

Sample 1

Sample 2

Sample 1

Sample 2

Sample 1

Sample 2

Object Color Generation

To generate diverse object colors, the model conditions on a reference image of the empty table scene, allowing the language instruction to freely specify the desired color while Canny-edge control provides object contours and location.

Sample 1

Sample 2

Sample 1

Sample 2

Sample 1

Sample 2

Background Generation

To generate diverse backgrounds, we omit the reference image from the video diffusion model's inputs, because conditioning on it would anchor the generated scene to the original environment. Instead, we modify the language instruction to describe the desired background.
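The reference-image choices described for the object color and background axes can be summarized in a small sketch. Everything here is illustrative: the `Conditioning` container, the `build_conditioning` function, the axis names, and the file names are assumptions, and the default branch (use the original scene reference) is a guess at how the remaining axes might be handled, not a documented CRAFT interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conditioning:
    edge_video: str                  # path to the Canny-edge control video
    instruction: str                 # language prompt
    reference_image: Optional[str]   # None means no reference image


def build_conditioning(axis: str, edge_video: str, base_instruction: str,
                       variation: str, scene_reference: str,
                       empty_table_reference: str) -> Conditioning:
    """Pick conditioning inputs for one augmentation axis (illustrative)."""
    prompt = f"{base_instruction} {variation}".strip()
    if axis == "background":
        # No reference image, so the prompt alone decides the scene.
        return Conditioning(edge_video, prompt, None)
    if axis == "object_color":
        # Empty-table reference leaves object appearance to the prompt.
        return Conditioning(edge_video, prompt, empty_table_reference)
    # Assumed default: condition on the original scene reference.
    return Conditioning(edge_video, prompt, scene_reference)


bg = build_conditioning("background", "edges.mp4", "Two robot arms stack bowls.",
                        "The table is on a beach.", "scene.png", "empty_table.png")
color = build_conditioning("object_color", "edges.mp4", "Two robot arms stack bowls.",
                           "The bowls are red.", "scene.png", "empty_table.png")
```

In both cases the edge video is unchanged, which is what keeps the robot motion and object geometry fixed while the prompt varies the appearance.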

Sample 1

Sample 2

Sample 1

Sample 2

Sample 1

Sample 2

Cross-Embodiment Generation

We enable cross-embodiment transfer by retargeting source-robot demonstrations to a target robot using forward and inverse kinematics, mapping end-effector poses to new joint configurations while preserving gripper actions. In our setup the source robot is the xArm7 and the target is the Franka Panda; we generate photorealistic videos for the target robot only, so xArm7 source demonstrations are not shown here. We plan to add videos of the real-world xArm7 demonstrations in a future update.
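The retargeting idea (forward kinematics on the source robot, inverse kinematics on the target, gripper command copied through) can be sketched on a toy planar two-link arm. This is only an illustration under strong assumptions: real xArm7 and Franka Panda arms have seven joints and full 6-DoF end-effector poses, and the link lengths, function names, and single elbow branch below are hypothetical.

```python
import math
from typing import Tuple


def fk(q1: float, q2: float, l1: float, l2: float) -> Tuple[float, float]:
    """Forward kinematics of a planar 2-link arm: joint angles -> (x, y)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def ik(x: float, y: float, l1: float, l2: float) -> Tuple[float, float]:
    """Closed-form inverse kinematics (one elbow branch, q2 >= 0)."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target is outside the arm's reach")
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def retarget(step, src_links, tgt_links):
    """Map one (q1, q2, gripper) step from the source arm to the target arm."""
    q1, q2, gripper = step
    x, y = fk(q1, q2, *src_links)   # source joints -> end-effector position
    t1, t2 = ik(x, y, *tgt_links)   # end-effector position -> target joints
    return t1, t2, gripper          # gripper command is copied unchanged


# Hypothetical link lengths (metres) for the two toy arms.
src_links, tgt_links = (0.30, 0.25), (0.35, 0.30)
source_step = (0.4, 0.9, 1.0)
target_step = retarget(source_step, src_links, tgt_links)
```

The target joint angles differ from the source ones, but running `fk` on them with the target link lengths reproduces the same end-effector position, which is the property retargeting relies on.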

Sample 1

Sample 2

Sample 1

Sample 2

Sample 1

Sample 2

Wrist + 3rd Person View Generation

We tile the left wrist camera, right wrist camera, and third-person (external) camera into a single image. Tiling ensures spatial consistency across all viewpoints, enabling multi-view policy training without collecting real wrist-camera data.
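A minimal sketch of the tiling step follows, assuming the three views share the same resolution and are placed side by side in a horizontal strip. The actual tile layout is not specified above, so the left-to-right order, the `tile_views` name, and the nested-list image format are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; stands in for a real image array


def tile_views(left_wrist: Image, right_wrist: Image, third_person: Image) -> Image:
    """Tile three equally sized views side by side into a single image."""
    views = [left_wrist, right_wrist, third_person]
    h, w = len(third_person), len(third_person[0])
    if any(len(v) != h or any(len(row) != w for row in v) for v in views):
        raise ValueError("all views must share the same height and width")
    # Concatenate row r of each view to form row r of the tiled image.
    return [left_wrist[r] + right_wrist[r] + third_person[r] for r in range(h)]


# Toy 2x2 views filled with a constant so each tile is easy to spot.
left = [[1, 1], [1, 1]]
right = [[2, 2], [2, 2]]
third = [[3, 3], [3, 3]]
tiled = tile_views(left, right, third)
```

Because the tiled image is generated in one pass, the three views come from the same sample; splitting the output back into per-camera crops only requires slicing at the known tile boundaries.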

Stress Testing Video Generation Model

Additional stress tests of our video generation model: generation beyond the Franka platform, with object distractors, and with different reference images.

Generation Beyond Franka

Examples of video generation beyond the default Franka setup: different robot arms (e.g. single-arm xArm7), backgrounds, and object appearances, alongside the original generation for comparison.

Single arm xArm7 with ocean background.

Generation with a pink object.

Original generation.

Object Distractor

The model can also generate random distractor objects on the table surface.

Object distractor example

Different Reference Image

The reference image does not have to show a black curtain; any scene can be used. Here we removed the black curtain and show the resulting generation. Below: the reference image given to the model and the generated video.

Reference image (input to model)

Generated video

BibTeX

@inproceedings{chen2026craft,
  title={{CRAFT: Video Diffusion for Bimanual Robot Data Generation}},
  author={Jason Chen and I-Chun Arthur Liu and Gaurav Sukhatme and Daniel Seita},
  booktitle={International Conference on Intelligent Robots and Systems (IROS)},
  year={2026}
}
