SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

A self-consistent video generation recipe that adapts pre-trained robot video foundation models into faithful closed-loop policy evaluators.

Wei-Cheng Tseng (1,2,3), Gashon Hussein (4), Yuzhu Dong (3), Allen Z. Ren (4), Lucy X. Shi (4,5), XuDong Wang (4), Sergey Levine (4,6), Zhaoshuo Li (3), Jinwei Gu (3), Florian Shkurti (1,2,7), Ming-Yu Liu (3), Quan Vuong (4)

(1) University of Toronto, (2) Vector Institute, (3) NVIDIA, (4) Physical Intelligence, (5) Stanford University, (6) UC Berkeley, (7) Allen Institute for AI

Self-Consistent Video Generation

SC3-Eval enforces forward-inverse dynamics consistency, multi-view consistency, and test-time consistency to keep imagined policy rollouts faithful over long horizons.

Figure: The three consistency axes of SC3-Eval. Forward-inverse dynamics consistency reduces rollout drift, multi-view consistency keeps camera views coherent, and test-time consistency terminates off-manifold generations.

Abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts, but using them as faithful evaluators raises three challenges: autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency: forward-inverse dynamics consistency, cross-view consistency, and test-time consistency. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and an MMRV of 0.119, outperforming prior video-model-based baselines and generalizing to new tasks.

Policy Evaluation Results

SC3-Eval tracks real-world policy success rates across in-distribution and out-of-distribution table bussing tasks, including fine-grained success criteria.

Evaluation Setup

We evaluate table bussing with synchronized external and wrist camera views, and score each trajectory using three criteria: language following, object lifting, and object placing.

Figure: Real-world table bussing setup. (a) Real-world experiment setup and table bussing task variants (forward and reverse). (b) Three synchronized camera views observed by the policy and world model. (c) Example frames for the three success criteria: language following, object lifting, and object placing.
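As a minimal sketch of how per-trajectory labels on the three criteria could be aggregated into per-criterion success rates, consider the snippet below. The `TrajectoryScore` class, the `success_rates` helper, and the example labels are all hypothetical names and values introduced for illustration; the page does not describe the actual scoring pipeline.

```python
from dataclasses import dataclass

# The three criteria named in the evaluation setup.
CRITERIA = ("language_following", "object_lifting", "object_placing")


@dataclass(frozen=True)
class TrajectoryScore:
    """Hypothetical container: one pass/fail label per criterion."""
    language_following: bool
    object_lifting: bool
    object_placing: bool


def success_rates(scores):
    """Return the fraction of trajectories that pass each criterion."""
    if not scores:
        raise ValueError("need at least one scored trajectory")
    n = len(scores)
    return {c: sum(getattr(s, c) for s in scores) / n for c in CRITERIA}


# Illustrative labels only, not results from the paper.
example_scores = [
    TrajectoryScore(True, True, False),
    TrajectoryScore(True, False, False),
    TrajectoryScore(True, True, True),
    TrajectoryScore(True, True, True),
]
example_rates = success_rates(example_scores)
```

Keeping the criteria separate, rather than collapsing them into one success flag, is what would allow a predicted rate to be compared against a real-world rate for each criterion individually.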

Correlation With Real-World Performance

The correlation plots compare predicted success rates against real-world success rates. The left two panels evaluate offline open-loop rollouts, where the world model is conditioned on the real-world action sequence. The right two panels evaluate online closed-loop rollouts, where the policy acts on generated frames. Each mode is split into the in-distribution table bussing task and an out-of-distribution reverse table bussing task, which keeps the same scene and motion primitives but swaps object destinations.
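The difference between the two rollout modes can be sketched in a few lines. This is a toy illustration under stated assumptions: frames and actions are plain floats, and `toy_world_model` and `toy_policy` are hypothetical stand-ins, not the SC3-Eval model or any evaluated policy.

```python
def open_loop_rollout(world_model, first_frame, recorded_actions):
    """Offline mode: the world model is driven by a recorded real-world action sequence."""
    frames = [first_frame]
    for action in recorded_actions:
        frames.append(world_model(frames[-1], action))
    return frames


def closed_loop_rollout(world_model, policy, first_frame, horizon):
    """Online mode: the policy picks each action from the latest generated frame."""
    frames = [first_frame]
    for _ in range(horizon):
        action = policy(frames[-1])
        frames.append(world_model(frames[-1], action))
    return frames


# Hypothetical toy dynamics: a "frame" is a scalar state, an action is an increment.
def toy_world_model(frame, action):
    return frame + action


def toy_policy(frame):
    # Move toward 3.0, then stop.
    return 1.0 if frame < 3.0 else 0.0


offline_frames = open_loop_rollout(toy_world_model, 0.0, [1.0, 1.0, 1.0])
online_frames = closed_loop_rollout(toy_world_model, toy_policy, 0.0, 5)
```

In the closed-loop case every generated frame feeds back into the policy, so any error in a generated frame can change later actions; in the open-loop case the action sequence is fixed in advance.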

Figure: Policy evaluation correlation between predicted and ground-truth success rates. Each point is a policy-checkpoint and success-criterion pair. SC3-Eval maintains strong agreement with real-world policy performance across in-distribution and out-of-distribution tasks, and across offline and online evaluation.
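For readers who want to compute agreement metrics of this kind on their own data, here is a minimal sketch of Pearson correlation and MMRV. The page does not define MMRV; the code assumes it denotes Mean Maximum Rank Violation as used in prior simulated-evaluation work, where a pair of points counts as a violation when the evaluator orders them differently from the real world. The function names and toy numbers are illustrative, not values from the paper.

```python
import math


def pearson(real, predicted):
    """Pearson correlation between real and predicted success rates."""
    n = len(real)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_r = sum(real) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(real, predicted))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_r == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_r * var_p)


def mmrv(real, predicted):
    """Assumed definition: for each point, take the largest real-world gap
    to any point whose predicted ordering is flipped, then average."""
    n = len(real)
    if n != len(predicted) or n == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            flipped = (predicted[i] < predicted[j]) != (real[i] < real[j])
            if flipped:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Toy numbers for illustration only.
toy_real = [0.2, 0.5, 0.9]
toy_pred_ordered = [0.3, 0.4, 0.8]   # same ranking as toy_real
toy_pred_swapped = [0.5, 0.3, 0.8]   # first two points ranked the wrong way
```

Under this definition, higher Pearson correlation and lower MMRV both indicate better agreement: Pearson rewards a linear relationship, while MMRV penalizes ranking mistakes in proportion to how far apart the misordered points are in the real world.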

Real-world rollout | Offline world model rollout | Online world model rollout

Pick up white bowl and put it in the blue trash can

Rollout | Language following | Object lifting | Object placing
GT      | ✓                  | ✓              | ✗
Offline | ✓                  | ✓              | ✗
Online  | ✓                  | ✓              | ✗

Pick up spoon and put it in the blue trash can

Rollout | Language following | Object lifting | Object placing
GT      | ✓                  | ✗              | ✗
Offline | ✓                  | ✗              | ✗
Online  | ✓                  | ✗              | ✗

Pick up plastic container and throw it away

Rollout | Language following | Object lifting | Object placing
GT      | ✓                  | ✓              | ✓
Offline | ✓                  | ✓              | ✓
Online  | ✓                  | ✓              | ✓

Pick up foil tray and throw it away

Rollout | Language following | Object lifting | Object placing
GT      | ✓                  | ✓              | ✓
Offline | ✓                  | ✓              | ✓
Online  | ✓                  | ✓              | ✓

Ablation Study

Cross-view consistency and inverse dynamics both contribute to evaluator fidelity.

Cross-View Consistency

We compare online world model rollouts with and without the cross-view inpainting mode. When the robot moves out of the workspace and then returns, cross-view consistency helps the model recover the workspace scene more accurately in the wrist camera view.

w/o cross-view inpainting
w/ cross-view inpainting

Inverse Dynamics

We compare offline world model rollouts with and without inverse dynamics (ID) joint training. ID joint training prevents rollouts from drifting away from the ground-truth rollout.

GT | w/o ID joint training | w/ ID joint training

Uncertainty Estimation

Inverse-dynamics uncertainty acts as a test-time consistency signal for terminating unreliable imagined rollouts.
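One simple way such a termination rule could look is sketched below. This is an assumption-laden illustration: the page does not state how the inverse-dynamics uncertainty is computed or thresholded, so the code takes a precomputed per-step scalar uncertainty as input, and `termination_step`, `threshold`, and `patience` are hypothetical names.

```python
def termination_step(uncertainties, threshold, patience=1):
    """Return the index of the first step at which uncertainty has stayed
    above `threshold` for `patience` consecutive steps, or None if the
    rollout never crosses that bar."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    streak = 0
    for step, value in enumerate(uncertainties):
        # Count consecutive high-uncertainty steps; reset on any low step.
        streak = streak + 1 if value > threshold else 0
        if streak >= patience:
            return step
    return None


# Toy per-step uncertainty trace, for illustration only.
toy_trace = [0.1, 0.2, 0.9, 0.3, 0.8, 0.85, 0.2]
stop_immediate = termination_step(toy_trace, threshold=0.5, patience=1)
stop_patient = termination_step(toy_trace, threshold=0.5, patience=2)
```

The `patience` parameter in this sketch trades off sensitivity against robustness: a value of 1 stops at the first spike, while larger values ignore isolated spikes and stop only on sustained uncertainty.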

Prompt: pick up blue bowl and put it in bin

[Uncertainty visualization for this prompt]

Prompt: pick up chip bag and throw it away

[Uncertainty visualization for this prompt]