Object-centric 3D Motion Field for Robot Learning from Human Videos
Zhao-Heng Yin, Sherry Yang, and Pieter Abbeel
UC Berkeley EECS and Google DeepMind
See Motion. Learn Motion.
Motion is a fundamental element of control. We present a learning framework that extracts object motion knowledge (a representation) from human videos and uses it to teach robots to perform purposeful tasks, without requiring any robot demonstrations.

We use an image-shaped, dense 3D motion field to represent object movement. Unlike sparse keypoints, it is informative and preserves the full motion; unlike point-cloud flow, it is structured and ready for generative modeling; and unlike 3D pose, it is general and avoids object models.
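Concretely, the representation is just an image-shaped array of per-pixel 3D displacements. Below is a minimal sketch; the array layout and mask convention are illustrative assumptions, not the paper's exact format:

```python
import numpy as np

H, W = 240, 320                        # image resolution
motion_field = np.zeros((H, W, 3))     # per-pixel 3D displacement (dX, dY, dZ), in meters
object_mask = np.zeros((H, W), bool)   # pixels belonging to the manipulated object

# Because it lives on the image grid, the field can be consumed by any image
# backbone and modeled with standard image generative models.
```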
See the Order in Chaos (Phase I)
Sensors are noisy, but we uncover order within the chaos. By leveraging simulation, we build a motion denoiser that reveals the smooth motion hidden in noisy pixel flow and depth.

The Data Generation Process

Infinite Variations of a Cube
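The exact simulation pipeline is described in the paper; the sketch below only illustrates the general recipe for producing (noisy input, clean target) pairs from randomized rigid motions of a cube. All function names, noise models, and magnitudes here are hypothetical:

```python
import numpy as np

def random_se3(rot_scale=0.1, trans_scale=0.02, rng=np.random):
    """Sample a small random rigid motion (axis-angle rotation + translation)."""
    w = rng.normal(scale=rot_scale, size=3)
    theta = np.linalg.norm(w) + 1e-12
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues' formula
    t = rng.normal(scale=trans_scale, size=3)
    return R, t

def make_training_pair(points, fx, fy, cx, cy, depth_noise=0.01, flow_noise=1.0):
    """points: (N, 3) visible object surface points (e.g. a rendered cube) in camera frame."""
    R, t = random_se3()
    moved = points @ R.T + t
    clean_motion = moved - points                       # supervision target: clean 3D motion

    def project(P):
        return np.stack([fx * P[:, 0] / P[:, 2] + cx,
                         fy * P[:, 1] / P[:, 2] + cy], axis=-1)

    # Corrupt the observable quantities the way a depth camera / flow estimator would.
    flow = project(moved) - project(points)
    noisy_flow = flow + np.random.normal(scale=flow_noise, size=flow.shape)
    noisy_depth = points[:, 2] + np.random.normal(scale=depth_noise, size=len(points))
    noisy_dz = clean_motion[:, 2] + np.random.normal(scale=depth_noise, size=len(points))
    return (noisy_flow, noisy_depth, noisy_dz), clean_motion
```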
Model Architecture
A plain and minimal model is sufficient.
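As a rough illustration of what "plain and minimal" can mean, here is a sketch of a small convolutional denoiser; the channel layout (2 flow + 1 depth + 1 depth change + 4 intrinsics map) and the backbone are our assumptions, not the paper's exact architecture:

```python
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Noisy flow/depth (plus intrinsics map) in, clean 3D motion field out."""
    def __init__(self, in_ch=8, hidden=64):
        # in_ch: 2 (pixel flow) + 1 (depth) + 1 (depth change) + 4 (intrinsics map)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),      # (dX, dY, dZ) per pixel
        )

    def forward(self, x):                             # x: (B, in_ch, H, W)
        return self.net(x)
```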

Key Design: Intrinsics Map
Recall the relation between the ideal pixel position \((x,y)\) and the 3D position \((X,Y,Z)\):
\( X = (x-c_x)Z/f_x, \)
\( dX = \frac{x-c_x}{f_x}\, dZ + \frac{Z}{f_x}\, dx. \)
\(dX\) is our target, the 3D motion along the X direction. The network only receives \(dZ\), \(dx\), and \(Z\) as input, so it also needs \((x-c_x)/f_x\) and \(1/f_x\). Together, these per-pixel quantities form the intrinsics map. Do not forget \(1/f_x\), the inverse focal length.
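In code, the intrinsics map is simply a stack of per-pixel constants, and the identity above recovers \(dX\) from the observed quantities. A minimal sketch (variable names are ours):

```python
import numpy as np

def intrinsics_map(H, W, fx, fy, cx, cy):
    """Per-pixel channels ((x-cx)/fx, (y-cy)/fy, 1/fx, 1/fy), stacked as an (H, W, 4) image."""
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    return np.stack([(x - cx) / fx, (y - cy) / fy,
                     np.full((H, W), 1.0 / fx), np.full((H, W), 1.0 / fy)], axis=-1)

def motion_x(x, fx, cx, Z, dZ, dx):
    """dX = (x - cx)/fx * dZ + Z/fx * dx, applied per pixel (Y is analogous)."""
    return (x - cx) / fx * dZ + Z / fx * dx
```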
Master the Movement (Phase II)
We train a diffusion model to generate the object motion field observed in each human video. Assuming a firm grasp, the motion field can be translated directly into robot actions.
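One concrete way to perform this translation (a sketch under the firm-grasp assumption; the paper's exact controller may differ) is to fit a rigid transform to the generated motion field and command the gripper with the same transform:

```python
import numpy as np

def fit_rigid_transform(points, motion, mask):
    """Least-squares SE(3) fit to a per-pixel 3D motion field (Kabsch, no scale)."""
    P = points[mask]                  # (N, 3) object points before the motion
    Q = P + motion[mask]              # (N, 3) object points after the motion
    mu_p, mu_q = P.mean(0), Q.mean(0)
    H = (P - mu_p).T @ (Q - mu_q)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = mu_q - R @ mu_p
    return R, t                       # under a firm grasp, apply (R, t) to the gripper pose
```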
Results
We measure the motion reconstruction accuracy in the real world. Our method achieves lower SE(3) motion reconstruction error and produces a smoother motion representation for policy learning. See the paper for details.
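For reference only, one standard way to report SE(3) error against a ground-truth rigid motion is geodesic rotation error plus translation error; the paper defines the exact evaluation protocol:

```python
import numpy as np

def se3_error(R_est, t_est, R_gt, t_gt):
    """Rotation geodesic distance (radians) and translation error (same unit as t)."""
    cos = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos), np.linalg.norm(t_est - t_gt)
```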

Concluding Remark
To learn from video, we must first understand it. We take an initial step toward mining motion in videos for robotic control, demonstrating that video-based policy learning can also handle high-precision tasks. While challenges remain—such as occlusion, deformable objects, and complex interactions—this work marks a step forward. We're just getting started.
Acknowledgement
This work is part of the Google-BAIR Commons project. The authors gratefully acknowledge Google for providing computational resources. Zhao-Heng Yin is supported by the ONR MURI grant N00014-22-1-2773 at UC Berkeley. Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This research was conducted at UC Berkeley and is not affiliated with Amazon.