Object-centric 3D Motion Field
for Robot Learning from
Human Videos

Zhao-Heng Yin, Sherry Yang, and Pieter Abbeel
UC Berkeley EECS and Google DeepMind

See Motion. Learn Motion.

Motion is a fundamental element of control. We present a learning framework that extracts object motion knowledge (representation) from human videos and uses it to teach robots to perform purposeful tasks—all without requiring robot demonstrations.


We use an image-shaped, dense 3D motion field to represent object movement. Unlike sparse keypoints, it is informative and preserves the full motion; unlike point cloud flows, it is structured and ready for generative modeling; and unlike 3D pose, it is general and avoids object models.
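
As a rough illustration (not the released code), the representation is simply an image-shaped array: every pixel on the object stores the 3D displacement of the surface point it observes between two frames. The array shape and names below are ours.

```python
import numpy as np

# Illustrative sketch: an object-centric 3D motion field stored as an
# image-shaped array. Each object pixel carries the 3D displacement
# (dX, dY, dZ) of the surface point it observes between two frames.
H, W = 240, 320
motion_field = np.zeros((H, W, 3), dtype=np.float32)  # per-pixel 3D motion
object_mask = np.zeros((H, W), dtype=bool)            # pixels lying on the object

# Because the field is image-shaped, it drops into image generative models
# (e.g., diffusion), unlike unordered point-cloud flow; and because it is
# dense, it keeps the full motion, unlike a handful of sparse keypoints.
```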

See the Order in Chaos (Phase I)

Sensors are noisy, but we uncover the order within the chaos. By leveraging simulation, we build a motion denoiser that reveals the smooth motion hidden in noisy pixel flow and depth.


The Data Generation Process


Infinite Variations of a Cube
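
The paper describes the exact pipeline; the sketch below only conveys the idea under our own simplifying assumptions: sample a randomized cube, apply a random rigid motion, and corrupt the resulting ground-truth 3D motion with sensor-like noise to obtain (noisy, clean) training pairs for the denoiser.

```python
import numpy as np

# Hedged sketch of how (noisy, clean) motion pairs could be synthesized in
# simulation; the actual data generation process is described in the paper.
def random_se3(rng, max_rot=0.1, max_trans=0.02):
    """Small random rigid motion: axis-angle rotation (rad) + translation (m)."""
    return rng.uniform(-max_rot, max_rot, 3), rng.uniform(-max_trans, max_trans, 3)

def make_training_pair(points, rng, noise_std=0.01):
    """points: (N, 3) surface points of a randomized cube."""
    w, t = random_se3(rng)
    theta = np.linalg.norm(w) + 1e-8
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues
    clean = (points @ R.T + t) - points                      # ground-truth 3D motion
    noisy = clean + rng.normal(0.0, noise_std, clean.shape)  # sensor-like noise
    return noisy, clean

rng = np.random.default_rng(0)
cube = rng.uniform(-0.05, 0.05, size=(1024, 3))  # one of infinitely many cubes
noisy, clean = make_training_pair(cube, rng)
```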

Model Architecture

A plain and minimal model is sufficient.
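
The paper's figure specifies the exact architecture. Purely as a stand-in, a small convolutional network that maps the stacked per-pixel inputs (pixel flow, depth, depth change, and the intrinsics map introduced next) to a dense 3D motion field captures the "plain and minimal" spirit; every layer choice here is our assumption.

```python
import torch.nn as nn

# Illustrative stand-in only; see the paper's figure for the actual architecture.
# Per-pixel input channels: pixel flow (dx, dy), depth Z, depth change dZ,
# and the 4-channel intrinsics map described in the next section.
class MinimalMotionDenoiser(nn.Module):
    def __init__(self, in_ch=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),  # per-pixel (dX, dY, dZ)
        )

    def forward(self, x):  # x: (B, in_ch, H, W)
        return self.net(x)
```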


Key Design: Intrinsics Map

Recall the relation between the ideal pixel position \((x,y)\) and the 3D position \((X,Y,Z)\):

\( X = \frac{(x - c_x)\,Z}{f_x}, \)

\( dX = \frac{x - c_x}{f_x}\,dZ + \frac{Z}{f_x}\,dx. \)

\(dX\) is our target: the 3D motion along the X direction.

We only have \(dZ\), \(dx\), and \(Z\) as inputs,

and we also need \((x-c_x)/f_x\) and \(1/f_x\).

All of these per-pixel quantities form the intrinsics map.

Do not forget \(1/f_x\), the inverse focal length.
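
A minimal sketch of this computation for the X coordinate (the Y coordinate is symmetric with \(c_y, f_y\)); all names below are ours.

```python
import numpy as np

# Sketch of the relation above: recover dX from depth Z, depth change dZ,
# and pixel flow dx, using the per-pixel intrinsics map.
def intrinsics_map(H, W, fx, cx):
    x = np.tile(np.arange(W, dtype=np.float32), (H, 1))   # pixel x-coordinates
    a = (x - cx) / fx                                     # coefficient of dZ
    b = np.full((H, W), 1.0 / fx, dtype=np.float32)       # inverse focal length
    return a, b

def motion_x(Z, dZ, dx, fx, cx):
    a, b = intrinsics_map(*Z.shape, fx, cx)
    return a * dZ + b * Z * dx    # dX = (x - c_x)/f_x * dZ + (Z/f_x) * dx
```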

Master the Movement (Phase II)

We train a diffusion model to generate the object motion field observed in each human video. Assuming a firm grasp, the motion field can be translated directly into robot actions.
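
Under the firm-grasp assumption, the object and the gripper move rigidly together, so the rigid transform fitted to the generated motion field can be applied to the current end-effector pose. A minimal sketch using a standard Kabsch/SVD fit; function and variable names are ours.

```python
import numpy as np

# Hedged sketch: fit a rigid transform (R, t) to the generated 3D motion field
# and apply it to the end-effector pose (firm grasp: object and gripper move
# together). Standard Kabsch/SVD alignment of points p -> p + motion.
def fit_se3(points, motion):
    """points, motion: (N, 3) object points and their predicted 3D motion."""
    src, dst = points, points + motion
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def next_ee_pose(R, t, ee_R, ee_t):
    """Apply the fitted object motion to the current end-effector pose."""
    return R @ ee_R, R @ ee_t + t
```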

Results

We measure the motion reconstruction accuracy in the real world. Our method achieves lower SE(3) motion reconstruction error and produces a smoother motion representation for policy learning. See the paper for details.


Concluding Remark

To learn from video, we must first understand it. We take an initial step toward mining motion in videos for robotic control, demonstrating that video-based policy learning can also handle high-precision tasks. While challenges remain—such as occlusion, deformable objects, and complex interactions—this work marks a step forward. We're just getting started.

Acknowledgement

This work is part of the Google-BAIR Commons project. The authors gratefully acknowledge Google for providing computational resources. Zhao-Heng Yin is supported by the ONR MURI grant N00014-22-1-2773 at UC Berkeley. Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This research was conducted at UC Berkeley and is not affiliated with Amazon.