
DreamDojo: Scaling Robot World Models with 44,000+ Hours of Egocentric Human Video

TL;DR
DreamDojo pretrains a robot world model on 44,711 hours of egocentric human video using continuous latent actions as proxy supervision. After robot-specific post-training and autoregressive distillation, the model demonstrates improved physics realism, action controllability, and real-time rollout generation (~10.81 FPS).

World models — generative models that predict how an environment evolves under actions — are increasingly viewed as a core ingredient for scalable robotics. If a robot can reliably simulate what happens next, it can evaluate policies without expensive deployment, perform model-based planning, and even support interactive teleoperation. Yet robot world modeling faces a persistent bottleneck: high-quality robot interaction data is costly and limited in diversity, while robot action spaces are continuous, high-dimensional, and contact-rich.

DreamDojo proposes a data and learning strategy that shifts the scaling axis away from robot-only datasets:
– pretrain a foundation world model on 44,711 hours of egocentric human videos,
– then adapt (“post-train”) to target robot embodiments using comparatively small robot datasets.
The key technical challenge is that large-scale human video data are typically unlabeled with respect to robot actions; DreamDojo addresses this by introducing continuous latent actions as proxy controls.

What DreamDojo is trying to build: an interactive world model

An interactive world model predicts future states conditioned on actions, typically formalized as learning a transition distribution:

s_{t+1} ∼ p( · | s_t, a_t )

In DreamDojo, the “state” is represented as video frames (or latents of video frames). The model’s output is a plausible future video rollout that respects physical interactions and follows action inputs.
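The transition interface above can be sketched as a generic autoregressive rollout loop. This is a minimal illustration, not DreamDojo's actual API: `world_model` here is any callable mapping (state, action) to the next state, and the toy dynamics are a stand-in for the learned video model.

```python
import numpy as np

def rollout(world_model, s0, actions):
    """Autoregressively sample a trajectory s_{t+1} ~ p(. | s_t, a_t).

    `world_model` is any callable (s_t, a_t) -> s_{t+1}; here states are
    plain vectors, but the interface is agnostic to the representation
    (video frames, latents, etc.).
    """
    states = [s0]
    for a in actions:
        states.append(world_model(states[-1], a))
    return states

# Toy stand-in dynamics: the state drifts by the action plus small noise.
rng = np.random.default_rng(0)
toy_model = lambda s, a: s + a + 0.01 * rng.standard_normal(s.shape)

traj = rollout(toy_model, np.zeros(3), [np.ones(3)] * 5)
```

The same loop structure underlies policy evaluation and planning later in the post: only the fidelity of `world_model` changes.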

Prior video world models often achieve impressive open-ended prediction, but they can struggle with robotics requirements:
(i) continuous control rather than discrete actions, (ii) contact-rich manipulation, and (iii) robustness under out-of-distribution objects and environments.

The scaling move: pretraining on large-scale egocentric human video

DreamDojo’s core bet is that human video contains broad physical interaction coverage that is difficult to capture with robots at scale. While humans and robots differ in embodiment, the underlying physics — gravity, contact causality, object permanence, sliding, rolling, deformation — is shared. This motivates pretraining on a large mixture of egocentric human datasets, including:
– In-lab: controlled tabletop data collected with additional tracking gear to validate ideas and enable retargeting.
– EgoDex: a public egocentric dexterous manipulation dataset recorded with the Apple Vision Pro and high-precision hand-pose signals.
– DreamDojo-HV: a large in-house crowdsourced dataset covering diverse environments (household, industrial, retail, etc.) and a wide range of tasks.

The paper emphasizes that scale and diversity matter together: long-horizon tasks, many scenes, and many object categories increase the coverage of interaction dynamics and the stochasticity of outcomes.

The key technical problem: human videos do not come with robot action labels

World models for robotics are not just “future predictors” — they must learn action-conditioned causality. However, large-scale human video usually lacks structured action labels (joint angles, end-effector commands, torques). If you pretrain only by passive prediction (“action-free”), the model can learn some physics but often transfers less effectively to action-controllable robot simulation.

DreamDojo addresses this challenge by introducing continuous latent actions — a unified proxy action representation learned directly from videos in a self-supervised manner. Rather than relying on explicit action labels, the model infers a compact latent vector that captures the essential transformation between consecutive frames. In practice, this vector encodes what changed in the scene, while disentangling motion-related information from static visual context.

Latent actions (proxy supervision) — what does it mean?
A latent action model (implemented as a VAE with a spatiotemporal Transformer backbone) takes a pair of consecutive frames and produces a low-dimensional embedding â(t). This embedding is trained such that, combined with the earlier frame, it can reconstruct the later frame — forcing the embedding to capture the most salient motion and interaction information.
In plain terms: DreamDojo builds a “hidden action label” from pixels. That hidden label is then used like an action input when training the world model.
Why it helps: it injects causality into pretraining without requiring expensive action instrumentation at human-video scale.
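The encode–decode structure of a latent action model can be sketched with linear stand-ins. This is a deliberately simplified illustration of the self-supervised objective — the paper's model is a VAE with a spatiotemporal Transformer backbone, which is not reproduced here, and all names and dimensions below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 64, 8  # frame-feature dim, latent action dim (illustrative values)

# Hypothetical linear stand-ins for the encoder/decoder.
W_enc = rng.standard_normal((K, 2 * D)) / np.sqrt(2 * D)
W_dec = rng.standard_normal((D, D + K)) / np.sqrt(D + K)

def encode_latent_action(f_t, f_t1):
    """a_hat(t): a compact code of what changed between consecutive frames."""
    return W_enc @ np.concatenate([f_t, f_t1])

def decode_next_frame(f_t, a_hat):
    """Reconstruct the later frame from the earlier frame plus the code."""
    return W_dec @ np.concatenate([f_t, a_hat])

f_t, f_t1 = rng.standard_normal(D), rng.standard_normal(D)
a_hat = encode_latent_action(f_t, f_t1)
recon_loss = np.mean((decode_next_frame(f_t, a_hat) - f_t1) ** 2)
```

Minimizing `recon_loss` over many frame pairs is what forces `a_hat` to carry the salient motion information: the decoder already sees the earlier frame, so the latent must encode the change.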

DreamDojo’s training recipe: pretrain → post-train → distill

Phase A — Pretraining from human video with latent actions
DreamDojo starts from a pretrained video diffusion backbone (Cosmos-Predict2.5) and augments it with latent action conditioning. Rather than feeding the entire action sequence at once, actions are injected in short temporal chunks aligned with the model’s latent frame rate. This prevents the model from accidentally using future actions to predict past frames — a phenomenon known as action leakage. Chunked conditioning, therefore, enforces a cleaner cause-and-effect relationship between actions and predicted futures.
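The chunking idea can be illustrated with a small helper. This is a sketch of the alignment logic only (actual conditioning happens inside the diffusion backbone); the even-division assumption is for clarity:

```python
def chunk_actions(actions, num_latent_frames):
    """Split a full action trajectory into per-latent-frame chunks so that
    latent frame i is conditioned only on its own temporally aligned actions,
    never on future ones (avoiding action leakage). Assumes len(actions)
    divides evenly by num_latent_frames for simplicity.
    """
    chunk = len(actions) // num_latent_frames
    return [actions[i * chunk:(i + 1) * chunk]
            for i in range(num_latent_frames)]

chunks = chunk_actions(list(range(12)), num_latent_frames=4)
# chunks -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```

Each latent frame then attends only to its own chunk, so changing a future action cannot alter an already-generated past frame.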

Phase B — Post-training on target robots (embodiment adaptation)
After pretraining on human videos, the model is adapted to the target robot’s control space (for example, humanoid joint commands). To do this, DreamDojo reinitializes the action conditioning layer and finetunes the model using a comparatively small robot dataset. Because the pretraining phase has already instilled broad physical and interaction knowledge, this adaptation step requires far less robot data while still enabling strong generalization to new objects and environments.
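Conceptually, the adaptation step swaps one module while keeping the rest of the pretrained weights. The sketch below uses a plain dict of parameters with invented names (`backbone`, `action_proj`) — the paper's actual module layout is not specified here:

```python
import numpy as np

def prepare_for_posttraining(params, robot_action_dim, latent_dim, rng):
    """Replace the latent-action conditioning layer with a freshly
    initialized projection from the robot's real action space, keeping the
    pretrained backbone weights for finetuning. Names are illustrative.
    """
    new_params = dict(params)  # shallow copy: backbone weights are shared
    new_params["action_proj"] = 0.01 * rng.standard_normal(
        (latent_dim, robot_action_dim))
    return new_params

pretrained = {"backbone": np.ones((4, 4)), "action_proj": np.zeros((16, 8))}
adapted = prepare_for_posttraining(
    pretrained, robot_action_dim=32, latent_dim=16,
    rng=np.random.default_rng(0))
```

After this swap, finetuning on the small robot dataset only has to learn the mapping from real actions into an already action-aware dynamics model.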

Phase C — Distillation for real-time, autoregressive interaction
Diffusion-based video generation is powerful but often computationally expensive. To enable interactive use, DreamDojo introduces a distillation pipeline inspired by Self Forcing, where a high-quality “teacher” model is compressed into a faster autoregressive “student.” The student employs causal attention and requires far fewer denoising steps, significantly accelerating inference. The resulting model achieves near real-time performance (≈10.81 FPS versus ≈2.72 FPS for the teacher) while also improving stability over long prediction horizons.

Design choices aimed at continuous controllability

Beyond incorporating action conditioning, DreamDojo introduces several design choices aimed at improving action adherence and simulation fidelity under continuous robot control:

Relative actions: Actions are represented as short-horizon deltas with respect to a reference pose. This reduces variability in the action space and improves compositional generalization.
Chunked action injection: Instead of conditioning on the entire action trajectory, only the temporally relevant action segment is injected into each latent frame, preventing causality ambiguities.
Temporal consistency objective: A transition-aware loss augments the standard flow matching objective, promoting smoother dynamics and reducing visual artifacts.
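The relative-action representation from the list above can be sketched in a few lines. This toy version treats poses as plain vectors; real robot poses would also need proper rotation deltas, which are omitted here:

```python
import numpy as np

def to_relative_actions(poses):
    """Represent a short-horizon chunk of poses as deltas from the chunk's
    reference (first) pose, removing dependence on absolute coordinates.
    """
    poses = np.asarray(poses, dtype=float)
    return poses - poses[0]

rel = to_relative_actions([[2.0, 3.0], [2.5, 3.0], [3.0, 4.0]])
# rel -> [[0, 0], [0.5, 0], [1, 1]]
```

Because the same motion produces the same deltas regardless of where it happens in the workspace, the model sees a much lower-variance action distribution.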

Ablation studies show that these components collectively improve performance across both expert demonstrations and counterfactual scenarios. This suggests that enhanced controllability arises not merely from better visual quality, but from more structured and causally consistent action conditioning.

Evaluation strategy: out-of-distribution and counterfactual benchmarks

DreamDojo is evaluated in settings designed to stress generalization beyond standard robot training distributions. The paper describes multiple evaluation sets mirroring human dataset scenarios but performed with a target robot, as well as explicit counterfactual trajectories (e.g., reaching but missing an object). This is important because a world model useful for planning must respond meaningfully when actions deviate from demonstrations.

The paper reports automatic metrics (PSNR/SSIM/LPIPS) and human preference studies for novel-background sets where ground-truth rollouts are unavailable. A key reported trend is that more diverse human data improves both physics realism and action following.
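Of the automatic metrics, PSNR is the simplest to state precisely; a minimal implementation (assuming frames normalized to [0, 1]) looks like this:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio (in dB) between a predicted and a
    ground-truth frame; higher is better.
    """
    mse = np.mean((pred - target) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# A prediction off by 0.1 everywhere gives PSNR = 20 dB for max_val = 1.
score = psnr(np.full((8, 8), 0.6), np.full((8, 8), 0.5))
```

SSIM and LPIPS follow the same pattern of per-frame comparison against ground truth, which is why the novel-background sets (with no ground-truth rollout) fall back to human preference studies.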

Downstream applications: why a real-time world model matters

Policy evaluation (simulated rollouts for policy assessment)
A primary application of DreamDojo is evaluating robot policies without requiring extensive real-world deployment. Given an initial observation, the model simulates future rollouts conditioned on a policy’s predicted actions. The estimated task outcomes can then be compared with real-world performance. The results reported in the paper indicate strong agreement in both policy ranking and success prediction trends, suggesting that a learned world model can serve as a practical and reliable evaluation tool.
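The ranking-agreement claim can be made concrete with a small check. The numbers below are invented for illustration only, not results from the paper:

```python
def ranking_agreement(sim_scores, real_scores):
    """True when simulated evaluation orders policies the same way as
    real-world rollouts — the kind of agreement the paper reports.
    """
    rank = lambda scores: sorted(range(len(scores)), key=lambda i: scores[i])
    return rank(sim_scores) == rank(real_scores)

# Hypothetical success rates for three policies: the absolute numbers differ
# (simulated success can run high), but the ranking matches.
agrees = ranking_agreement([0.9, 0.6, 0.75], [0.8, 0.4, 0.55])
```

Note this also captures the limitation mentioned later: the rankings agree even though the simulated rates are systematically optimistic.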

Model-based planning (simulate candidate actions, select the best)
DreamDojo also enables test-time planning by predicting future trajectories for multiple candidate action sequences. An external value model evaluates these imagined rollouts and selects the most promising one for execution. This produces a simple yet effective planning loop: propose → simulate → score → execute, allowing policies to anticipate consequences before acting.
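The propose → simulate → score → execute loop can be sketched end to end. Both `simulate` and `value_fn` below are toy stand-ins for the learned world model and the external value model:

```python
import numpy as np

def plan_step(state, candidate_action_seqs, simulate, value_fn):
    """One planning cycle: roll each candidate action sequence through the
    world model, score the imagined futures with a value model, and return
    the best candidate's first action for execution.
    """
    scores = [value_fn(simulate(state, seq)) for seq in candidate_action_seqs]
    best = int(np.argmax(scores))
    return candidate_action_seqs[best][0], scores[best]

# Toy setup: dynamics sum the actions; the value model prefers states near 5.
simulate = lambda s, seq: s + sum(seq)
value_fn = lambda s: -abs(s - 5.0)
action, score = plan_step(0.0, [[1.0, 1.0], [2.0, 3.0], [4.0, 4.0]],
                          simulate, value_fn)
# best candidate is [2.0, 3.0] (it reaches 5 exactly), so action == 2.0
```

Re-running this loop at every control step yields a simple model-predictive controller on top of the world model.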

Live teleoperation (real-time interactive prediction)
Through autoregressive distillation, DreamDojo achieves real-time inference, generating future frames at interactive speeds. This capability supports live teleoperation scenarios in which human control inputs are streamed to the model, enabling continuous preview of predicted outcomes and improving operator awareness.

Limitations and open research questions

The paper explicitly notes several limitations that are also useful pointers for future work:
– Uncommon fast actions (e.g., slapping or rapid waving) remain challenging.
– Failure realism: simulated success rates can be higher than real-world outcomes, suggesting missing nuance in failure modes.
– Multi-view simulation is not natively supported, although multi-view is increasingly important for state-of-the-art policies.
– Adaptation/retention trade-offs: retaining pretrained knowledge during embodiment-specific finetuning remains an open problem.

Conclusion

DreamDojo outlines a compelling path toward foundation world models for robotics by shifting pretraining to a domain where scale is naturally abundant: egocentric human video. Its introduction of continuous latent actions provides a practical solution to the missing-label problem, effectively transforming passive video data into action-conditioned supervision. When combined with embodiment-specific post-training and autoregressive distillation for real-time inference, DreamDojo illustrates how generative world models can evolve into interactive tools for teleoperation, policy evaluation, and model-based planning in open-world, contact-rich scenarios.
