DM0 Redefines Physical AI via Embodied Training

Traditional robotic AI often struggles because it is adapted from models trained primarily on internet text rather than the physical world. The new DM0 framework reverses this trend by training a Vision-Language-Action model on physical priors from the very beginning, enabling robots to reason about, navigate, and manipulate the physical world within a single model.

Physical AI has reached a pivotal turning point with the introduction of DM0, a vision-language-action (VLA) framework that integrates physical laws and spatial reasoning from its very inception. Unlike previous models that were adapted from internet text and images, Hao Liu, Bin Xie, and Yi Yang have developed a system that treats physical interaction as a primary data source rather than a fine-tuning afterthought. This "embodied-native" approach allows robots to navigate complex environments and manipulate objects with a level of precision that mirrors biological learning, bridging the long-standing gap between digital reasoning and real-world execution.

How does DM0 differ from traditional vision-language-action models?

DM0 differs from traditional VLA models by incorporating intrinsic multi-source physical priors from the outset of training, rather than relying on fine-tuning internet-pretrained models. By utilizing a hybrid training strategy and a flow-matching action expert, DM0 preserves generalized semantic representations while simultaneously mastering the high-frequency control required for complex robotic tasks, outperforming strong baselines such as π0.

Because traditional robotic AI is adapted from models trained primarily on internet text rather than on the physical world, these "internet-first" models lack inherent spatial intelligence, leading to "hallucinations" in physical movement: a robot might understand the command "pick up the cup" yet fail to grasp the torque or trajectory required to do so. In contrast, DM0 is an embodied-native model, built to treat physical grounding (the relationship between visual input, linguistic commands, and motor output) as a single, unified language of action.
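
To make the idea of a unified vision-language-action mapping concrete, here is a minimal, self-contained PyTorch sketch. The class, dimensions, and action format are illustrative assumptions, not the DM0 architecture.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Illustrative vision-language-action policy (not the DM0 architecture).

    It fuses a visual embedding and a language embedding into a shared
    representation, then decodes a short chunk of continuous motor commands,
    treating perception, language, and action as one unified mapping.
    """

    def __init__(self, vis_dim=512, txt_dim=512, hidden=512, action_dim=7, horizon=8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        # Decode a whole action chunk (horizon x action_dim) at once.
        self.action_head = nn.Linear(hidden, horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vis_emb, txt_emb):
        h = self.fuse(torch.cat([vis_emb, txt_emb], dim=-1))
        return self.action_head(h).view(-1, self.horizon, self.action_dim)

# Usage: in practice the embeddings would come from pretrained vision/text encoders.
policy = ToyVLAPolicy()
actions = policy(torch.randn(1, 512), torch.randn(1, 512))
print(actions.shape)  # torch.Size([1, 8, 7])
```

The point is the interface rather than the architecture: one network consumes perception and language together and emits continuous motor commands, instead of handing text off to a separate controller.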

The Concept of Embodied-Native Intelligence in Physical AI

Embodied-native intelligence refers to a paradigm where an AI model learns the fundamental laws of physics and spatial relationships concurrently with semantic language data. This approach moves beyond passive observation, where a model merely watches videos or reads descriptions, to active physical grounding. By training on heterogeneous data sources including autonomous driving logs and robotic interaction data, DM0 develops a "common sense" for the physical world that internet-only models cannot replicate.

The research team argues that fine-tuning internet models for physics is insufficient for complex tasks because the underlying architecture is not optimized for low-level control. DM0 addresses this by integrating spatial knowledge from diverse corpora. For example, by including autonomous driving scenarios, the model learns the dynamics of movement and obstacle avoidance at scale. These physical priors act as a scaffold, allowing the model to transition from understanding a 2D image to operating in a 3D space with a sense of depth and consequence.
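
One simple way to picture training on multi-source physical priors is a loader that draws each batch from a weighted mixture of corpora. The sources and weights below are illustrative placeholders, not the paper's actual data recipe.

```python
import random

# Hypothetical mixture of training corpora; the names and weights are
# illustrative, not DM0's actual data recipe.
CORPORA = {
    "web_image_text": 0.5,     # generic semantic knowledge
    "driving_logs": 0.3,       # motion dynamics and obstacle avoidance at scale
    "robot_interaction": 0.2,  # manipulation trajectories with recorded actions
}

def sample_source(rng=random):
    """Pick which corpus the next training batch is drawn from."""
    names, weights = zip(*CORPORA.items())
    return rng.choices(names, weights=weights, k=1)[0]

counts = {name: 0 for name in CORPORA}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # roughly proportional to the mixture weights
```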

What is the three-stage pipeline of DM0: Pretraining, Mid-Training, and Post-Training?

The DM0 pipeline consists of unified Pretraining on diverse web and physical corpora, Mid-Training to develop a flow-matching action expert, and Post-Training for task-specific refinement. This structured approach ensures the model retains broad semantic knowledge while gaining the specialized motor skills necessary for precision manipulation and environmental navigation in the Physical AI domain.

During the Pretraining phase, the researchers conduct large-scale training on the Vision-Language Model (VLM) using web text, driving data, and interaction logs. This stage is critical for acquiring semantic knowledge alongside physical intuition. Following this, the Mid-Training stage introduces a flow-matching action expert. This component is built atop the VLM to reconcile high-level reasoning with the granular requirements of robotic control. Finally, the Post-Training phase involves reinforcement learning and fine-tuning in specific environments, such as the RoboChallenge benchmark, to ensure the model can handle specialist tasks with high reliability.
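
Schematically, those three stages can be summarised as the skeleton below; the function names, signatures, and docstrings are a paraphrase of the description above rather than the authors' released code.

```python
from typing import Any, Iterable

def pretrain(vlm: Any, web_text: Iterable, driving_data: Iterable,
             interaction_logs: Iterable) -> Any:
    """Stage 1: joint large-scale training of the Vision-Language Model on web
    text, driving data, and robot interaction logs, so that semantic knowledge
    and physical intuition are acquired together."""
    ...

def mid_train(vlm: Any, embodied_trajectories: Iterable) -> Any:
    """Stage 2: build a flow-matching action expert on top of the VLM and train
    it to turn fused vision-language features into high-frequency motor
    trajectories without overwriting the VLM's general reasoning."""
    ...

def post_train(action_expert: Any, target_tasks: Iterable) -> Any:
    """Stage 3: reinforcement learning and fine-tuning in specific environments
    (for example, RoboChallenge-style tasks) for reliable specialist skills."""
    ...
```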

Can DM0 be used for both robot manipulation and navigation?

DM0 is designed to function as a generalist model capable of both robot manipulation and navigation by unifying these tasks within a single framework. It achieves state-of-the-art performance on the Table30 benchmark for manipulation while demonstrating robust spatial Chain-of-Thought (CoT) reasoning that allows it to navigate through environments and interact with objects as part of a continuous workflow.

Historically, robotic systems have operated in silos: one model handles moving from point A to point B (navigation), while another handles picking up an object (manipulation). DM0 breaks these silos by treating both as embodied actions. This unification is powered by heterogeneous data, which provides the model with examples of both broad environmental movement and fine-grained hand-eye coordination. In practical applications, this means a DM0-powered robot could navigate a kitchen to find a specific fruit and then precisely arrange it in a bowl, maintaining a high-level goal-oriented focus while managing the low-level physics of each step.
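
At the interface level, that unification might look like the sketch below; the reason() and predict_actions() methods are hypothetical stand-ins, not DM0's actual API.

```python
# Illustrative sketch of one policy handling both navigation and manipulation.
# The `reason` and `predict_actions` methods are hypothetical stand-ins.

def act(policy, observation, instruction, use_cot=True):
    """Produce actions for either navigation or manipulation with one model."""
    plan = None
    if use_cot:
        # A spatial chain-of-thought, e.g. "the apple is two metres ahead on
        # the counter; approach the counter, then reach with the right gripper".
        plan = policy.reason(observation, instruction)
    # The same action head can emit base velocities (navigation) or
    # end-effector deltas (manipulation), so one model covers both regimes.
    return policy.predict_actions(observation, instruction, plan)
```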

Technical Breakthroughs: The Flow-Matching Action Expert

The flow-matching action expert is a specialized architectural component that allows DM0 to predict precise motor trajectories by mapping visual and linguistic inputs to physical actions. This mechanism uses a hybrid training strategy where gradients from action tasks are not backpropagated to the core VLM, thereby preventing "catastrophic forgetting" of general reasoning abilities while the robot learns specific Physical AI skills.

  • Gradient Isolation: By preventing action-related gradients from altering the VLM, DM0 ensures that learning how to turn a screw doesn't degrade the model's ability to understand complex verbal instructions (a code sketch of this mechanism follows the list).
  • Embodied Spatial Scaffolding: This strategy uses Chain-of-Thought reasoning to constrain the "action solution space," helping the robot plan its movements logically before executing them.
  • Efficiency Gains: The flow-matching approach allows for faster convergence during training compared to traditional diffusion-based models, making it more feasible to train on massive datasets.
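
The sketch below illustrates the two ideas at the heart of this design: flow matching over action chunks, and gradient isolation by detaching the VLM features before they reach the action expert. All dimensions, names, and the conditioning scheme are assumptions for illustration, not the DM0 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowMatchingActionExpert(nn.Module):
    """Toy flow-matching head that predicts a velocity field over action chunks."""

    def __init__(self, cond_dim=512, action_dim=7, horizon=8, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = cond_dim + horizon * action_dim + 1  # condition + noisy actions + time
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def velocity(self, cond, noisy_actions, t):
        x = torch.cat([cond, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def flow_matching_step(vlm_features, expert, expert_actions):
    """One training step: regress the straight-line velocity from noise to data."""
    # Gradient isolation: detach so action losses never update the VLM backbone.
    cond = vlm_features.detach()

    x1 = expert_actions                    # ground-truth action chunk
    x0 = torch.randn_like(x1)              # noise sample
    t = torch.rand(x1.shape[0], 1)         # random time in [0, 1]
    xt = (1 - t.view(-1, 1, 1)) * x0 + t.view(-1, 1, 1) * x1
    target_v = x1 - x0                     # constant velocity of the linear path

    pred_v = expert.velocity(cond, xt, t)
    return F.mse_loss(pred_v, target_v)

# Usage with dummy tensors standing in for VLM features and demonstrations.
expert = FlowMatchingActionExpert()
loss = flow_matching_step(torch.randn(4, 512), expert, torch.randn(4, 8, 7))
loss.backward()  # gradients reach the action expert only; the detached VLM is untouched
```

Because the conditioning features are detached, the backward pass updates only the action expert, which is exactly the isolation described in the first bullet above.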

Future Implications for Physical AI and RoboChallenge Performance

The performance of DM0 on the RoboChallenge benchmark demonstrates its potential to become the standard for general-purpose domestic and industrial robots. By achieving state-of-the-art results in both Specialist and Generalist settings on Table30, DM0 proves that embodied-native models can handle a vast array of tasks—from plugging in cables to sorting items—with minimal task-specific programming.

As the field moves toward Spatial Intelligence, the DM0 framework provides a clear roadmap. The ability to learn from diverse interaction logs means that as more robots enter the world, the pool of data for models like DM0 will grow exponentially. This creates a virtuous cycle where Physical AI becomes increasingly adept at understanding the nuances of the human world. The success of Hao Liu, Bin Xie, and Yi Yang in creating a model that "thinks" in terms of physical action suggests that the next generation of robots will not just be programmed to perform tasks, but will possess an inherent understanding of the environments they inhabit.

James Lawson

Investigative science and tech reporter focusing on AI, space industry and quantum breakthroughs

University College London (UCL) • United Kingdom

Readers' Questions Answered

Q How does DM0 differ from traditional vision-language-action models?
A DM0 differs from traditional vision-language-action (VLA) models by being an embodied-native model that incorporates intrinsic multi-source physical priors, rather than adapting purely semantic vision-language models (VLMs) fine-tuned on robot data. It employs a hybrid training strategy where a flow-matching action expert is built atop the VLM, with gradients from embodied data not backpropagated to the VLM to preserve generalized representations, while allowing VLM training on non-embodied data. This design enables superior performance in complex manipulation tasks compared to baselines like π0.
Q Can DM0 be used for both robot manipulation and navigation?
A Yes, DM0 can be used for both robot manipulation and navigation. It excels in manipulation benchmarks like Table30, achieving state-of-the-art results in tasks such as arranging fruits and plugging cables. It also generalizes effectively to mobile contexts, showing strong chain-of-thought reasoning and potential for mobile agent applications.
Q What is the three-stage pipeline of DM0: Pretraining, Mid-Training, and Post-Training?
A DM0 follows a three-stage pipeline. Pretraining trains the VLM jointly on large-scale web text, driving data, and robot interaction logs. Mid-Training builds a flow-matching action expert on top of the VLM, with selective gradient backpropagation so that action learning does not overwrite general reasoning. Post-Training fine-tunes the model on specific environments, such as the RoboChallenge tasks. At inference, the model supports direct action prediction or reasoned textual outputs that condition the actions.
