The Allen Institute for AI (AI2) has unveiled MolmoAct 7B, a new open-source robotics model designed to bring advanced intelligence to the physical world. It is built to help robots navigate and interact with complex, unstructured environments such as homes, warehouses, and disaster zones. Unlike many existing robotic systems that function as "black boxes," MolmoAct prioritizes transparency, adaptability, and 3D spatial reasoning.
MolmoAct is classified as an Action Reasoning Model (ARM), meaning it can interpret natural language instructions and devise a sequence of physical actions to execute them in real-world settings. Traditional robotics models often treat tasks as single, opaque steps. In contrast, ARMs break down high-level instructions into a transparent chain of decisions grounded in spatial awareness. This involves 3D-aware perception, where the robot understands its environment using depth and spatial context, and visual waypoint planning, where a step-by-step task trajectory is outlined in image space. This layered reasoning allows MolmoAct to interpret a command and break it down into sub-tasks. For example, when instructed to "sort this trash pile," the robot recognizes the scene, groups objects by type, grasps them individually, and repeats the process.
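As an illustration of that layered decomposition, the hypothetical Python sketch below shows how a high-level command might be broken into pixel-grounded sub-tasks. The `Waypoint` and `SubTask` structures and the `plan` function are invented for this example and are not part of MolmoAct's released code.

```python
from dataclasses import dataclass

# Hypothetical structures illustrating an ARM-style reasoning chain:
# perceive the scene in 3D, plan waypoints in image space, then act.

@dataclass
class Waypoint:
    u: int          # pixel column in the input image
    v: int          # pixel row in the input image
    depth_m: float  # estimated depth at that pixel, in meters

@dataclass
class SubTask:
    description: str
    trajectory: list[Waypoint]  # step-by-step path sketched in image space

def plan(instruction: str, scene_objects: dict[str, list[Waypoint]]) -> list[SubTask]:
    """Toy decomposition of a high-level command into grounded sub-tasks.

    In a real ARM the instruction would condition perception and planning;
    here it only labels the plan for illustration.
    """
    subtasks = []
    for category, locations in scene_objects.items():
        for loc in locations:
            subtasks.append(SubTask(
                description=f"{instruction}: pick up {category} item and place it in the {category} bin",
                trajectory=[loc],
            ))
    return subtasks

# Example: "sort this trash pile" after the scene has been grouped by type.
scene = {
    "plastic": [Waypoint(120, 340, 0.62)],
    "paper":   [Waypoint(200, 310, 0.58), Waypoint(215, 330, 0.60)],
}
for step in plan("sort this trash pile", scene):
    print(step.description, step.trajectory)
```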
One of MolmoAct's key innovations is its ability to transform 2D images into 3D visualizations, which allows it to "think" in three dimensions. According to AI2, the model generates visual reasoning tokens that convert 2D image inputs into 3D spatial plans. This enables robots to navigate the physical world with greater intelligence and control by understanding the relationships between space, movement, and time. Before executing any commands, MolmoAct grounds its reasoning in pixel space and overlays its planned motion trajectory directly onto the images it takes as input. This visual trace provides a preview of the intended movements, enabling users to correct mistakes or prevent unwanted behaviors. Users can also adjust these plans using natural language or by sketching corrections on a touchscreen, offering fine-grained control and enhancing safety.
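A rough sense of what such a visual trace looks like can be conveyed with a few lines of Pillow code. This is a minimal, hypothetical sketch: the waypoint coordinates and file names are made up, and it does not reflect how MolmoAct itself renders its trajectories.

```python
from PIL import Image, ImageDraw

def draw_visual_trace(image_path: str, waypoints: list[tuple[int, int]]) -> Image.Image:
    """Overlay a planned motion trajectory (a "visual trace") onto an input
    frame so a person can review it before the robot executes anything."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.line(waypoints, fill=(255, 0, 0), width=4)  # planned path in pixel space
    for u, v in waypoints:
        # mark each waypoint so individual steps can be inspected or corrected
        draw.ellipse((u - 6, v - 6, u + 6, v + 6), outline=(255, 255, 0), width=2)
    return img

# Hypothetical pixel-space waypoints predicted by a planner.
trace = [(120, 340), (150, 300), (200, 310), (260, 280)]
draw_visual_trace("kitchen_frame.png", trace).save("kitchen_frame_with_trace.png")
```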
MolmoAct builds upon AI2's Molmo multimodal AI model, extending its capabilities to include 3D reasoning and robot action, and it follows the same open approach as AI2's flagship OLMo large language model, a fully transparent alternative to proprietary systems with openly available training data, code, and model weights. AI2 trained MolmoAct 7B, the first model in the family, on a curated dataset of approximately 12,000 "robot episodes" collected in real-world environments such as kitchens and bedrooms. These demonstrations were transformed into robot-reasoning sequences that expose how complex instructions map to grounded, goal-directed actions.
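To give a feel for what such a transformation might produce, the sketch below shows one plausible shape for a reasoning-sequence record derived from a demonstration episode. The field names and values are assumptions for illustration only, not the schema of AI2's released datasets.

```python
# Hypothetical sketch of turning a demonstration episode into a
# robot-reasoning training record. All field names are assumed.

episode = {
    "instruction": "put the mug in the sink",
    "frames": ["frame_000.png", "frame_001.png"],          # RGB observations
}

reasoning_record = {
    "instruction": episode["instruction"],
    "depth_tokens": "<d_12><d_87>...",                      # 3D-aware perception tokens
    "visual_trace": [(132, 244), (140, 231), (158, 220)],   # waypoints in image space
    "actions": [
        {"step": "move gripper above mug", "delta_xyz": [0.02, 0.00, -0.05]},
        {"step": "close gripper",          "delta_xyz": [0.00, 0.00,  0.00]},
    ],
}
```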
AI2 evaluated MolmoAct's pre-training capabilities on SimPLER, a benchmark of simulated test environments for common real-robot manipulation setups. MolmoAct achieved a state-of-the-art out-of-distribution task success rate of 72.1%, surpassing models from other organizations.
MolmoAct is fully open-source and reproducible, aligning with AI2's mission to promote transparency and collaboration in AI development. AI2 is releasing all the necessary components to build, run, and extend the model, including training pipelines, pre- and post-training datasets, model checkpoints, and evaluation benchmarks. This open approach aims to address the "black box problem" associated with many existing AI models, making MolmoAct safe, interpretable, and adaptable. The model and associated resources are available on AI2's Hugging Face repository.
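For readers who want to experiment, the snippet below shows one common way to pull an openly released model and its processor from Hugging Face with the `transformers` library. The repository name is a placeholder and the exact loading recipe may differ, so follow the instructions on MolmoAct's model card.

```python
# Illustrative only: a typical pattern for loading an open model from Hugging Face.
# The repo_id below is a placeholder; check AI2's Hugging Face page for the
# actual identifier and any model-specific loading steps.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "allenai/MolmoAct-7B"  # placeholder name

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```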