
"Researchers from Google DeepMind have recently described a new approach for teaching intelligent agents to solve complex, long-term tasks by training them exclusively on video footage rather than through direct interaction with the environment. Their new agent, called Dreamer 4, demonstrated the ability to mine diamonds playing Minecraft after being trained on videos, without ever actually playing the game. The researchers dubbed their approach imagination training to emphasize that the agent learns solely from offline data, without any interaction with the physical world."
"Their model architecture comprises two main components: a tokenizer that compresses each video frame into a continuous representation, and a dynamics model that predicts the next world representation given the current one and the chosen action. To make the dynamics model more efficient, the researchers employed shortcut forcing, training the model to take larger steps when predicting future frames without losing accuracy. As a result, Dreamer 4 can generate new world representations in real time."
Dreamer 4 is an agent trained entirely on offline video data; it learns to solve complex, long-term tasks without interacting with the environment. Its architecture pairs a tokenizer, which compresses each video frame into a continuous representation, with a dynamics model, which predicts the next world representation from the current one and the chosen action. Shortcut forcing trains the dynamics model to take larger steps when predicting future frames, which makes long-horizon prediction efficient. Causal attention across space and time, combined with specialized memory, enables real-time generation at 20 or more frames per second on a single GPU. Dreamer 4 mined diamonds in Minecraft, a task that requires choosing sequences of more than 20,000 mouse and keyboard actions from raw pixels.
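To make the two-component design concrete, here is a minimal sketch of how a tokenizer and a dynamics model can be chained to roll out a trajectory purely "in imagination", with no environment calls. It is a toy illustration under stated assumptions, not Dreamer 4's implementation: the names `tokenize`, `dynamics`, and `imagine` are hypothetical, the tokenizer is plain patch averaging, and the dynamics rule is hand-written instead of learned.

```python
from typing import Callable, List, Sequence

Latent = List[float]


def tokenize(frame: Sequence[Sequence[float]], patch: int = 2) -> Latent:
    """Toy tokenizer (assumption): average each patch x patch block of the
    frame into one float, giving a continuous representation."""
    height, width = len(frame), len(frame[0])
    latent: Latent = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            block = [
                frame[i][j]
                for i in range(r, min(r + patch, height))
                for j in range(c, min(c + patch, width))
            ]
            latent.append(sum(block) / len(block))
    return latent


def dynamics(latent: Latent, action: int) -> Latent:
    """Toy dynamics (assumption): action 1 rotates the latent right by one
    position; any other action leaves it unchanged. A real model is learned."""
    if action == 1:
        return latent[-1:] + latent[:-1]
    return list(latent)


def imagine(start: Latent, policy: Callable[[Latent], int], horizon: int) -> List[Latent]:
    """Roll the dynamics model forward for `horizon` steps, feeding each
    predicted latent back in. No environment is queried at any point."""
    trajectory = [start]
    for _ in range(horizon):
        action = policy(trajectory[-1])
        trajectory.append(dynamics(trajectory[-1], action))
    return trajectory


# Usage: encode one 4x4 frame, then imagine two steps under a fixed policy.
frame = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
start = tokenize(frame)
rollout = imagine(start, policy=lambda z: 1, horizon=2)
```

The point of the sketch is the data flow: frames are encoded once, and every later state comes from the dynamics model's own predictions, which is what lets a policy be trained on imagined trajectories instead of live gameplay.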
Read at InfoQ