LeWorldModel: First Pixel-Native JEPA — 15M Params, 48x Faster Planning

LeWorldModel, developed by Mila, NYU, Samsung SAIL, and Brown University (with no Meta authors), is the first JEPA (Joint Embedding Predictive Architecture) trained end-to-end from raw pixels. It uses just 15 million parameters, trains on a single GPU in a few hours, and achieves 48× faster planning than foundation-model-based world models while remaining competitive on 2D and 3D planning benchmarks. The architecture eliminates the exponential moving averages and pretrained encoders that previous JEPAs relied on to avoid representation collapse, reduces six hyperparameters to one, and fits on a laptop GPU.
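
The paper's training details aren't reproduced here, but the core JEPA idea, predicting the next observation's embedding rather than its pixels, is easy to sketch. Below is a minimal PyTorch rendering under stated assumptions: TinyEncoder, Predictor, the latent and action sizes, and the variance regularizer standing in for the paper's single anti-collapse hyperparameter are all illustrative, not LeWorldModel's actual design.

```python
# Hedged sketch of a JEPA-style training step: all module names and the
# anti-collapse regularizer below are illustrative assumptions, not the
# paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Maps raw pixel frames to a compact latent vector."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Predictor(nn.Module):
    """Predicts the next latent from the current latent and an action."""
    def __init__(self, latent_dim=128, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def jepa_step(encoder, predictor, opt, obs_t, action_t, obs_next, reg_weight=1.0):
    """One training step: predict the next frame's embedding in latent space.

    Without an EMA target network or stop-gradient, a bare prediction loss
    can collapse to a constant latent. The variance term below is a
    stand-in regularizer; the paper's single hyperparameter presumably
    plays this role, but its exact form is an assumption here.
    """
    z_t, z_next = encoder(obs_t), encoder(obs_next)
    pred = predictor(z_t, action_t)
    pred_loss = F.mse_loss(pred, z_next)
    # Keep per-dimension latent variance near 1 so embeddings can't collapse.
    reg = ((z_next.std(dim=0) - 1.0) ** 2).mean()
    loss = pred_loss + reg_weight * reg
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

A toy invocation on random frames, just to show the expected shapes:

```python
enc, pred = TinyEncoder(), Predictor()
opt = torch.optim.Adam(list(enc.parameters()) + list(pred.parameters()), lr=3e-4)
obs_t = torch.randn(16, 3, 64, 64)     # batch of current frames
obs_next = torch.randn(16, 3, 64, 64)  # batch of next frames
acts = torch.randn(16, 4)              # actions taken between them
loss = jepa_step(enc, pred, opt, obs_t, acts, obs_next)
```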

Why It Matters

If a 15M-parameter pixel-native world model can plan 48× faster than foundation-model baselines at competitive accuracy, the case for JEPA-based architectures as the substrate for physical AI agents becomes significantly more concrete, and the approach becomes accessible to researchers without hyperscale compute budgets.
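
The planning speedup is plausible once you see where the compute goes: with a latent predictor this small, a planner can score hundreds of candidate action sequences without ever generating a pixel. Below is a generic random-shooting sketch of that idea, not the paper's planner; plan_action, goal_z, and the horizon and candidate counts are illustrative, and the encoder/predictor interface is the toy one from the sketch above.

```python
# Hedged sketch: random-shooting planning over latent rollouts. Generic
# technique for illustration; not LeWorldModel's actual planner.
import torch

@torch.no_grad()
def plan_action(encoder, predictor, obs, goal_z, action_dim=4,
                horizon=8, num_candidates=256):
    """Sample candidate action sequences, roll them out entirely in latent
    space, and return the first action of the sequence whose final latent
    lands closest to the goal. Cost scales with the tiny predictor, not
    with pixel-space generation."""
    z0 = encoder(obs.unsqueeze(0)).repeat(num_candidates, 1)
    actions = torch.randn(num_candidates, horizon, action_dim)
    z = z0
    for t in range(horizon):
        z = predictor(z, actions[:, t])   # one cheap latent step per action
    dists = (z - goal_z).norm(dim=-1)     # distance of each rollout to goal
    return actions[dists.argmin(), 0]     # MPC-style: execute first action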