how relay actually works
What is JEPA? Joint Embedding Predictive Architecture — Yann LeCun's framework for world models that predict in a compact latent space instead of generating pixels. Fast enough for real-time planning, small enough to train in hours on one GPU.
This experiment: you click where you want your agent to push the T-block (amber), commit, and watch the physics play out. Then the JEPA opponent (cyan) plans its turn — runs CEM in the model's latent space, scoring each candidate action sequence against a goal embedding, without ever simulating pymunk.
RELAY is a JEPA experiment. The bet is that Joint-Embedding Predictive Architectures — LeCun's "post-LLM" path — can drive game engines and game mechanics on CPU at interactive rates, with physical coherence, and without a simulator at inference time on the model's side. This page is proof that it works on a turn-based T-block puzzle.
the game
Two agents (pink/blue), a T-shaped block, two colored goals. You (amber dot) push the T toward your goal. JEPA (cyan dot) pushes toward its goal. 5 turns each, 5 environment steps per turn. Whoever gets the T closer to their goal at match-end wins. Classical pymunk physics for the actual simulation — the model is only used for planning.
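The win condition above is just a distance comparison at match-end. A minimal sketch (function and argument names are illustrative, not from the repo):

```python
import math

def winner(t_pos, amber_goal, cyan_goal):
    """Whichever goal the T-block's final position is closer to wins."""
    d_amber = math.dist(t_pos, amber_goal)
    d_cyan = math.dist(t_pos, cyan_goal)
    if d_amber < d_cyan:
        return "amber"
    if d_cyan < d_amber:
        return "cyan"
    return "draw"
```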
the stack
The model itself follows the LeWM paper to the letter: a 6M-param ViT-Tiny encoder trained from scratch, a 6-layer AR transformer predictor, and only two losses — next-embedding MSE plus SIGReg (the regularizer that prevents embedding collapse without any of the usual stabilization tricks). On top of that we add a DexWM-style joint state head: a small MLP, trained alongside the predictor, that decodes positions from embeddings — it's what lets the planner reason about real-world distances instead of raw latent geometry.
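The state head is the simplest piece of the stack; a sketch of what "decodes positions from embeddings" means, with assumed shapes (a 2-layer MLP mapping one embedding to (x, y) pairs for the three objects — the actual head in the repo may differ):

```python
import numpy as np

def decode_positions(emb, W1, b1, W2, b2):
    """DexWM-style joint state head sketch: embedding -> object positions.
    Output is assumed to be 6 numbers: (x, y) for two agents and the T-block."""
    h = np.maximum(W1 @ emb + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2                  # linear readout to positions
```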
training data — the actor-pattern fix
In a 2-player turn-based game, the model has to learn which ball belongs to which thrust. That depends on how the training data is structured. Our first attempt alternated which ball was acting on every single frame — but at gameplay time, each player acts for a full 5-frame turn. The temporal structure didn't match, and the model never cleanly attributed actions to balls.
The fix: regenerate the data so each ball acts for a full 5-frame turn at a time, matching real gameplay. We also retrained the position decoder on uniformly-random scenes (rather than rollouts) so it sees the full latent geometry the planner queries. Position decode error dropped from ~45 px to ~10 px on agents, and average planning error dropped from 54 px to 47 px (-13%).
Lesson: in turn-based games, your training data has to match the temporal structure of gameplay, or the model never learns which thrust belongs to which ball.
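The actor-pattern fix can be sketched as a data-generation routine: one agent holds a single thrust for a full 5-frame turn while the other stays idle, instead of alternating every frame. Array layout and names are assumptions, not the repo's actual format:

```python
import numpy as np

rng = np.random.default_rng(0)

def turn_structured_actions(n_turns=10, frames_per_turn=5, n_agents=2, dim=2):
    """Each agent acts for a full turn at a time, matching real gameplay.
    Returns (per-frame actor ids, per-frame per-agent thrusts)."""
    actions = np.zeros((n_turns * frames_per_turn, n_agents, dim))
    actors = []
    for t in range(n_turns):
        actor = t % n_agents               # agents alternate per *turn*, not per frame
        thrust = rng.normal(size=dim)      # one thrust, held for the whole turn
        for f in range(frames_per_turn):
            actions[t * frames_per_turn + f, actor] = thrust
            actors.append(actor)
    return np.array(actors), actions
```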
sigreg — the one trick
The whole LeWM claim is that stable JEPA from pixels needs only one extra loss term: SIGReg (Sketched Isotropic Gaussian Regularization). Project embeddings onto 1024 random 1D directions, run an Epps-Pulley characteristic-function test on each, penalize deviation from a standard Gaussian. Embeddings can't collapse because collapsed embeddings fail the test by construction.
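A simplified sketch of that loss: project onto random unit directions and compare each projection's empirical characteristic function against the standard-Gaussian CF exp(-t²/2), in the spirit of the Epps-Pulley test. The grid of t values and the weighting are assumptions, not the paper's exact statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigreg_penalty(embeddings, n_directions=1024):
    """Penalize non-Gaussian 1D projections of the embedding cloud."""
    n, d = embeddings.shape
    dirs = rng.normal(size=(d, n_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = embeddings @ dirs                          # (n, n_directions)
    ts = np.linspace(-3.0, 3.0, 17)
    penalty = 0.0
    for t in ts:
        emp_cf = np.exp(1j * t * proj).mean(axis=0)   # empirical CF per direction
        gauss_cf = np.exp(-0.5 * t * t)               # CF of N(0, 1)
        penalty += np.mean(np.abs(emp_cf - gauss_cf) ** 2)
    return penalty / len(ts)
```

A collapsed batch (every embedding identical) has a characteristic function of magnitude 1 at every t, so it deviates maximally from the Gaussian CF — which is why collapse fails the test by construction.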
the planner
At inference, for JEPA's turn: encode the last 3 frames (stride 5 env-steps apart), render a goal frame (scene with T placed at JEPA's target, agents held at their current positions), encode that too. Then run CEM: sample 96 action sequences from a Gaussian, pass each through the predictor once, score each by MSE between the predicted final embedding and the goal embedding. Take the top 15% as elites, refit mean+std, repeat 6 iterations. Return the best sequence.
The entire loop runs on CPU at interactive rates. No pymunk on the model's side — purely latent-space reasoning.
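With the numbers from the text (96 samples, top 15% elites, 6 iterations, 5-step turns), the CEM inner loop can be sketched as follows. The `predict` callable stands in for the trained predictor rolled out over an action sequence; everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def cem_plan(predict, goal_emb, horizon=5, action_dim=2,
             pop=96, elite_frac=0.15, iters=6):
    """Cross-entropy method in latent space: sample action sequences,
    score by MSE to the goal embedding, refit a Gaussian on the elites."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = mean + std * rng.normal(size=(pop, horizon, action_dim))
        scores = np.array([np.mean((predict(s) - goal_emb) ** 2)
                           for s in samples])
        elites = samples[np.argsort(scores)[:n_elite]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6   # floor keeps sampling alive
    return mean                           # best action sequence found
```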
credits & sources
LeWorldModel paper: arXiv:2603.19312 (Maes, Le Lidec, Scieur, LeCun, Balestriero, 2026). DexWM joint state head: arXiv:2512.13644. Source: github.com/SotoAlt/relay-deploy.