[relay] turn-based jepa puzzle
Push the T into your goal. JEPA plans its turn in latent space.
server
opponent jepa
turn
your turns
5
jepa turns
5
next
how to play
Click up to 3 points on the board to draw your path — your turn ends on the 3rd click (or press GO sooner). Push the T into your goal.
scene
JEPA
SCRUB
JEPA IS THINKING
running CEM in latent space
timeline
your turn
click up to 3 points on the board to aim. each click adds a leg to the path — the agent walks through them in sequence.
telemetry
turn
history
0
last plan cost
your progress
jepa progress
status
click "new match" to begin.
match
01

pixels in, plan out

For each candidate action sequence, the model encodes the 3-frame pixel history, predicts the embedding of the final state after 5 env steps, and scores it against the goal embedding (the scene with the T-block at JEPA's target, encoded the same way). The planner refits the sampling distribution over 6 iterations — CEM in latent space.
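A minimal sketch of that scoring path, with stand-in `encode` and `predict` functions (the real model is a ViT-Tiny encoder and an AR transformer; everything below is illustrative, not the page's actual code):

```python
import numpy as np

EMB = 64  # embedding dimension (illustrative)

def encode(frames):
    """Stand-in for the ViT encoder: pixel history -> embedding."""
    return np.tanh(frames.reshape(-1)[:EMB])

def predict(history_emb, actions):
    """Stand-in for the AR predictor: rolls the embedding
    forward one env step per action, purely in latent space."""
    z = history_emb
    for a in actions:
        z = np.tanh(z + 0.1 * a.mean())
    return z

def score(history_frames, actions, goal_emb):
    """Cost of one candidate: MSE between the predicted final
    embedding (after 5 env steps) and the goal embedding."""
    z0 = encode(history_frames)
    z5 = predict(z0, actions)
    return float(np.mean((z5 - goal_emb) ** 2))
```

The planner calls `score` once per candidate sequence; no physics ever runs inside it.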

02

no simulator at inference

The opponent never runs pymunk while planning. Every candidate move is scored by passing it through the trained model — encode the current pixels, predict the resulting embedding, compare against the goal embedding. No physics engine on the model's side.

Classical pymunk only runs to execute the chosen move so you see real collisions. The reasoning is pure latent space.

03

timeline replay

Every turn is recorded — 5 dream steps per JEPA turn, 5 physics steps per player turn. The scrubber below the canvas shows all of them as minor ticks; major cyan ticks mark turn boundaries. Drag to inspect any past state. Esc returns to live.
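The recording structure this implies can be sketched as a hypothetical `Timeline` class (not the page's actual code): minor ticks are frame indices, major ticks come from recorded turn starts.

```python
from dataclasses import dataclass, field

@dataclass
class Timeline:
    frames: list = field(default_factory=list)       # minor ticks
    turn_starts: list = field(default_factory=list)  # major ticks

    def begin_turn(self):
        """Mark a turn boundary at the current frame index."""
        self.turn_starts.append(len(self.frames))

    def record(self, state):
        """Append one step (dream or physics) to the history."""
        self.frames.append(state)

    def scrub(self, i):
        """Return the frame at index i, clamped to the recording."""
        return self.frames[max(0, min(i, len(self.frames) - 1))]
```

Clamping in `scrub` keeps dragging past either end of the timeline safe.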

how relay actually works

What is JEPA? Joint-Embedding Predictive Architecture — Yann LeCun's framework for world models that predict in a compact latent space instead of generating pixels. Fast enough for real-time planning, small enough to train in hours on one GPU.

This experiment: you click where you want your agent to push the T-block (amber), commit, and watch the physics play out. Then the JEPA opponent (cyan) plans its turn — runs CEM in the model's latent space, scoring each candidate action sequence against a goal embedding, without ever simulating pymunk.

RELAY is a JEPA experiment. The bet is that Joint-Embedding Predictive Architectures — LeCun's "post-LLM" path — can drive game engines and game mechanics on CPU at interactive rates, with physical coherence, and without a simulator at inference time on the model's side. This page is proof it works on a turn-based T-block puzzle.

the game

Two agents (pink/blue), a T-shaped block, two colored goals. You (amber dot) push the T toward your goal. JEPA (cyan dot) pushes toward its goal. 5 turns each, 5 environment steps per turn. Whoever gets the T closer to their goal at match-end wins. Classical pymunk physics for the actual simulation — the model is only used for planning.

the stack

The model itself follows the LeWM paper byte-for-byte: a 6M-param ViT-Tiny encoder trained from scratch, a 6-layer AR transformer predictor, and only two losses — next-embedding MSE plus SIGReg (the regularizer that prevents embedding collapse without any of the usual stabilization tricks). On top of that we add a DexWM-style joint state head (a small MLP that decodes positions from embeddings) trained alongside the predictor — it's what lets the planner reason about real-world distances instead of raw latent geometry.
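The training objective described above can be sketched as a single sum (weights and names here are illustrative assumptions, not the paper's values):

```python
import numpy as np

def world_model_losses(pred_next_emb, target_next_emb,
                       decoded_pos, true_pos, sigreg_term,
                       state_weight=1.0):
    """Total loss as described in the text: next-embedding MSE
    plus SIGReg on the embeddings, plus the DexWM-style state-head
    MSE that grounds embeddings in real positions."""
    l_pred = np.mean((pred_next_emb - target_next_emb) ** 2)
    l_state = np.mean((decoded_pos - true_pos) ** 2)
    return float(l_pred + sigreg_term + state_weight * l_state)
```

The state head is trained jointly but only the embedding losses shape the latent space; the MLP just reads positions out of it.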

training data — the actor-pattern fix

In a 2-player turn-based game, the model has to learn which ball belongs to which thrust. That depends on how the training data is structured. Our first attempt alternated which ball was acting on every single frame — but at gameplay time, each player acts for a full 5-frame turn. The temporal structure didn't match, and the model never cleanly attributed actions to balls.

The fix: regenerate the data so each ball acts for a full 5-frame turn at a time, matching real gameplay. We also retrained the position decoder on uniformly-random scenes (rather than rollouts) so it sees the full latent geometry the planner queries. Position decode error dropped from ~45 px to ~10 px on agents, and average planning error dropped from 54 px to 47 px (-13%).
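The fixed actor pattern amounts to a block schedule rather than a per-frame alternation. A sketch with a hypothetical `actor_schedule` helper:

```python
import numpy as np

def actor_schedule(n_frames, turn_len=5, n_agents=2):
    """Assign an acting agent to each frame in full turn_len-frame
    blocks, matching real gameplay (not alternating per frame)."""
    turns = -(-n_frames // turn_len)  # ceil division
    sched = np.repeat(np.arange(turns) % n_agents, turn_len)
    return sched[:n_frames]
```

With `turn_len=5` this yields `0,0,0,0,0,1,1,1,1,1,...`, exactly the temporal structure the model sees at inference.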

Lesson: in turn-based games, your training data has to match the temporal structure of gameplay, or the model never learns which thrust belongs to which ball.

sigreg — the one trick

The whole LeWM claim is that a stable JEPA trained from pixels needs only one extra loss term: SIGReg (Sketched Isotropic Gaussian Regularization). Project embeddings onto 1024 random 1D directions, run an Epps-Pulley characteristic-function test on each, and penalize deviation from Gaussian. Embeddings can't collapse, because collapsed embeddings fail the test by construction.
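A simplified sketch of that test (the t-grid, weighting, and exact statistic are illustrative stand-ins for the Epps-Pulley details, not the paper's implementation):

```python
import numpy as np

def sigreg(emb, n_dirs=1024, rng=None):
    """Sketch of SIGReg: project embeddings onto random unit
    directions, compare each 1D empirical characteristic function
    against the N(0,1) CF exp(-t^2/2), penalize squared deviation."""
    if rng is None:
        rng = np.random.default_rng(0)
    t_grid = np.linspace(0.1, 3.0, 16)
    n, d = emb.shape
    dirs = rng.standard_normal((d, n_dirs))
    dirs /= np.linalg.norm(dirs, axis=0)         # unit directions
    proj = emb @ dirs                            # (n, n_dirs) 1D projections
    tx = t_grid[:, None, None] * proj[None]      # (T, n, n_dirs)
    ecf = np.exp(1j * tx).mean(axis=1)           # empirical CF, (T, n_dirs)
    gauss_cf = np.exp(-t_grid ** 2 / 2)[:, None]
    return float(np.mean(np.abs(ecf - gauss_cf) ** 2))
```

A collapsed embedding projects to a point mass, whose characteristic function is the constant 1 — far from the Gaussian CF at larger t — so the penalty catches collapse by construction.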

the planner

At inference, for JEPA's turn: encode the last 3 frames (taken 5 env steps apart), render a goal frame (the scene with the T placed at JEPA's target, agents held at their current positions), and encode that too. Then run CEM: sample 96 action sequences from a Gaussian, pass each through the predictor once, and score each by MSE between the predicted final embedding and the goal embedding. Take the top 15% as elites, refit the mean and std, and repeat for 6 iterations. Return the best sequence.
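The loop above as a generic sketch — `score_fn` would be a closure wrapping the encode-predict-MSE scorer; hyperparameter defaults mirror the numbers in the text:

```python
import numpy as np

def cem_plan(score_fn, horizon=5, act_dim=2, n_samples=96,
             elite_frac=0.15, iters=6, rng=None):
    """CEM in latent space (sketch): sample action sequences from
    a Gaussian, score each with the world model, refit the sampling
    distribution on the elites, repeat, and return the best."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    n_elite = max(1, int(n_samples * elite_frac))
    best, best_cost = None, np.inf
    for _ in range(iters):
        cand = mu + std * rng.standard_normal((n_samples, horizon, act_dim))
        costs = np.array([score_fn(c) for c in cand])
        elite = cand[np.argsort(costs)[:n_elite]]
        mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        if costs.min() < best_cost:
            best_cost = float(costs.min())
            best = cand[costs.argmin()]
    return best, best_cost
```

Each iteration costs 96 predictor forward passes and nothing else, which is why the whole loop fits on CPU.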

The entire loop runs on CPU at interactive rates. No pymunk on the model's side — purely latent-space reasoning.

credits & sources

LeWorldModel paper: arXiv:2603.19312 (Maes, Le Lidec, Scieur, LeCun, Balestriero, 2026). DexWM joint state head: arXiv:2512.13644. Source: github.com/SotoAlt/relay-deploy.

JEPA WINS

match complete

you · jepa

new match