At ~6 Hz, the client renders a 128×128 frame, optionally occludes the right side, and sends the PNG to the server. The server runs a frozen 13M-parameter CNN JEPA (encoder → predictor → state head) and returns the predicted ball_y as the paddle target. Only 1,930 parameters are trainable. No game state, only pixels.
At 0% occlusion the task is trivial: any tracker that sees the ball wins. Toggle to 40% (press E) and the right side goes dark before the model sees it, so the model must imagine the ball through the blind zone. You still see the full court via the purple overlay. Watch the cyan JEPA ball and the error metrics as occlusion increases.
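The occlusion toggle amounts to masking the frame before it reaches the encoder. A minimal sketch, assuming the frame arrives as a NumPy array; the `occlude_right` helper is illustrative, not the repo's actual API:

```python
import numpy as np

def occlude_right(frame: np.ndarray, fraction: float = 0.4) -> np.ndarray:
    """Zero out the rightmost `fraction` of columns (hypothetical helper)."""
    out = frame.copy()
    w = frame.shape[1]
    cut = int(w * (1.0 - fraction))  # first column of the blind zone
    out[:, cut:] = 0
    return out

frame = np.ones((128, 128), dtype=np.float32)
masked = occlude_right(frame, 0.4)  # left ~60% untouched, right ~40% dark
```

The model never sees the masked region; the purple overlay in the UI is rendered client-side from the unmasked frame.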
Live p50 / p95 / max error for JEPA vs a classical physics extrapolator, for both ball_y and ball_x. The classical baseline has ground truth (unfair advantage) — it's an oracle for comparison, not a competitor. Press R to reset the error buffers.
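The rolling statistics are plain percentiles over an absolute-error buffer. A sketch (the function name is illustrative, not from the repo):

```python
import numpy as np

def error_stats(errors):
    """p50 / p95 / max over a rolling buffer of prediction errors (sketch)."""
    e = np.abs(np.asarray(errors, dtype=np.float64))
    return {
        "p50": float(np.percentile(e, 50)),
        "p95": float(np.percentile(e, 95)),
        "max": float(e.max()),
    }

stats = error_stats([0.01, -0.02, 0.03, 0.10])
```

Resetting the buffers (the R key) just clears the underlying arrays so the percentiles reflect only the current occlusion setting.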
The left paddle above is controlled entirely by a JEPA world model. It receives no game state — only a 128×128 pixel screenshot. From that image alone, it predicts where the ball will be and moves the paddle there.
The architecture is a JEPA (Joint Embedding Predictive Architecture): instead of generating future video frames pixel by pixel, it predicts in a compressed latent space. This makes it fast enough to run in real time.
"Frozen" means the encoder and predictor weights never change after initial training — only the tiny linear state head (1,930 parameters out of 13 million) learns to read ball positions from the predictor's output. This is the DexWM recipe.
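The frozen-backbone setup corresponds to disabling gradients everywhere except the state head. A minimal PyTorch sketch with stub linear modules and made-up dimensions (the real encoder and predictor are CNNs, and the real head has 1,930 parameters):

```python
import torch
import torch.nn as nn

# Stub modules with invented dims; the real encoder/predictor are CNNs.
encoder = nn.Linear(128 * 128, 256)
predictor = nn.Linear(256, 256)
state_head = nn.Linear(256, 2)  # decodes (ball_x, ball_y) from predictor output

# Freeze the backbone: only the state head receives gradient updates.
for module in (encoder, predictor):
    for p in module.parameters():
        p.requires_grad = False

trainable = sum(p.numel() for p in state_head.parameters())  # 256*2 + 2 = 514 here
optimizer = torch.optim.Adam(state_head.parameters(), lr=1e-3)
```

Because the optimizer only sees `state_head.parameters()`, the 13M backbone weights are untouched even if gradients were accidentally enabled.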
2.8% median ball_y error on the training distribution. The model decodes ball position from pixels accurately enough to play interactively against a human.
When the ball follows trajectories the model wasn't trained on (random paddle vs AI-tracked), controller success drops from 99.3% → 88.7%. Under 40% pixel occlusion (toggle it above with E), the model collapses to its training mean.
Retraining only the 1,930-param state head on occluded frames improves ball_x error by 58% at 40% occlusion. Unexpectedly, it also fixes the trajectory OOD drop: ball_y degradation goes from +125% → +6%.
As a comparison, we tested swapping our custom CNN encoder for a frozen DINOv2 (Meta's pretrained model, zero Pong training). Under 40% occlusion, DINOv2's embeddings stay at cosine similarity 0.94 to the clean-frame embeddings, while our CNN's drop to 0.20. But our custom CNN + predictor wins at the actual game because it predicts ahead. Robustness and task precision are different things.
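The robustness numbers above are cosine similarities between an encoder's embedding of a clean frame and its embedding of the same frame occluded. The metric itself is just the normalized dot product; the vectors below are toy values, not real embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy illustration: a robust encoder keeps the occluded embedding close
# to the clean one (high cosine); a brittle one does not.
clean = np.array([1.0, 0.0, 1.0, 0.0])
occluded = np.array([1.0, 0.1, 0.9, 0.0])
sim = cosine(clean, occluded)
```

A value near 1.0 means occlusion barely moved the embedding; near 0.0 means the encoder's representation fell apart.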
# clone and install
git clone https://github.com/SotoAlt/lepong
cd lepong
pip install torch torchvision fastapi uvicorn pillow numpy websockets

# download model from huggingface
pip install huggingface_hub
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('sotoalt/lepong', 'lepong_statehead_occ_aug.pt', local_dir='checkpoints')"

# run the demo
python -m server.infer --checkpoint checkpoints/lepong_statehead_occ_aug.pt --port 8791

# open http://localhost:8791
Code, model weights, and training data: github.com/SotoAlt/lepong · huggingface.co/sotoalt/lepong
Isn't this just a linear probe on frozen CNN features?
Mostly, yes. The state head is a linear layer reading from the predictor's output. The difference from a standard probe: it reads from the predictor's embedding (what the model thinks comes next), not the encoder's (what the model sees now). With zero actions passed to the predictor, the margin is small. We're transparent about this.
Can this model do multi-step rollout?
No. We tried — it collapses after 5-10 autoregressive steps (a known JEPA limitation). This is a 1-step state decoder with a predictor in the loop, not a long-horizon planner. Calling it a "world model" is technically correct (it predicts future state from observations) but it doesn't plan ahead the way you might imagine.
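"Autoregressive rollout" here means feeding each predicted embedding back in as the next input. A toy sketch with a contractive linear "predictor" that, loosely like the real model, drifts toward a degenerate fixed point over repeated steps (the matrix is illustrative only, not the trained predictor):

```python
import numpy as np

def rollout(predict, z0, steps):
    """Autoregressive rollout: feed each predicted embedding back in."""
    zs = [z0]
    for _ in range(steps):
        zs.append(predict(zs[-1]))
    return zs

# Toy contraction: each step shrinks the embedding toward zero, a crude
# stand-in for the collapse observed after 5-10 real rollout steps.
W = 0.5 * np.eye(3)
zs = rollout(lambda z: W @ z, np.ones(3), 10)
```

In the real model the collapse is not a clean contraction, but the failure mode is similar: small per-step errors compound until the embedding no longer encodes a meaningful ball state.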
Pong is trivial. Any CNN can learn ball position.
Correct. Pong was chosen because it trains in minutes on CPU. The point isn't "can a neural net play Pong" — it's the OOD analysis: what breaks, what fixes it, and why pretrained encoders help. Those findings transfer to harder domains.
The DINOv2 comparison is unfair — you gave it no predictor.
Deliberately so. We compared our full pipeline (encoder + predictor + state head) against DINOv2 as encoder only. The fair comparison would be DINOv2 + a trained predictor on top — which is what jepa-wms does, and it would likely win at both robustness and task performance. The point of our test was to isolate the encoder robustness gap, not to claim our CNN is better overall.
Is the "accidental fix" on trajectory OOD statistically significant?
The per-frame decoder result (ball_y OOD drop +125% → +6%, measured on 2000 frames) is robust. The controller result (+2.0 pts on 150 trials) is within noise. We're confident about the decoder finding, less confident about the controller number.
The model passes zero actions to the predictor. Is it actually predicting?
The predictor receives zero-vectors for actions because we don't feed the paddle action back. So the "prediction" is mostly the predictor's learned inertia prior: "given these 3 frames, what embedding comes next." It IS adding value over raw encoder output (ball_y correlation 0.991 vs 0.966 without predictor), but the margin is small. The predictor matters more in environments with action-dependent dynamics.
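Concretely, because the action input is a zero vector, the action weights contribute nothing and the prediction depends only on the frame embeddings. A toy sketch with a hypothetical linear predictor (all shapes and names are invented for illustration):

```python
import numpy as np

def predict_next(frames_z, actions, W_frames, W_actions):
    """Hypothetical predictor: linear in stacked frame embeddings + actions."""
    x = np.concatenate([np.concatenate(frames_z), actions])
    W = np.concatenate([W_frames, W_actions], axis=1)
    return W @ x

d = 4
frames_z = [np.ones(d)] * 3            # last 3 frame embeddings
actions = np.zeros(2)                  # zero-vector: paddle action not fed back
W_frames = np.ones((d, 3 * d)) / (3 * d)
W_actions = np.random.randn(d, 2)      # multiplied by zeros, so no effect
z_next = predict_next(frames_z, actions, W_frames, W_actions)
```

Whatever values `W_actions` holds, `z_next` is unchanged, which is exactly why the prediction reduces to a learned inertia prior over recent frames.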