[lepong] pixel-input jepa
A small world model plays pong by watching pixels — it never sees game state, only a 128x128 screenshot. Move your mouse to play against it.
What is JEPA? Joint Embedding Predictive Architecture — a world model that learns to predict the next state of a scene from pixel observations. Instead of generating future frames pixel by pixel (like a video model), it predicts in a compressed latent space, making it fast enough for real-time control.

This experiment: the left paddle is controlled entirely by a 13M-parameter JEPA world model. Roughly six times per second, this page captures a 128×128 screenshot of the game, sends it to a server running the model, and gets back a predicted ball position. The paddle moves to that prediction. The world model never receives game state; it only sees the raw pixel screenshot and must work out where the ball is going.
ai paddle mode
JEPA = world model reads pixels, predicts ball · CLASSICAL = physics oracle with ground truth · REACTIVE = follows ball.y, no prediction
occlusion (optional experiment)
Blacks out a portion of the image before the model sees it. You still see the full court. Try 40% and watch how the AI degrades.
01

pixels in, paddle out

At ~6 Hz the client renders a 128×128 frame, optionally occludes the right side, and sends the PNG to the server. The server runs a frozen 13M-param CNN JEPA (encoder → predictor → state head) and returns the predicted ball_y as the paddle target. Only 1,930 of the 13M parameters are trainable. No game state, only pixels.
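The last hop of that pipeline is just a linear read-out from the predicted embedding. A minimal numpy sketch, with random stand-in weights (not the trained checkpoint) and the assumption that ball_y occupies one slot of the 10-dim state vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes match the article's Linear(192, 10) state head; weights are random
# stand-ins, not the trained checkpoint.
W_head = rng.standard_normal((10, 192))
b_head = rng.standard_normal(10)

def decode_state(z_pred):
    """Read the 10-dim state vector from a predicted 192-dim embedding."""
    return W_head @ z_pred + b_head

z_pred = rng.standard_normal(192)        # pretend output of encoder -> predictor
state = decode_state(z_pred)
ball_y = state[0]                        # which slot holds ball_y is an assumption
n_trainable = W_head.size + b_head.size  # 192*10 + 10 = 1,930
```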

02

the occlusion toggle

At 0% the task is trivial — any tracker that sees the ball wins. Toggle to 40% (press e) and the right side goes dark before the model sees it. The model must imagine the ball through the blind zone. You still see the full court via the purple overlay. Watch the cyan JEPA ball and the error metrics as occlusion increases.
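What the toggle does, in sketch form. The page's actual masking runs client-side in JavaScript; this is an illustrative Python equivalent:

```python
import numpy as np

def occlude_right(frame, frac=0.40):
    """Black out the rightmost `frac` of the frame before the model sees it.
    Illustrative stand-in for the page's client-side masking."""
    out = frame.copy()
    cut = int(round(frame.shape[1] * (1 - frac)))  # first darkened column
    out[:, cut:] = 0.0
    return out

frame = np.ones((128, 128), dtype=np.float32)      # toy all-white frame
dark = occlude_right(frame, 0.40)
```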

03

what the metrics show

Live p50 / p95 / max error for JEPA vs a classical physics extrapolator, for both ball_y and ball_x. The classical baseline has ground truth (unfair advantage) — it's an oracle for comparison, not a competitor. Press R to reset the error buffers.
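The readout itself is plain order statistics over a buffer of per-frame errors. A sketch, where the normalization to % of court size is an assumption:

```python
import numpy as np

def error_stats(errors):
    """Median / p95 / max of a buffer of per-frame ball-position errors,
    expressed as % of court size (normalization assumed)."""
    e = 100 * np.asarray(errors, dtype=float)
    return {
        "median": float(np.percentile(e, 50)),
        "p95": float(np.percentile(e, 95)),
        "max": float(e.max()),
    }

buf = [0.01, 0.02, 0.03, 0.30]   # toy errors as fractions of court height
stats = error_stats(buf)
```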

what we built

The left paddle above is controlled entirely by a JEPA world model. It receives no game state — only a 128x128 pixel screenshot. From that image alone, it predicts where the ball will be and moves the paddle there.

The architecture is a JEPA (Joint Embedding Predictive Architecture): instead of generating future video frames pixel by pixel, it predicts in a compressed latent space. This makes it fast enough to run in real time.

PIPELINE
128×128 frame → custom CNN encoder (frozen, trained from scratch on Pong) → 192-dim embedding → 6-layer transformer predictor (frozen) → Linear(192, 10) state head → ball_y → paddle
all custom-trained, no pretrained models · 13M total params · 1,930 trainable (just the orange state head) · trained on 30K frames

"Frozen" means the encoder and predictor weights never change after initial training — only the tiny linear state head (1,930 parameters out of 13 million) learns to read ball positions from the predictor's output. This is the DexWM recipe.

what we found

FINDING 1 — IT WORKS

2.8% median ball_y error on the training distribution. The model decodes ball position from pixels accurately enough to play interactively against a human.

FINDING 2 — IT BREAKS UNDER DISTRIBUTION SHIFT

When the ball follows trajectories the model wasn't trained on (rallies driven by a random paddle rather than the AI-tracked paddle used to generate training data), controller success drops from 99.3% → 88.7%. Under 40% pixel occlusion (toggle it above with E), the model collapses to predicting its training mean.

FINDING 3 — DATA AUGMENTATION PARTIALLY FIXES IT

Retraining only the 1,930-param state head on occluded frames improves ball_x error by 58% at 40% occlusion. Unexpectedly, it also fixes the trajectory OOD drop: ball_y degradation goes from +125% → +6%.
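Because the head is a single linear layer over frozen embeddings, retraining it is cheap; it even reduces to least squares. A sketch of why the fix costs minutes rather than a full retrain, with random stand-ins for the embeddings and targets (the repo presumably uses SGD, not this closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are predictor embeddings of occlusion-augmented frames,
# paired with ground-truth 10-dim state targets from the simulator.
Z = rng.standard_normal((2000, 192))
targets = rng.standard_normal((2000, 10))

Zb = np.hstack([Z, np.ones((len(Z), 1))])         # append bias column
W, *_ = np.linalg.lstsq(Zb, targets, rcond=None)  # refit all 1,930 params at once
pred = Zb @ W
```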

FINDING 4 — PRETRAINED ENCODERS ARE 4-7x MORE ROBUST

As a comparison, we tested swapping our custom CNN encoder for a frozen DINOv2 (Meta's pretrained vision model, with zero Pong training). Under 40% occlusion, DINOv2's embeddings stay at cosine similarity 0.94 to their clean-frame counterparts, while our CNN's drop to 0.20. But our custom CNN + predictor wins at the actual game because it predicts ahead. Robustness and task precision are different things.
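The robustness metric here is cosine similarity between an encoder's embedding of the clean frame and of the occluded frame (our reading of the numbers above). The metric itself is one line; the vectors below are toys, not real encoder outputs:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
z_clean = rng.standard_normal(192)                    # embedding of the full frame
z_robust = z_clean + 0.3 * rng.standard_normal(192)   # robust encoder: small drift
z_brittle = rng.standard_normal(192)                  # brittle encoder: near-unrelated
```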

reproduce

# clone and install
git clone https://github.com/SotoAlt/lepong
cd lepong
pip install torch torchvision fastapi uvicorn pillow numpy websockets

# download model from huggingface
pip install huggingface_hub
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('sotoalt/lepong', 'lepong_statehead_occ_aug.pt', local_dir='checkpoints')"

# run the demo
python -m server.infer --checkpoint checkpoints/lepong_statehead_occ_aug.pt --port 8791
# open http://localhost:8791

Code, model weights, and training data: github.com/SotoAlt/lepong · huggingface.co/sotoalt/lepong

FAQ

Isn't this just a linear probe on frozen CNN features?

Mostly, yes. The state head is a linear layer reading from the predictor's output. The difference from a standard probe: it reads from the predictor's embedding (what the model thinks comes next), not the encoder's (what the model sees now). With zero actions passed to the predictor, the margin is small. We're transparent about this.

Can this model do multi-step rollout?

No. We tried — it collapses after 5-10 autoregressive steps (a known JEPA limitation). This is a 1-step state decoder with a predictor in the loop, not a long-horizon planner. Calling it a "world model" is technically correct (it predicts future state from observations) but it doesn't plan ahead the way you might imagine.
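A toy illustration of why autoregressive rollout drifts: feed a predictor its own output and small per-step errors compound. This uses a linear stand-in for the predictor, not the lepong transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "predictor": an orthogonal map scaled slightly above 1, so latent
# magnitude (and accumulated error) grows step over step.
Q, _ = np.linalg.qr(rng.standard_normal((192, 192)))
A = 1.05 * Q

z_true = rng.standard_normal(192)
z_roll = z_true.copy()
drift = []
for step in range(10):
    z_true = A @ z_true                                    # ideal next embedding
    z_roll = A @ z_roll + 0.05 * rng.standard_normal(192)  # rollout + per-step error
    drift.append(float(np.linalg.norm(z_true - z_roll)))
```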

Pong is trivial. Any CNN can learn ball position.

Correct. Pong was chosen because it trains in minutes on CPU. The point isn't "can a neural net play Pong" — it's the OOD analysis: what breaks, what fixes it, and why pretrained encoders help. Those findings transfer to harder domains.

The DINOv2 comparison is unfair — you gave it no predictor.

Deliberately so. We compared our full pipeline (encoder + predictor + state head) against DINOv2 as encoder only. The fair comparison would be DINOv2 + a trained predictor on top — which is what jepa-wms does, and it would likely win at both robustness and task performance. The point of our test was to isolate the encoder robustness gap, not to claim our CNN is better overall.

Is the "accidental fix" on trajectory OOD statistically significant?

The per-frame decoder result (ball_y OOD drop +125% → +6%, measured on 2000 frames) is robust. The controller result (+2.0 pts on 150 trials) is within noise. We're confident about the decoder finding, less confident about the controller number.

The model passes zero actions to the predictor. Is it actually predicting?

The predictor receives zero-vectors for actions because we don't feed the paddle action back. So the "prediction" is mostly the predictor's learned inertia prior: "given these 3 frames, what embedding comes next." It IS adding value over raw encoder output (ball_y correlation 0.991 vs 0.966 without predictor), but the margin is small. The predictor matters more in environments with action-dependent dynamics.