# Visual MPC
Sample K candidate action sequences, run them all in one server-side forward pass, score the rollouts, and pick the best. The `predict_batch` method makes this one HTTP call.
## The pattern
```python
import numpy as np
import dream

client = dream.Client()
model = client.models.get("dreamdojo-2b-gr1")

def score(rollout: dream.Rollout) -> float:
    """Your task-specific reward. Could be:
    - distance to a goal frame
    - a learned reward model
    - a pixel-space cost like 'is the cup upright'
    """
    return reward_fn(rollout.frames)  # rollout.frames is (48, 480, 640, 3) uint8

# ── 1. Sample K candidate action sequences ────────────────────────────
K = 8
candidates = sample_candidates(K, T=48, action_dim=384)
# shape: (8, 48, 384) float32 — typically perturbations of a base plan

# ── 2. Run them all in one server roundtrip ───────────────────────────
batch = model.predict_batch(start_frame=current_frame, actions=candidates)

# ── 3. Score and pick ─────────────────────────────────────────────────
scores = [score(r) for r in batch]
best_idx = max(range(K), key=scores.__getitem__)
best_actions = candidates[best_idx]

print(f"K={K}, total cost ${batch.cost_usd}, wall {batch.wall_s:.2f}s")
batch[best_idx].save("best_rollout.mp4")
```

## Why batch over gather
Three reasons `predict_batch` beats firing K independent `model.predict` calls in parallel:
- Fused server-side forward. The K candidates share the same start-frame encoding and the diffusion model batches them, so wall-clock time is nearly flat in K: K=8 is roughly 25% slower than K=1, not 8×.
- One round trip. One TLS handshake, one redirect-follow, one response. K independent calls pay that transport overhead K times over.
- Same cost. $0.0005/frame whether batched or not: a K=8 batch and K=8 independent calls both bill K × frames-billed × $0.0005.
For DreamDojo on an H100, K=8 takes ~3.2 s end-to-end on a warm container. The same K=8 via `asyncio.gather` would take ~16 s.
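The billing claim is easy to sanity-check with plain arithmetic. A minimal sketch, using the per-frame price and the 49-frames-billed figure quoted later on this page (the constant names are mine, not the SDK's); batching changes wall time, not the bill:

```python
PRICE_PER_FRAME = 0.0005  # $/frame, same whether batched or gathered
FRAMES_BILLED = 49        # billed per rollout (per the Cost discipline section)

def rollout_cost() -> float:
    return FRAMES_BILLED * PRICE_PER_FRAME

def batch_cost(k: int) -> float:
    # Linear in K: the server fuses the forward pass but bills per frame.
    return k * rollout_cost()

print(f"K=8 either way: ${batch_cost(8):.3f}")  # → K=8 either way: $0.196
```

The only thing batching buys you is latency; the per-frame meter runs identically in both cases.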
## Sampling K candidates
The right way to generate a `(K, T, action_dim)` array depends on your problem. Common patterns:
```python
# Random search around a reference plan
ref = np.load("base_plan.npy")  # (48, 384)
noise = np.random.randn(K, 48, 384) * 0.1
candidates = (ref[None, :, :] + noise).astype(np.float32)

# Cross-entropy method — sample from a learned proposal distribution
candidates = cem_sample(prior_dist, K=8, T=48)

# Action-space lattice — sweep over a few discrete strategies
candidates = np.stack([base, base + d, base - d, base * 1.1])
```

## Scoring options
For most physics-grounded tasks, score on the predicted frames:
- Goal-distance — compute optical-flow / feature distance between the rollout's last frame and a target image.
- Learned reward model — pass `rollout.frames` through a vision reward network trained on human ratings.
- Latent prediction error — encode each frame with a VAE / DINO and measure trajectory smoothness in latent space.
Avoid scoring on raw pixel-MSE against a target — it's a notoriously poor proxy for task success.
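As a concrete sketch of the goal-distance option: compare the rollout's last frame to the goal in a pooled feature space rather than raw pixels. The `embed` below is a stand-in (coarse block averaging) for a real extractor like DINO; both function names are illustrative, not part of the dream SDK:

```python
import numpy as np

def embed(frame: np.ndarray) -> np.ndarray:
    """Placeholder feature extractor: (480, 640, 3) uint8 → 144-dim vector.
    Swap in DINO / VAE features in practice; block means stand in here so
    the sketch runs without model weights."""
    patches = frame.reshape(6, 80, 8, 80, 3).mean(axis=(1, 3))  # (6, 8, 3)
    return patches.ravel() / 255.0

def goal_distance_score(frames: np.ndarray, goal_frame: np.ndarray) -> float:
    """Higher is better: cosine similarity between the rollout's last
    frame and the goal image, computed in feature space."""
    a, b = embed(frames[-1]), embed(goal_frame)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

The same shape works for the learned-reward option: replace `embed` with your reward network and drop the goal argument.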
## Real-time MPC loop
```python
import asyncio
import dream

async def control_loop():
    async with dream.AsyncClient() as client:
        model = await client.models.get("dreamdojo-2b-gr1")
        current_frame = capture_camera_frame()
        while not done():
            candidates = sample_candidates(K=8, T=48, action_dim=384)
            batch = await model.predict_batch(
                start_frame=current_frame,
                actions=candidates,
            )
            scores = [score(r) for r in batch]
            best = candidates[max(range(8), key=scores.__getitem__)]
            execute_first_action(best[0])
            current_frame = capture_camera_frame()
```

Per-step wall: ~3.5 s on a warm container, dominated by the engine forward. The real-time loop budget depends on your task; for slow manipulation (cup-pouring, button-pressing) this is workable.
## Cost discipline
A batch costs roughly K × the per-rollout cost, where a GR-1 rollout is $0.0245 (49 frames billed × $0.0005). The actual `batch.cost_usd` comes in slightly lower because the server amortizes the shared start frame across K.
| K | Cost / batch (≈) | Cost / 1K batches (≈) |
|---|---|---|
| 4 | $0.098 | $98 |
| 8 | $0.196 | $196 |
| 16 | $0.392 | $392 |
If you're running a 1-Hz MPC loop for an hour, you'll hit ~3,600 batches. At K=8 that's ~$700. Tune K against your reward variance — often K=4 with smarter sampling beats K=16 with random search.
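To budget a run before launching it, the arithmetic above folds into a small helper. A sketch using the prices quoted in this section (the function name is mine); the real `batch.cost_usd` total will come in slightly lower thanks to start-frame amortization:

```python
PRICE_PER_FRAME = 0.0005   # $/frame on GR-1 (from this section)
FRAMES_BILLED = 49         # billed per rollout

def mpc_hourly_cost(k: int, hz: float = 1.0) -> float:
    """Upper bound on the hourly bill for a K-candidate MPC loop replanning
    at `hz` Hz. Ignores the small start-frame amortization discount."""
    batches_per_hour = 3600 * hz
    return batches_per_hour * k * FRAMES_BILLED * PRICE_PER_FRAME

print(f"K=8 @ 1 Hz: ~${mpc_hourly_cost(8):.0f}/hour")  # ~$706/hour
print(f"K=4 @ 1 Hz: ~${mpc_hourly_cost(4):.0f}/hour")  # ~$353/hour
```

Halving K halves the bill, which is why K=4 with a smarter proposal distribution is usually the first thing to try.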