The Generative Bottleneck

Video World Models are Powerful
but Autoregressive Generation is Slow

Synthesizing high-quality spatiotemporal dynamics requires hundreds of denoising steps, each a full forward pass through a deep network, and much of that computation is redundant across steps.

The Flaw in Existing Solutions

Naive Caching Causes Motion Drift

Ground Truth

Stale Reuse

Global error averages hide local motion. Blindly copying old activations destroys temporal coherence long before a global loss metric triggers a refresh.
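The flaw can be shown numerically. In this illustrative sketch (not the actual WorldCache metric), a small fast-moving region produces a tiny global residual, so a naive global threshold never triggers a refresh even though local drift is severe:

```python
import numpy as np

def global_drift(prev, cur):
    """Mean absolute change over the whole feature map."""
    return float(np.abs(cur - prev).mean())

def local_drift(prev, cur):
    """Worst-case change at any single location."""
    return float(np.abs(cur - prev).max())

prev = np.zeros((64, 64))
cur = prev.copy()
cur[30:34, 30:34] = 5.0  # a small, fast-moving object

print(round(global_drift(prev, cur), 4))  # 0.0195: looks "safe" globally
print(local_drift(prev, cur))             # 5.0: severe local motion
```

A threshold tuned to the global average (say 0.05) would reuse this stale cache entry despite the object having moved.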

WorldCache

Content-Aware Caching for Accelerated World Models

A training-free framework that predicts skipped computations instead of blindly copying stale activations.

Smart Architecture

Motion-Aware Caching Logic

WorldCache treats caching as a local prediction problem: causal motion tracking sets the pace of cache refreshes, while skipped states are interpolated rather than copied.
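The control loop can be sketched as follows. All names here are illustrative, not the actual WorldCache API: at each skippable step, drift is compared against a tolerance, and the block is either recomputed or replaced by a cheap prediction of its output.

```python
def cached_step(compute, predict, drift, tolerance):
    """Run the expensive block only when drift exceeds the tolerance;
    otherwise substitute a cheap prediction of its cached output."""
    if drift > tolerance:
        return compute(), "computed"   # refresh: full forward pass
    return predict(), "predicted"      # skip: interpolate from cache

out, mode = cached_step(lambda: "full pass", lambda: "interpolated",
                        drift=0.05, tolerance=0.1)
print(out, mode)  # interpolated predicted
```

The key design choice is the `predict` branch: unlike naive caching, the skipped output is synthesized from cached history rather than reused verbatim.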

Core Components

Driven by Four Key Ideas

Causal Feature Caching

Dynamically scales the caching tolerance based on early-layer motion velocity.
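One way to realize this (the rule and parameter names are assumptions for illustration): faster early-layer motion shrinks the tolerance, forcing more frequent recomputation in dynamic scenes.

```python
def cache_tolerance(base_tol, motion_velocity, sensitivity=2.0):
    """Shrink the caching tolerance as observed motion speeds up."""
    return base_tol / (1.0 + sensitivity * motion_velocity)

print(cache_tolerance(0.10, motion_velocity=0.0))  # 0.1  (static scene)
print(cache_tolerance(0.10, motion_velocity=2.0))  # 0.02 (fast motion)
```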

Saliency Weighted Drift

Penalizes caching errors in perceptually critical high-frequency regions.
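A minimal sketch of such a metric, with the weighting scheme assumed: local gradient magnitude stands in for high-frequency, perceptually critical detail, so the same error costs more on an edge than in a flat region.

```python
import numpy as np

def saliency(x):
    """Gradient magnitude as a proxy for high-frequency detail."""
    gy, gx = np.gradient(x)
    return np.hypot(gx, gy)

def weighted_drift(prev, cur):
    """Drift between cached and current features, weighted by saliency."""
    w = saliency(cur)
    w = w / (w.sum() + 1e-8)  # normalize weights to a distribution
    return float((w * np.abs(cur - prev)).sum())

cur = np.zeros((8, 8)); cur[:, 4:] = 1.0        # a sharp vertical edge
flat_err = cur.copy(); flat_err[0, 0] += 0.5    # error in a flat region
edge_err = cur.copy(); edge_err[0, 4] += 0.5    # same-size error on the edge
print(weighted_drift(flat_err, cur) < weighted_drift(edge_err, cur))  # True
```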

Optimal Feature Approximation

Interpolates skipped cache states using trajectory matching.
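A sketch of trajectory-matched extrapolation, where the coefficient fit is an assumption: scale the last cached step so it best matches the observed transition, then project the skipped state forward instead of copying the stale one.

```python
import numpy as np

def fit_step_scale(f0, f1, f2):
    """Least-squares scale of the previous step that explains the current one."""
    d_prev, d_cur = f1 - f0, f2 - f1
    return float(d_cur @ d_prev) / (float(d_prev @ d_prev) + 1e-8)

def approximate_next(f1, f2, scale):
    """Extrapolate the skipped state along the fitted trajectory."""
    return f2 + scale * (f2 - f1)

f0, f1, f2 = np.array([0., 0.]), np.array([1., 2.]), np.array([2., 4.])
scale = fit_step_scale(f0, f1, f2)
print(approximate_next(f1, f2, scale))  # ≈ [3. 6.] on a linear trajectory
```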

Adaptive Scheduling

Exponentially relaxes caching constraints in later denoising stages.
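One hypothetical form of such a schedule (the rate parameter is assumed): the tolerance grows exponentially with denoising progress, so early steps cache conservatively and late steps cache aggressively.

```python
import math

def scheduled_tolerance(base_tol, step, total_steps, rate=3.0):
    """Exponentially relax the caching tolerance as denoising progresses."""
    progress = step / max(total_steps - 1, 1)
    return base_tol * math.exp(rate * progress)

print(round(scheduled_tolerance(0.1, 0, 50), 3))   # 0.1   (early: strict)
print(round(scheduled_tolerance(0.1, 49, 50), 3))  # 2.009 (late: relaxed)
```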

Empirical Benchmarks
2.30×
Uncompromised Speedup

Over baseline architectures while strictly maintaining visual fidelity, motion dynamics, and prompt adherence across Cosmos, WAN2.1, and DreamDojo.

Technical Rigor

Evaluation & Generalization

World Types
Image2World, Text2World
Model Backbones
Cosmos-Predict2.5 (2B), Cosmos-Predict2.5 (14B), WAN2.1 (1.3B), WAN2.1 (14B), DreamDojo (1.3B)
Benchmarks
PAI-Eval, EgoDex-Eval
Cosmos-Predict2.5 (14B) · WorldCache generation outputs

Flawless Temporal Coherence