WorldCache

Overview

Why simple feature reuse breaks in world models

Standard training-free caching methods reuse stale activations whenever average drift looks small. That shortcut can fail in dynamic scenes, where local motion and perceptually important objects change long before the global average signals trouble.
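A toy illustration of this failure mode, assuming drift is measured as a mean absolute difference over feature positions. The values, the `threshold`, and the function names are hypothetical and not taken from the paper:

```python
def mean_abs_drift(prev, curr):
    """Average absolute change over all positions (the global signal)."""
    return sum(abs(c - p) for p, c in zip(prev, curr)) / len(prev)


def max_abs_drift(prev, curr):
    """Largest absolute change at any single position (a local signal)."""
    return max(abs(c - p) for p, c in zip(prev, curr))


# 100 feature positions: 96 belong to a static background, 4 to a moving object.
prev = [0.0] * 100
curr = [0.0] * 96 + [1.0] * 4

global_drift = mean_abs_drift(prev, curr)  # diluted by the static background
local_drift = max_abs_drift(prev, curr)    # dominated by the moving object

threshold = 0.05  # hypothetical fixed reuse threshold
reuse = global_drift < threshold

print(f"global={global_drift:.2f} local={local_drift:.2f} reuse={reuse}")
```

In this toy case the global average stays under the threshold, so a mean-based rule would reuse the cache even though the moving region has changed completely.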

Abstract

WorldCache is a training-free caching framework for diffusion-transformer world models. It improves both the decision of when to reuse features and the approximation of skipped computation, using four components: motion-adaptive thresholds, saliency-weighted drift estimation, optimal feature approximation, and adaptive threshold scheduling across denoising steps.

Preprint available on arXiv: https://arxiv.org/abs/2603.22286

01

Static averages hide local motion

Large backgrounds mask meaningful changes in hands, agents, or manipulated objects.

02

Copying the past is too crude

Frozen snapshots create ghosting, blur, and motion drift once the rollout diverges.

03

Late denoising offers more reuse

After the global layout forms, better caching decisions produce the largest gains.

WorldCache treats caching like a careful prediction, not a blind shortcut.

Video world models spend much of their time repeating similar computation across denoising steps. WorldCache saves time by reusing deep features only when the scene is stable enough, then estimating the skipped features with motion-aware approximation.

Training-free · Motion-aware · Saliency-aware · Cosmos + WAN2.1 + EgoDex

Method

The four modules in WorldCache

WorldCache changes the skip rule, the reuse rule, and the threshold schedule so that caching follows motion and denoising phase instead of relying on a single fixed heuristic.

WorldCache pipeline diagram. — Probe blocks estimate drift. Cache hits route through OFA; cache misses execute deep blocks and refresh the cache.
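The control flow in the caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_step`, the toy `probe` and `deep` functions, and the threshold value are hypothetical, and the `approximate` function is a simple residual-reuse stand-in rather than the paper's OFA.

```python
def run_step(x, cache, probe, deep, approximate, drift, threshold):
    """One denoising step: returns (output, cache, was_cache_hit)."""
    probe_out = probe(x)  # cheap shallow blocks always run
    if cache is not None and drift(probe_out, cache["probe"]) < threshold:
        # Cache hit: skip the deep blocks and approximate their output.
        return approximate(probe_out, cache), cache, True
    # Cache miss: run the deep blocks and refresh the cache.
    deep_out = deep(probe_out)
    return deep_out, {"probe": probe_out, "deep": deep_out}, False


# Toy stand-ins for the network pieces (assumptions for illustration only).
probe = lambda x: [2.0 * v for v in x]
deep = lambda h: [v + 1.0 for v in h]
drift = lambda a, b: sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def approximate(probe_out, cache):
    # Stand-in: reuse the cached deep-minus-probe residual on the new probe output.
    return [p + (d - q) for p, d, q in zip(probe_out, cache["deep"], cache["probe"])]


cache, hits = None, []
for x in ([1.0, 2.0], [1.01, 2.0], [3.0, 2.0]):
    out, cache, hit = run_step(x, cache, probe, deep, approximate, drift, 0.05)
    hits.append(hit)

print(hits)
```

The first input always misses because the cache is empty, the nearly identical second input hits, and the third input drifts far enough to force a refresh.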

CFC

Causal Feature Caching


Saliency overlay on rollout frames. — SWD gives more weight to detail-rich regions where caching errors are easiest to notice.
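One way to picture saliency-weighted drift is as a weighted mean of per-position changes. In the sketch below the saliency weights are supplied by hand, and both the function name `weighted_drift` and the way the weights are chosen are assumptions; the source does not specify how the paper computes saliency.

```python
def weighted_drift(prev, curr, saliency):
    """Saliency-weighted mean absolute change between two feature vectors."""
    total = sum(saliency)
    return sum(w * abs(c - p) for p, c, w in zip(prev, curr, saliency)) / total


# Ten positions; only the last one (a detail-rich region) changes.
prev = [0.0] * 10
curr = [0.0] * 9 + [1.0]

uniform = [1.0] * 10            # plain average: every position counts equally
salient = [1.0] * 9 + [11.0]    # hypothetical map: the changed region matters more

plain = weighted_drift(prev, curr, uniform)
weighted = weighted_drift(prev, curr, salient)

print(f"plain={plain:.2f} weighted={weighted:.2f}")
```

With uniform weights the change is diluted across the frame, while the saliency-weighted score rises enough that a threshold test would be far more likely to trigger recomputation.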

Adaptive threshold scheduling comparison. — ATS captures more late-stage reuse by matching the threshold to the denoising phase.
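A hedged sketch of how a step-dependent threshold can unlock late-stage reuse. The linear schedule, the `early` and `late` values, and the constant per-step drift are all illustrative assumptions; the paper's actual schedule may differ.

```python
def scheduled_threshold(step, num_steps, early=0.02, late=0.10):
    """Linearly relax the reuse threshold as denoising progresses (assumed shape)."""
    progress = step / (num_steps - 1) if num_steps > 1 else 1.0
    return early + (late - early) * progress


num_steps = 11
drift_per_step = 0.05  # hypothetical constant drift between adjacent steps

# A fixed strict threshold never allows reuse at this drift level.
fixed_hits = sum(drift_per_step < 0.02 for _ in range(num_steps))

# The scheduled threshold allows reuse once it rises above the drift.
scheduled_hits = sum(
    drift_per_step < scheduled_threshold(step, num_steps)
    for step in range(num_steps)
)

print(fixed_hits, scheduled_hits)
```

The schedule stays strict while the global layout is still forming and permits reuse only in the later steps, which is where this page says most of the reuse opportunity lies.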

Results

Results across Cosmos, WAN2.1, and EgoDex-Eval


Speed-quality frontier

Latency comparison

Lower latency is better. Each bar shows the paper's reported runtime and speedup.


Scaling

How the gain changes with denoising step budget

Longer denoising trajectories give caching more opportunities. This view tracks latency and speedup across reported step budgets.
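The arithmetic behind this trend can be illustrated with a toy cost model. Everything here is an assumption made for illustration: a fixed number of warm-up steps that always run in full, a constant hit rate afterwards, and a cache hit that costs a fixed fraction of a full step. The numbers are not the paper's reported results.

```python
def estimated_speedup(num_steps, warmup_steps=10, hit_rate=0.5, hit_cost=0.2):
    """Toy speedup estimate, with latency measured in units of one full step."""
    cacheable = max(num_steps - warmup_steps, 0)
    hits = cacheable * hit_rate
    cached_latency = (num_steps - hits) + hits * hit_cost
    return num_steps / cached_latency


for budget in (20, 40, 80):
    print(budget, round(estimated_speedup(budget), 3))
```

Under these assumptions the warm-up cost is amortized over more steps as the budget grows, so the estimated speedup increases with the step budget.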

Step-budget comparison of WorldCache, DiCache, and the baseline: latency and speedup at each reported budget.

Visuals

Qualitative evidence across scenes and scales

WorldCache stays closer to the baseline rollout in dynamic and interaction-heavy regions where simpler caching strategies drift.

Main qualitative comparison. — WorldCache remains visually closer to the baseline in motion-sensitive regions.

Cosmos-2B crossing scene. — WorldCache better preserves pedestrian identity and background consistency.

Cosmos-14B kitchen scene. — In kitchen interaction, WorldCache keeps hands and carried objects more stable.

Additional qualitative result. — WorldCache maintains visual fidelity in challenging dynamic scenes.

Citation

Reference

If you find WorldCache useful in your research, please consider citing our work.

BibTeX

@article{nawaz2026worldcache,
  title = {WorldCache: Content-Aware Caching for Accelerated Video World Models},
  author = {Umair Nawaz and Ahmed Heakl and Ufaq Khan and Abdelrahman Shaker and Salman Khan and Fahad Shahbaz Khan},
  eprint = {2603.22286},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2603.22286},
  year = {2026}
}