Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via 3D Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatially-varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture.
Method
Autoregressive Inference Pipeline: WorldWarp generates video chunk-by-chunk using an autoregressive loop. At each iteration, we process the available history to estimate camera poses and initialize a 3D point cloud. This geometry is used to optimize a 3D Gaussian Splatting (3DGS) representation, which serves as a high-fidelity 3D cache. We then render forward-warped images ("priors") for the upcoming chunk. These geometric priors, along with a descriptive text prompt, are fed into our ST-Diff model. The model refines these hints, filling in occlusions and correcting distortions to produce the final video chunk, which then becomes the history for the next iteration.
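The control flow of this loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_video`, `build_cache`, `render_priors`, and `refine` are hypothetical names, the three callables stand in for pose estimation plus 3DGS optimization, forward warping, and ST-Diff respectively, and frames are replaced by integers so the example runs without any model. It also assumes that only the most recent chunk is used as history, which is one reading of the description above.

```python
from typing import Callable, List, Sequence

def generate_video(
    initial_frames: Sequence[int],
    prompt: str,
    num_chunks: int,
    chunk_len: int,
    build_cache: Callable[[List[int]], object],
    render_priors: Callable[[object, int], List[int]],
    refine: Callable[[List[int], str], List[int]],
) -> List[int]:
    """Chunk-by-chunk autoregressive generation (toy sketch, hypothetical API)."""
    video = list(initial_frames)
    history = list(initial_frames)
    for _ in range(num_chunks):
        # Stand-in for: estimate camera poses, initialize a point cloud,
        # and optimize the 3DGS cache from the available history.
        cache = build_cache(history)
        # Stand-in for: render forward-warped priors for the upcoming chunk.
        priors = render_priors(cache, chunk_len)
        # Stand-in for: ST-Diff fills occlusions and corrects distortions.
        chunk = refine(priors, prompt)
        video.extend(chunk)
        # Assumption: the newly generated chunk becomes the next history.
        history = list(chunk)
    return video

# Toy stand-ins: a "frame" is an integer index, the "cache" is the last frame seen.
video = generate_video(
    initial_frames=[0],
    prompt="toy prompt",
    num_chunks=3,
    chunk_len=4,
    build_cache=lambda history: history[-1],
    render_priors=lambda cache, n: [cache + i + 1 for i in range(n)],
    refine=lambda priors, prompt: list(priors),
)
print(video)
```

Passing the three stages in as callables keeps the loop itself independent of any particular pose estimator, renderer, or diffusion backbone.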
Training with Spatially-Varying Noise: To train the model for the "fill-and-revise" task, we prepare a composite latent sequence by mixing warped priors (valid regions) with ground-truth features (occluded regions). A spatially-varying noise schedule is applied, where valid regions receive partial noise and blank regions receive pure noise. The resulting noisy latents are fed into our model \( G_\theta \), which is trained to predict the target velocity (defined as \( \boldsymbol{\epsilon}_t - \mathbf{z}_t \)), forcing it to learn the flow from the noisy composite latent back towards the original ground-truth latent sequence \( \mathcal{Z} \).
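The composite latent and the spatially-varying noise levels can be illustrated on a tiny 1D example. This is a sketch under stated assumptions rather than the training code: the function names, the `partial_level` value of 0.3, and the linear interpolation between latent and Gaussian noise are all assumptions chosen for illustration, and real latents would be multi-dimensional tensors rather than Python lists.

```python
import random

def composite_latent(warped, ground_truth, valid_mask):
    """Use the warped prior where the mask is 1, the ground truth where it is 0."""
    return [w if m else g for w, g, m in zip(warped, ground_truth, valid_mask)]

def noise_level_map(valid_mask, partial_level):
    """Per-position noise level: partial noise on valid positions, full noise (1.0) on blank ones."""
    return [partial_level if m else 1.0 for m in valid_mask]

def add_noise(latent, levels, rng):
    """Assumed noising rule: linear interpolation between latent and Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [(1.0 - t) * z + t * e for z, t, e in zip(latent, levels, noise)]
    return noisy, noise

rng = random.Random(0)
warped = [0.5, 0.4, 0.0, 0.0]        # last two positions are holes left by warping
ground_truth = [0.6, 0.5, 0.9, 0.8]
valid_mask = [1, 1, 0, 0]            # 1 = warped content available, 0 = blank

z = composite_latent(warped, ground_truth, valid_mask)
levels = noise_level_map(valid_mask, partial_level=0.3)  # 0.3 is an arbitrary example value
noisy, noise = add_noise(z, levels, rng)
```

At blank positions the noise level is 1.0, so the noisy latent there is pure noise and the model must generate content from scratch; at valid positions most of the warped signal survives, so the model only has to revise it.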
Generated Scenes
Results generated from single images or text prompts.
Cinematic anime-style empty classroom at golden hour. Warm sunlight streaming through windows, dust particles floating in light beams. Shadows lengthening on wooden desks. 4k, nostalgic atmosphere.
A vibrant, sunlit greenhouse teems with a lush array of colorful flowers and verdant foliage. The glass roof allows ample sunlight to filter through, casting a warm glow throughout the space. The environment suggests a meticulously maintained botanical garden, with rows of blooming flowers and dense greenery filling the room.
Lush green Sago Palm cycad housed in a long, modern white concrete planter within a brightly lit atrium. The feathery fronds sway gently in a subtle airflow. Tiled flooring reflects the overhead lights and daylight. Peaceful, photorealistic atmosphere. 4k.
Stop-motion animation of a whimsical clay forest floor. The camera slowly pans across vibrant green clay grass, red mushrooms with white spots, and a wavy blue clay river. The blue river ripples gently, and the grass blades and mushrooms subtly wiggle. Soft, playful lighting highlights the textured clay.
Modern minimalist living room facing an ocean sunset. Gentle waves rolling on the horizon, warm orange light reflecting on the polished floor. Photorealistic 4k.
Cozy fantasy library, curved wooden shelves filled with books. Candles flickering on the central table, warm light glowing. Slow dolly forward. 4k.
Aerial drone shot of Marina Bay Sands and ArtScience Museum at blue hour. Warm city lights glowing against the twilight sky, vibrant reflections shimmering on the water. Ships moving slowly in the distant strait, clouds drifting. 8k, majestic skyline.
Cinematic night street in Tokyo, neon signs glowing and flickering. Rain falling heavily on wet asphalt, colorful reflections shimmering on the ground. Pedestrians walking in the distance. Slow camera dolly forward. 4k, cyberpunk atmosphere.
A white bicycle leaning against a black metal park bench. Green trees in the background swaying slightly in a gentle breeze. Soft overcast lighting, depth of field. Photorealistic.
A round wooden garden table with a dried palm leaf in a coconut vase. Lush green bushes and ivy in the background rustling gently in the wind. Soft natural daylight, peaceful backyard atmosphere. 4k.
A massive Triceratops fossil skull in a modern museum. Soft gallery lighting reflecting on the bone texture. 4k, realistic documentary style.
A large Sago Palm in a modern white planter inside a bright atrium. Green fronds swaying gently, tiled floor reflecting overhead lights. 4k, architectural style.
Longer video? Let's try!
A vibrant, sunlit greenhouse teems with a lush array of colorful flowers and verdant foliage. The glass roof allows ample sunlight to filter through, casting a warm glow throughout the space. The environment suggests a meticulously maintained botanical garden, with rows of blooming flowers and dense greenery filling the room.
Cinematic anime-style empty classroom at golden hour. Warm sunlight streaming through windows, dust particles floating in light beams. Shadows lengthening on wooden desks. 4k, nostalgic atmosphere.
A round wooden garden table with a dried palm leaf in a coconut vase. Lush green bushes and ivy in the background rustling gently in the wind. Soft natural daylight, peaceful backyard atmosphere. 4k.
A white bicycle leaning against a black metal park bench. Green trees in the background swaying slightly in a gentle breeze. Soft overcast lighting, depth of field. Photorealistic.
Modern minimalist living room facing an ocean sunset. Gentle waves rolling on the horizon, warm orange light reflecting on the polished floor. Photorealistic 4k.
A large Sago Palm in a modern white planter inside a bright atrium. Green fronds swaying gently, tiled floor reflecting overhead lights. 4k, architectural style.
A photorealistic scene of a historic cannon situated in front of a dilapidated stone fortress under a bright blue sky dotted with fluffy white clouds. The cannon, mounted on a wooden carriage, rests amidst a rugged landscape of grassy terrain and scattered rocks.
A vibrant, stylized rural landscape unfolds under a clear blue sky dotted with fluffy clouds. Rows of lush green vines stretch towards the horizon, bordered by a simple metal fence. Beyond the vineyard, rolling hills covered in verdant fields extend into the distance, punctuated by clusters of trees and small white buildings nestled among the greenery.
BibTeX
@misc{kong2025worldwarp,
  title={WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion},
  author={Hanyang Kong and Xingyi Yang and Xiaoxu Zheng and Xinchao Wang},
  year={2025},
  eprint={2512.19678},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}