DreamDrone

Hanyang Kong 1 Dongze Lian 1 Michael Bi Mi 2 Xinchao Wang 1

 1  Learning and Vision Lab, National University of Singapore
 2  Huawei International Pte. Ltd.

Example text prompts for the generated fly-through scenes:
- a desolate street in a Doomsday ruins style.
- aerial view of city, lego style, high-resolution.
- a winding trail in a fairy tale forest.
- a wide boulevard in a retro-futuristic style.
- a mountain village in the Japanese Ukiyo-e woodblock print style.
- a corridor in a medieval castle.
- a grand hallway in a Baroque-style palace.
- a high building in a Steampunk world.
- a long, narrow corridor in an abandoned hospital from a horror game.
- Backyards of Old Houses in Antwerp in the Snow, van Gogh.

 
TL;DR: zero-shot, training-free perpetual view (fly-through) scene generation from text prompts.

Abstract

We introduce DreamDrone, an innovative method for generating unbounded flythrough scenes from textual prompts. Central to our method is a novel feature-correspondence-guidance diffusion process, which utilizes the strong correspondence of intermediate features in the diffusion model. Leveraging this guidance strategy, we further propose an advanced technique for editing the intermediate latent code, enabling the generation of subsequent novel views with geometric consistency. Extensive experiments reveal that DreamDrone significantly surpasses existing methods, delivering highly authentic scene generation with exceptional visual quality. This approach marks a significant step in zero-shot perpetual view generation from textual prompts, enabling the creation of diverse scenes, including natural landscapes like oases and caves, as well as complex urban settings such as Lego-style street views. Code is available at https://github.com/HyoKong/DreamDrone.


How It Works

Starting from a real or generated RGBD pair ($I$, $D$) at the current view, we apply DDIM backward (inversion) steps with a pre-trained U-Net to obtain the intermediate latent code $x_{t_1}$ at timestep $t_1$. A low-pass warping strategy then warps this latent code into the next novel view. A few DDPM forward steps from timestep $t_1$ to $t_2$ follow, enlarging the degrees of freedom of the warped latent code. In the reverse process, we apply the pre-trained U-Net to recover the novel view from $x_{t_2}'$, using a cross-view self-attention module and feature-correspondence guidance to maintain consistency between $x_{t_2}$ and $x_{t_2}'$. The right side of the figure shows the warped images and our generated novel view $I'$. Our method greatly alleviates blurring, inconsistency, and distortion. The overall pipeline is zero-shot and training-free.
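The control flow above can be sketched in code. The following is a minimal, self-contained PyTorch sketch under simplifying assumptions, not the released implementation: ToyUNet, low_pass, warp_to_next_view, renoise, and guided_denoise are illustrative placeholders (the actual method uses a pre-trained text-conditioned U-Net, a depth-based warp, and cross-view self-attention, which is omitted here), and the guidance is simplified to an L2 term on the noise prediction rather than correspondence on intermediate U-Net features.

    import torch
    import torch.nn.functional as F

    # Standard DDPM noise schedule (assumed linear here).
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


    class ToyUNet(torch.nn.Module):
        """Stand-in for the pre-trained text-conditioned epsilon-prediction U-Net.
        Conditioning inputs are accepted but ignored in this toy."""
        def __init__(self, ch=4):
            super().__init__()
            self.net = torch.nn.Conv2d(ch, ch, 3, padding=1)

        def forward(self, x, t, text_emb=None):
            return self.net(x)


    def ddim_invert(unet, x0, t1, text_emb):
        """DDIM backward (inversion) steps: clean latent x0 -> noisy latent x_{t1}."""
        x = x0
        for t in range(t1):
            a_t, a_next = alphas_cumprod[t], alphas_cumprod[min(t + 1, T - 1)]
            eps = unet(x, t, text_emb)
            x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        return x


    def low_pass(x, k=5):
        """Low-pass filter applied before warping (a simple box blur here)."""
        return F.avg_pool2d(F.pad(x, (k // 2,) * 4, mode="replicate"), k, stride=1)


    def warp_to_next_view(x, flow):
        """Warp the low-pass-filtered latent into the next camera view.
        `flow` is a per-pixel offset in normalized [-1, 1] coordinates; in practice
        it would be derived from the depth map D and the camera motion."""
        b, _, h, w = x.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1)[None].repeat(b, 1, 1, 1) + flow
        return F.grid_sample(low_pass(x), grid, align_corners=True)


    def renoise(x_t1, t1, t2):
        """DDPM forward steps t1 -> t2: add noise to enlarge the degrees of freedom."""
        a1, a2 = alphas_cumprod[t1], alphas_cumprod[t2]
        noise = torch.randn_like(x_t1)
        return (a2 / a1).sqrt() * x_t1 + (1.0 - a2 / a1).sqrt() * noise


    def guided_denoise(unet, x, x_ref, t2, text_emb, steps=10, scale=1.0):
        """Reverse DDIM from t2 -> 0. Each step nudges the noise prediction toward
        the reference view's prediction; the actual method instead matches
        corresponding intermediate U-Net features across views."""
        ts = torch.linspace(t2, 0, steps).long()
        for i in range(steps - 1):
            t, t_next = ts[i].item(), ts[i + 1].item()
            x = x.detach().requires_grad_(True)
            eps = unet(x, t, text_emb)
            loss = F.mse_loss(eps, unet(x_ref, t, text_emb).detach())
            grad = torch.autograd.grad(loss, x)[0]
            eps = (eps - scale * grad).detach()
            a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
            x0_pred = (x.detach() - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        return x


    # Usage sketch: x0 stands for the latent of the current view I; the flow field
    # stands for the offsets computed from depth D and the desired camera motion.
    unet = ToyUNet()
    x0 = torch.randn(1, 4, 64, 64)
    flow = 0.02 * torch.randn(1, 64, 64, 2)
    x_t1 = ddim_invert(unet, x0, t1=50, text_emb=None)
    x_t2_warped = renoise(warp_to_next_view(x_t1, flow), t1=50, t2=80)
    x_t2_ref = renoise(x_t1, t1=50, t2=80)
    x0_next = guided_denoise(unet, x_t2_warped, x_t2_ref, t2=80, text_emb=None)

In the full pipeline, the resulting latent would be decoded to the novel-view image $I'$, its depth re-estimated, and the procedure repeated from the new view, which is what makes the fly-through unbounded.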



Paper

    

DreamDrone. arXiv preprint arXiv:2312.08746, 2023.

[paper]

 

Bibtex


    @misc{kong2023dreamdrone,
        title={DreamDrone}, 
        author={Hanyang Kong and Dongze Lian and Michael Bi Mi and Xinchao Wang},
        year={2023},
        eprint={2312.08746},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }

 

Acknowledgments

The website design is borrowed from this website. Thanks!