Wonder3D: What Is Cross-Domain Diffusion?

1 Jan 2025

Abstract and 1 Introduction

2. Related Works

2.1. 2D Diffusion Models for 3D Generation

2.2. 3D Generative Models and 2.3. Multi-view Diffusion Models

3. Problem Formulation

3.1. Diffusion Models

3.2. The Distribution of 3D Assets

4. Method and 4.1. Consistent Multi-view Generation

4.2. Cross-Domain Diffusion

4.3. Textured Mesh Extraction

5. Experiments

5.1. Implementation Details

5.2. Baselines

5.3. Evaluation Protocol

5.4. Single View Reconstruction

5.5. Novel View Synthesis and 5.6. Discussions

6. Conclusions and Future Works, Acknowledgements and References

4.2. Cross-Domain Diffusion

Our model is built upon pre-trained 2D stable diffusion models [45] to leverage their strong generalization. However, current 2D diffusion models [31, 45] are designed for a single domain, so the main challenge is how to effectively extend stable diffusion models so that they can operate on more than one domain.

Naive Solutions. To achieve this goal, we explore several possible designs. A straightforward solution is to add four more channels to the output of the UNet module to represent the extra domain, so that the diffusion model can output the normal and color image domains simultaneously. However, we observe that such a design suffers from slow convergence and poor generalization. This is because the channel expansion may perturb the pre-trained weights of stable diffusion models and therefore cause catastrophic forgetting.
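To make the channel-expansion idea concrete, here is a minimal sketch in NumPy. It is an illustration under stated assumptions, not the authors' code: a toy linear projection stands in for the UNet's final layer, and the function name `expand_output_channels`, the feature width of 320, and the initialization scale are all hypothetical. It shows that the extra four output channels have no pre-trained counterpart and must start from fresh weights.

```python
import numpy as np

def expand_output_channels(weight, bias, extra=4, seed=0):
    """Append `extra` freshly initialised output channels to a projection layer.

    weight: (out_channels, in_channels) pre-trained matrix
    bias:   (out_channels,) pre-trained bias
    """
    rng = np.random.default_rng(seed)
    in_channels = weight.shape[1]
    # The new rows have no pre-trained values, so they are drawn at random (assumed scale).
    new_w = rng.normal(0.0, 0.02, size=(extra, in_channels))
    new_b = np.zeros(extra)
    return np.concatenate([weight, new_w], axis=0), np.concatenate([bias, new_b])

# Toy stand-in for a pre-trained 4-channel output projection (assumed width 320).
pretrained_w = np.ones((4, 320))
pretrained_b = np.zeros(4)

w, b = expand_output_channels(pretrained_w, pretrained_b)
features = np.ones(320)
out = w @ features + b  # entries 0-3: original domain, entries 4-7: extra domain
```

In this toy setting the pre-trained rows are copied unchanged, but during fine-tuning the gradients from the untrained rows flow back through the shared features, which is one plausible reading of how the expansion can disturb the pre-trained weights.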

Figure 4. The structure of the multi-view cross-domain transformer block.

The domain switcher s is first encoded via positional encoding [39] and then concatenated with the time embedding. This combined representation is injected into the UNet of the stable diffusion model. Interestingly, experiments show that this subtle modification does not significantly alter the pre-trained priors. As a result, it allows fast convergence and robust generalization without requiring substantial changes to the stable diffusion model.

Cross-domain Attention. Using the proposed domain switcher, the diffusion model can generate two different domains. However, for a single view there is no guarantee that the generated color image and normal map will be geometrically consistent. To address this, we introduce a cross-domain attention mechanism that exchanges information between the two domains, so that the generated outputs align well in both geometry and appearance.
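Returning to the domain switcher, the conditioning step (encode s with a positional encoding, then concatenate it with the time embedding) can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the switcher is treated as a scalar label (0 for color, 1 for normal), and the embedding sizes and the names `sinusoidal_encoding` and `conditioning_vector` are assumptions made for the example.

```python
import numpy as np

def sinusoidal_encoding(value, dim=8):
    """Sinusoidal positional encoding of a scalar into `dim` features."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def conditioning_vector(timestep, domain, time_dim=16, domain_dim=8):
    """Concatenate the time embedding with the encoded domain switcher."""
    time_emb = sinusoidal_encoding(timestep, time_dim)
    switch_emb = sinusoidal_encoding(domain, domain_dim)
    return np.concatenate([time_emb, switch_emb])

# Assumed labelling of the two domains.
COLOR, NORMAL = 0.0, 1.0

cond_color = conditioning_vector(500, COLOR)
cond_normal = conditioning_vector(500, NORMAL)
```

At the same timestep the two vectors share their time-embedding part and differ only in the switcher part, so one set of network weights can be steered toward either domain without altering its architecture.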

Figure 5. The qualitative results of Wonder3D on various styles of images.

The cross-domain attention layer has the same structure as the original self-attention layer and is inserted before the cross-attention layer in each transformer block of the UNet, as depicted in Figure 4. In the cross-domain attention layer, the keys and values from the normal and color image domains are combined and processed through attention operations. This design ensures that the generation of color images and normal maps is closely correlated, thus promoting geometric consistency between the two domains.
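The combination of keys and values can be sketched with a single-head attention in NumPy. This is a simplified illustration under assumptions, not the released implementation: it omits multi-head splitting, the output projection, and residual connections, and the function name `cross_domain_attention`, the token count, and the feature size are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_domain_attention(h_color, h_normal, w_q, w_k, w_v):
    """Each domain's queries attend over keys/values pooled from both domains.

    h_color, h_normal: (tokens, dim) features of the same view in each domain.
    """
    # Keys and values of the two domains are concatenated along the token axis.
    k = np.concatenate([h_color @ w_k, h_normal @ w_k], axis=0)  # (2 * tokens, dim)
    v = np.concatenate([h_color @ w_v, h_normal @ w_v], axis=0)  # (2 * tokens, dim)
    scale = 1.0 / np.sqrt(k.shape[-1])
    outputs = []
    for h in (h_color, h_normal):
        q = h @ w_q                      # (tokens, dim)
        attn = softmax(q @ k.T * scale)  # (tokens, 2 * tokens)
        outputs.append(attn @ v)         # (tokens, dim)
    return outputs[0], outputs[1]

rng = np.random.default_rng(0)
tokens, dim = 6, 16
h_color = rng.normal(size=(tokens, dim))
h_normal = rng.normal(size=(tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

out_color, out_normal = cross_domain_attention(h_color, h_normal, w_q, w_k, w_v)
```

Because the color queries see the normal-domain keys and values (and vice versa), changing the features of one domain changes the output of the other, which is the coupling the layer relies on.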

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Xiaoxiao Long, The University of Hong Kong, VAST, MPI Informatik (equal contribution);

(2) Yuan-Chen Guo, Tsinghua University, VAST (equal contribution);

(3) Cheng Lin, The University of Hong Kong (corresponding author);

(4) Yuan Liu, The University of Hong Kong;

(5) Zhiyang Dou, The University of Hong Kong;

(6) Lingjie Liu, University of Pennsylvania;

(7) Yuexin Ma, ShanghaiTech University;

(8) Song-Hai Zhang, Tsinghua University;

(9) Marc Habermann, MPI Informatik;

(10) Christian Theobalt, MPI Informatik;

(11) Wenping Wang, Texas A&M University (corresponding author).