Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning


Keywords: RL finetuning for alignment, diffusion model, text-to-3D, view consistency.
Video results coming soon.
[Teaser figure]

Abstract

Recent advancements in the text-to-3D task leverage finetuned text-to-image diffusion models to generate multi-view images, followed by NeRF reconstruction. Yet, existing diffusion models tuned with supervised finetuning (SFT) still suffer from multi-view inconsistency and the resulting NeRF artifacts. Although longer SFT improves consistency, it also causes distribution shift, which reduces diversity and realistic details. We argue that the SFT of multi-view diffusion models resembles the instruction-finetuning stage of the LLM alignment pipeline and can benefit from RL finetuning (RLFT) methods. Essentially, RLFT methods optimize models beyond their SFT data distribution by training on their own outputs, effectively mitigating distribution shift. To this end, we introduce Carve3D, an RLFT method coupled with our Multi-view Reconstruction Consistency (MRC) metric, to improve the consistency of multi-view diffusion models. To compute MRC on a set of multi-view images, we compare them with the corresponding renderings of the reconstructed NeRF at the same viewpoints. We validate the robustness of MRC with extensive experiments conducted under controlled inconsistency levels. We enhance the base RLFT algorithm to stabilize training and reduce distribution shift, and we identify its scaling laws. Through qualitative and quantitative experiments, along with a user study, we demonstrate Carve3D's improved multi-view consistency, the resulting superior NeRF reconstruction quality, and minimal distribution shift compared to longer SFT.
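To make the metric concrete, below is a minimal sketch of how an MRC-style reward could be computed in Python. The reconstruct_nerf and render helpers are hypothetical stand-ins for the sparse-view NeRF reconstructor and its renderer, and the choice of LPIPS as the image distance is an illustrative assumption rather than the exact metric used.

import torch
import lpips  # perceptual image distance (pip install lpips); an assumed metric choice

def mrc_reward(views, cameras, reconstruct_nerf, render):
    """Multi-view Reconstruction Consistency (sketch).

    views:   (4, 3, H, W) generated multi-view images in [0, 1].
    cameras: the four camera poses the views were generated at.
    reconstruct_nerf, render: hypothetical helpers for sparse-view NeRF
    reconstruction and rendering (not part of any specific library).
    """
    nerf = reconstruct_nerf(views, cameras)   # reconstruct a NeRF from the 4 views
    rerendered = render(nerf, cameras)        # re-render at the same viewpoints
    perceptual = lpips.LPIPS(net="vgg").eval()
    with torch.no_grad():
        # If the views are mutually consistent, the NeRF can reproduce them,
        # so a lower perceptual distance means a higher reward.
        dist = perceptual(views * 2 - 1, rerendered * 2 - 1)  # LPIPS expects [-1, 1]
    return -dist.mean()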


Figure 2. Overview of Carve3D. Given a prompt sampled from the curated prompt set, we run the denoising process to obtain the final denoised image, which contains four multi-view images tiled in a 2-by-2 grid. The MRC reward is computed by comparing (a) the generated multi-view images with (c) the corresponding images rendered from (b) the reconstructed NeRF at the same camera viewpoints. We then train the model with a policy-gradient loss, derived from the reward and the log probabilities of the model's predicted noise, accumulated across all denoising timesteps. Using only a set of prompts, this RLFT process finetunes the diffusion model on its own outputs, without relying on ground-truth multi-view images.
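For intuition about the training signal described above, here is a minimal sketch of a REINFORCE-style policy-gradient loss for diffusion RLFT, assuming the per-timestep log probabilities and per-sample rewards have already been collected during sampling. The reward normalization is a common variance-reduction baseline assumed here for illustration; the full algorithm involves more machinery (e.g., importance sampling and clipping).

import torch

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss for diffusion RLFT (sketch).

    log_probs: (batch, T) log probabilities of the model's denoising steps
               x_{t-1} ~ p_theta(. | x_t, prompt) at each of T timesteps.
    rewards:   (batch,) MRC reward of each final denoised multi-view image.
    """
    # Normalize rewards into advantages (an assumed variance-reduction baseline).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Ascend on advantage-weighted log probabilities accumulated over all
    # denoising timesteps, i.e., descend on the negated objective.
    return -(advantages[:, None] * log_probs).sum(dim=1).mean()

In practice, each entry of log_probs would come from evaluating the Gaussian transition density of a denoising step under the current model, treating the denoising chain as a sequential decision process.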

Improving Multi-view Consistency and NeRF Quality


Figure 3. Qualitative comparison of Instant3D (the base model) and Carve3D (the model finetuned from Instant3D) on 12 prompts (in 12 blocks separated by dotted lines). In each block, given the prompt (middle), we show the generated multi-view images in a 2-by-2 grid (top) and the reconstructed NeRF with the extracted mesh (bottom). Red boxes highlight artifacts in the NeRF and the mesh that result from inconsistencies in the multi-view images. Carve3D maintains detailed textures while providing improved multi-view consistency and higher-quality NeRFs than the base Instant3D.

Maintaining Details and Realism


Figure 4. Qualitative comparison of MVDream, Instant3D with 10K, 20K, and 100K SFT steps, and Carve3D (five columns) on four prompts (four blocks separated by dotted lines). In each block, given the prompt (middle), we show the generated multi-view images in a 2-by-2 grid (top) and the reconstructed NeRF with the extracted mesh (bottom). Compared to the base Instant3D-10K, Carve3D maintains the detailed textures while providing improved multi-view consistency and higher-quality NeRFs; in contrast, the models with prolonged SFT of 20K and 100K steps exhibit worse detail and realism while providing only slightly improved consistency.

Maintaining Diversity


Figure 5. Diverse results from the original Instant3D (left) and our Carve3D (right) on four prompts (in four blocks separated by dotted lines). In each block, given the prompt (middle), we show the generated multi-view images in a 2-by-2 grid (top) and the reconstructed NeRF with the extracted mesh (bottom). Our RLFT improves consistency without compromising the diversity of the base Instant3D model.

BibTeX

@misc{xie2023carve3d,
    title={Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning},
    author={Desai Xie and Jiahao Li and Hao Tan and Xin Sun and Zhixin Shu and Yi Zhou and Sai Bi and Sören Pirk and Arie E. Kaufman},
    year={2023},
    eprint={2312.13980},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}