Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

Desai Xie^1,2, Jiahao Li^1,3, Hao Tan¹, Xin Sun¹, Zhixin Shu¹, Yi Zhou¹, Sai Bi¹, Sören Pirk⁴, Arie E. Kaufman²,

¹Adobe Research ²Stony Brook University ³TTIC ⁴Kiel University

Finetunes multi-view diffusion models with RL to improve multi-view consistency and NeRF quality, without relying on ground truth multi-view dataset.
Plus a novel metric for evaluating the consistency of multi-view diffusion models.

UPDATE 02/26/2024: Carve3D has been accepted to CVPR 2024!

UPDATE 03/04/2024: Video results are available.

UPDATE 04/14/2024: Updated this website and our arXiv paper to include changes in the CVPR 2024 camera ready version.

arxiv Code & Data

Figure 1. Our Carve3D algorithm steadily improves the 3D consistency of a multi-view diffusion model and the resulting quality of the NeRF and the mesh, without sacrificing its image-prompt alignment, texture details, or realism. Here, we show 3 testingset results (in 3 rows, numbered as 1-3, separated by dotted lines) from the finetuning process (epoch 0, 28, and 55 in 3 columns). Each row includes the generated multi-view images (denoted as MV), video and a sample frame of the reconstructed NeRF and extracted mesh (frame is denoted as RM and video is denoted as RMV) and the text prompt (denoted as TP). The inconsistencies in the multi-view images, e.g. the facing direction of the shopping cart, the position of the octopus arms, and the position of the pencils, lead to artifacts in the NeRF and the mesh (highlighted in red).

Abstract

Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models, have driven recent breakthroughs in text-to-3D research. However, due to the limited size and quality of existing 3D datasets, they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT), which allows models to learn from the data generated by themselves and improve beyond their dataset limitations during SFT. To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images, we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model, which we denote as Carve3DM, demonstrates superior multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models, mirroring the standard Large Language Model (LLM) alignment pipeline.

Overview

Teaser Figure

Figure 2. Overview of Carve3D. Given a prompt sampled from our curated prompt set and a initial noisy image, we iteratively denoise the image using the UNet. The final, clean image contains four multi-view images tiled in a 2-by-2 grid. MRC reward is computed by comparing (a) the generated multi-view images with (c) the corresponding multi-view images rendered at the same camera viewpoints from (b) the reconstructed NeRF. Then, we train the model with policy gradient loss function, where the loss is derived from the reward and log probabilities of the UNet’s predictions, accumulated over all denoising timesteps. By using only a set of training text prompts, our RLFT algorithm finetunes the diffusion model by evaluating its own generated outputs, without relying on ground truth multi-view images.

Improving Multi-view Consistency and NeRF Quality

Figure 3. Qualitative comparison of Instant3D (the base model) and Carve3D (ours, finetuned from Instant3D) on 12 prompts (in 12 blocks, numbered as 1-12, separated by dotted line). In each block, we show the their generated multi-view images in the 2-by-2 grid (denoted as MV), video and a sample frame of the reconstructed NeRF and extracted mesh (frame is denoted as RM and video is denoted as RMV) when given the text prompt (denoted as TP). For each result, we use the same randomly sampled initial noise for all models to ensure the comparison is fair. We draw red boxes on the NeRF and the extracted mesh to highlight the artifacts in the NeRF and the mesh, resulting from the inconsistencies in the multi-view images. Carve3D maintains the detailed texture and provides improved multi-view consistency and higher quality NeRF than the base Instant3D.

Comparing to Base Model with Longer SFT and Existing Methods

Teaser Figure

Figure 4. Qualitative comparison of MVDream [46], Instant3D [24] with 10K (the base model), 20K, and 100K SFT steps, Carve3D (ours, finetuned from Instant3D-10K), Zero123++ [47], and SyncDreamer [28] (7 models in 7 columns) on 4 prompts (in 4 rows, numbered as 1-4, separated by dotted line). In each row, we show their generated multi-view images in the 2-by-2 grid (denoted as MV), reconstructed NeRF and extracted mesh (denoted as RM) when given the text prompt (denoted as TP). MVDream, Zero123++, and SyncDreamer generates inconsistent multi-view images and reconstruction artifacts (highlighted in red). For each result, we use the same randomly sampled initial noise for all models to ensure the comparison is fair. We let Zero123++ and SyncDreamer to use one of Carve3D’s output multi-view images as their input image conditioning. Instant3D-10K, -20K, and -100K and Carve3D demonstrates progressively better multi-view consistency and reconstruction quality. Instant3D-10K, Carve3D, Zero123++, and SyncDreamer exhibits the best texture details and realism, whereas Instant3D-20K and -100K with prolonged SFT steps compromise those qualities.

Maintaining Diversity

Teaser Figure

Figure 5. Qualitative comparison of Instant3D (the base model) and Carve3D (ours, finetuned from Instant3D) on 4 prompts (in 4 rows, numbered as 1-4, separated by the dotted line) demonstrating diversity. In each row, we show 3 results from each model, including the generated multi-view images in the 2-by-2 grid (denoted as MV), the reconstructed NeRF and the extracted mesh (denoted as bottom) when given the prompt (denoted as middle). For each result, we use the same randomly sampled initial noise for all models to ensure the comparison is fair. Our RLFT maintains the diversity of the base Instant3D model, while improving the consistency.

BibTeX

@misc{xie2023carve3d,
    title={Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning},
    author={Desai Xie and Jiahao Li and Hao Tan and Xin Sun and Zhixin Shu and Yi Zhou and Sai Bi and Sören Pirk and Arie E. Kaufman},
    year={2023},
    eprint={2312.13980},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}