IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Abstract

Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency.

In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation of surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training.

Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.

Results

Single image decomposition

Multi-view images under varying illumination

Overview

IDArb tackles intrinsic decomposition for an arbitrary number of views under unconstrained illumination. Our approach (a) achieves better multi-view consistency than existing learning-based methods and (b) disentangles intrinsic components from lighting effects better than optimization-based methods, thanks to learnt priors. Our method could enhance a wide range of applications such as image relighting and material editing, photometric stereo, and 3D reconstruction.

Architecture

Our training batch consists of \(N\) input images, sampled from \(N_v\) viewpoints and \(N_i\) illuminations. The latent vector for each image is concatenated with Gaussian noise for denoising. Intrinsic components are divided into three triplets (\(D=3\)): Albedo, Normal, and Metallic&Roughness. Specific text prompts guide the model toward the different intrinsic components. Inside each attention block of the UNet, we introduce a cross-component and cross-view attention module, in which attention is applied across components and across views to facilitate global information exchange.
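The cross-component and cross-view attention described above can be pictured as two reshapes of the same token tensor. The following is a minimal NumPy sketch, not the authors' implementation: the tensor layout `(N, D, T, C)`, the function names, and the use of plain self-attention without learned projections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Scaled dot-product self-attention over the token axis (-2).
    # Learned query/key/value projections are omitted for brevity.
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    weights = softmax(tokens @ np.swapaxes(tokens, -1, -2) * scale, axis=-1)
    return weights @ tokens

def cross_component_attention(x):
    # x: (N, D, T, C). Tokens of all D components of one image attend jointly.
    n, d, t, c = x.shape
    return self_attention(x.reshape(n, d * t, c)).reshape(n, d, t, c)

def cross_view_attention(x):
    # x: (N, D, T, C). Tokens of all N images of one component attend jointly.
    n, d, t, c = x.shape
    y = x.transpose(1, 0, 2, 3).reshape(d, n * t, c)
    y = self_attention(y)
    return y.reshape(d, n, t, c).transpose(1, 0, 2, 3)

# Hypothetical sizes: N images, D = 3 components, T tokens, C channels.
rng = np.random.default_rng(0)
N, D, T, C = 4, 3, 16, 8
latents = rng.standard_normal((N, D, T, C))
out = cross_view_attention(cross_component_attention(latents))
```

In this sketch, cross-component attention never mixes tokens from different images, while cross-view attention does. Applying one after the other lets information flow across both axes, and because neither reshape depends on a fixed `N`, the same code runs for any number of input images.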

ARB-Objaverse Dataset

Our custom dataset contains 5.7M multi-view RGB images and intrinsic components with varying illumination scenarios. For each object, we render 12 views. For each viewpoint, we render 7 images under different lighting conditions. Six images are illuminated by randomly sampled high-dynamic range (HDR) environment maps. The last image is illuminated by two point light sources randomly positioned on a surrounding shell.
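The rendering grid above implies a fixed number of images per object. The sketch below works through that arithmetic and shows one way two point lights could be placed on a surrounding shell. It assumes that the 5.7M figure counts RGB renderings only, that light positions are drawn uniformly on a sphere, and that the shell radius is 4.0; the radius and the helper `sample_points_on_shell` are hypothetical and do not come from the dataset description.

```python
import numpy as np

VIEWS_PER_OBJECT = 12
LIGHTINGS_PER_VIEW = 7   # 6 HDR environment maps + 1 two-point-light setup
TOTAL_IMAGES = 5.7e6     # assumed to count RGB renderings only

# 12 views x 7 lighting conditions = 84 RGB renderings per object.
images_per_object = VIEWS_PER_OBJECT * LIGHTINGS_PER_VIEW

# Rough object count implied by the totals (on the order of 68 thousand).
approx_objects = TOTAL_IMAGES / images_per_object

def sample_points_on_shell(rng, count, radius):
    # Normalizing Gaussian samples gives directions uniform on the sphere.
    v = rng.standard_normal((count, 3))
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical radius; the source does not specify the shell size.
point_lights = sample_points_on_shell(rng, count=2, radius=4.0)
```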

BibTeX

@article{li2024idarb,
  author    = {Li, Zhibing and Wu, Tong and Tan, Jing and Zhang, Mengchen and Wang, Jiaqi and Lin, Dahua},
  title     = {IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations},
  journal   = {arXiv preprint arXiv:2412.12083},
  year      = {2024},
}