Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computation to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with the inherent ambiguity between lighting and material. Learning-based approaches, by contrast, leverage rich material priors from existing 3D object datasets but have difficulty maintaining multi-view consistency.
In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate, multi-view consistent estimation of surface normals and material properties. This is made possible by a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset providing large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training.
Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applicability to realistic 3D content creation.
IDArb tackles intrinsic decomposition for an arbitrary number of views under unconstrained illumination. Our approach (a) achieves multi-view consistency, unlike prior learning-based methods, and (b) better disentangles intrinsic components from lighting effects via learned priors than optimization-based methods. Our method can enhance a wide range of applications, such as image relighting and material editing, photometric stereo, and 3D reconstruction.
Our training batch consists of \(N\) input images, sampled from \(N_v\) viewpoints and \(N_i\) illuminations. The latent vector of each image is concatenated with Gaussian noise for denoising. Intrinsic components are divided into three triplets (\(D=3\)): Albedo, Normal, and Metallic & Roughness. Specific text prompts guide the model toward each intrinsic component. Within each attention block of the UNet, we introduce cross-component and cross-view attention modules, where attention is applied across components and views to facilitate global information exchange; a minimal sketch of this pattern is given below.
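To make the attention pattern concrete, the PyTorch sketch below applies attention first across components and then across views by reshaping a (batch, view, component, token) tensor. The class name, tensor layout, and use of `nn.MultiheadAttention` are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
from torch import nn
from einops import rearrange

class CrossComponentViewAttention(nn.Module):
    """Sketch of attention applied across components, then across views.

    Hypothetical module: shapes and layer choices are assumptions made
    for illustration, not the paper's actual architecture.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.component_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D, T, C) — batch, N views, D=3 component triplets,
        # T spatial tokens per latent, C channels.
        B, N, D, T, C = x.shape
        # Cross-component attention: tokens of the same view attend across
        # the D triplets (Albedo, Normal, Metallic & Roughness).
        h = rearrange(x, 'b n d t c -> (b n) (d t) c')
        h, _ = self.component_attn(h, h, h)
        x = rearrange(h, '(b n) (d t) c -> b n d t c', n=N, d=D)
        # Cross-view attention: tokens of the same component attend across
        # the N views, encouraging multi-view consistency.
        h = rearrange(x, 'b n d t c -> (b d) (n t) c')
        h, _ = self.view_attn(h, h, h)
        return rearrange(h, '(b d) (n t) c -> b n d t c', n=N, d=D)
```

In a full model, a module of this kind would sit inside each UNet attention block alongside the existing self-attention layers.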
Our ARB-Objaverse dataset contains 5.7M multi-view RGB images and corresponding intrinsic components under varying illumination scenarios. For each object, we render 12 views; for each viewpoint, we render 7 images under different lighting conditions: six illuminated by randomly sampled high-dynamic-range (HDR) environment maps, and one illuminated by two point light sources randomly positioned on a surrounding shell. The sketch after this paragraph enumerates this rendering schedule.
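The per-object schedule (12 views × 7 lighting conditions = 84 images) can be summarized as below. `lighting_schedule`, the HDR file names, and the azimuth/elevation parameterization of the point lights are hypothetical placeholders, since the actual renderer is not specified here.

```python
import random

# Hedged sketch of the per-object rendering schedule described above.
# It only enumerates (view, lighting) pairs; the renderer itself and the
# HDR file names are hypothetical placeholders.
N_VIEWS, N_HDR_PER_VIEW = 12, 6

def lighting_schedule(hdr_pool):
    """Yield one (view_index, lighting_spec) pair per rendered image."""
    for view in range(N_VIEWS):
        # Six images under randomly sampled HDR environment maps.
        for _ in range(N_HDR_PER_VIEW):
            yield view, ("hdr_env", random.choice(hdr_pool))
        # One image lit by two point lights randomly placed on a
        # surrounding shell (here parameterized by azimuth/elevation).
        lights = [(random.uniform(0.0, 360.0), random.uniform(-90.0, 90.0))
                  for _ in range(2)]
        yield view, ("point_lights", lights)

# 12 views x 7 lighting conditions = 84 images per object.
schedule = list(lighting_schedule(hdr_pool=["studio.hdr", "outdoor.hdr"]))
assert len(schedule) == N_VIEWS * 7
```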
@article{li2024idarb,
  author  = {Li, Zhibing and Wu, Tong and Tan, Jing and Zhang, Mengchen and Wang, Jiaqi and Lin, Dahua},
  title   = {IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations},
  journal = {arXiv preprint arXiv:2412.12083},
  year    = {2024},
}