IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Abstract

Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency.

In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training.

Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.

Results

Single image decomposition

Multi-view images under varying illumination

Overview

IDArb tackles intrinsic decomposition for an arbitrary number of views under unconstrained illumination. Our approach (a) achieves multi-view consistency compared to learning-based methods and (b) better disentangles intrinsic components from lighting effects via learnt priors compared to optimization-based methods. Our method could enhance a wide range of applications such as image relighting and material editing, photometric stereo, and 3D reconstruction.

Architecture

Our training batch consists of \( N\) input images, sampled from \( N_v\) viewpoints and \( N_i\) illuminations. The latent vector for each image is concatenated with Gaussian noise for denoising. Intrinsic components are divided into three triplets (\( D=3\)): Albedo, Normal and Metallic&Roughness. Specific text prompts are used to guide the model toward different intrinsic components. For attention block inside UNet, we introduce cross-component and cross-view attention module into it, where attention is applied across components and views, facilitating global information exchange.

Arb-Objaverse Dataset

Our custom dataset contains 5.7M multi-view RGB images and intrinsic components with varying illumination scenarios. For each object, we render 12 views. For each viewpoint, we render 7 images under different lighting conditions. Six images are illuminated by randomly sampled high-dynamic range (HDR) environment maps. The last image is illuminated by two point light sources randomly positioned on a surrounding shell.

BibTeX

@inproceedings{
      li2025idarb,
      title={{IDA}rb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations},
      author={Zhibing Li and Tong Wu and Jing Tan and Mengchen Zhang and Jiaqi Wang and Dahua Lin},
      booktitle={The Thirteenth International Conference on Learning Representations},
      year={2025},
      url={https://openreview.net/forum?id=uuef1HP6X7}
      }