RIGI: Rectifying Image-to-3D Generation Inconsistency
via Uncertainty-aware Learning

Jiacheng Wang, Zhedong Zheng, Wei Xu, Ping Liu

Abstract

Image-to-3D generation aims to predict a geometrically and perceptually plausible 3D model from a single 2D image. Conventional approaches typically follow a cascaded pipeline: first generating multi-view projections from the single input image through view synthesis, then optimizing 3D geometry and appearance using these projections. However, geometric and photometric inconsistencies in the synthesized views destabilize the differentiable rendering optimization, resulting in topologically flawed meshes and perceptually unrealistic material properties, particularly near occluded areas and boundaries. These limitations stem from two critical sources of uncertainty: epistemic uncertainty, arising from incomplete viewpoint coverage, and aleatoric uncertainty, caused by noise and inconsistencies in the generated multi-view frames. To address these challenges, we propose an uncertainty-aware optimization framework that explicitly models and mitigates both uncertainty types. For epistemic uncertainty, we employ a multiple sampling strategy that dynamically varies camera elevations and progressively integrates diverse viewpoints into training, enhancing viewpoint coverage and stabilizing optimization. For aleatoric uncertainty, we estimate an uncertainty map from the discrepancies between two independently optimized Gaussian models. This map is incorporated into uncertainty-aware regularization, which dynamically adjusts loss weights to suppress unreliable supervision. Furthermore, we provide a theoretical analysis of uncertainty-aware optimization by deriving a probabilistic upper bound on the expected generation error, which offers insight into its effectiveness. Extensive experiments demonstrate that our method significantly reduces artifacts and inconsistencies, leading to higher-quality 3D generation.
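
To make the aleatoric part of this framework concrete, the uncertainty-aware regularization can be viewed as a per-pixel reweighting of the reconstruction loss. The formulation below is a hedged sketch rather than the exact objective from the paper; the symbols $\hat{I}$ (rendered image), $I_{\text{pl}}$ (pseudo-label frame), $u(x)$ (estimated uncertainty at pixel $x$), and $\lambda$ (regularization weight, varied in the ablation further down) are illustrative notation:

\[
\mathcal{L}_{\text{ua}} \;=\; \sum_{x} \frac{1}{1 + \lambda\, u(x)} \,\bigl\lVert \hat{I}(x) - I_{\text{pl}}(x) \bigr\rVert_{1}
\]

Pixels where the two Gaussian models disagree receive a high $u(x)$ and therefore a small weight, so unreliable pseudo-label regions contribute less to the optimization.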



Methodology

Overview of the RIGI pipeline, which takes a reference image as input and produces 3D assets as output. We adopt a two-stage approach: a multi-view video diffusion model first generates dense, high-quality frames, which then serve as pseudo-labels to guide 3D asset optimization. Specifically, we use SV3D to generate multiple videos covering a wide range of viewpoints. Next, we introduce uncertainty-aware learning, estimating an uncertainty map by leveraging the stochasticity of two simultaneously optimized Gaussian models. Finally, we apply uncertainty-aware regularization to mitigate the impact of inconsistencies in the generated pseudo-labels, resulting in high-quality and visually compelling 3D assets.
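
The snippet below is a minimal PyTorch-style sketch of how the uncertainty map and the uncertainty-weighted loss described above could be computed from the two Gaussian models' renders. It is an illustration under our own assumptions, not the official implementation; the names estimate_uncertainty_map, uncertainty_weighted_l1, render_a, render_b, and lambda_reg are hypothetical.

# Minimal sketch (not the official implementation) of aleatoric-uncertainty
# weighting: two independently optimized Gaussian models render the same
# pseudo-label viewpoint, their disagreement becomes a per-pixel uncertainty
# map, and that map down-weights the photometric loss.
import torch

def estimate_uncertainty_map(render_a: torch.Tensor,
                             render_b: torch.Tensor) -> torch.Tensor:
    """Per-pixel uncertainty from the discrepancy between two renders.

    render_a, render_b: (B, 3, H, W) images rendered by the two Gaussian
    models from the same camera. Larger disagreement -> higher uncertainty.
    """
    # Channel-averaged absolute difference as a simple discrepancy measure.
    diff = (render_a - render_b).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    # Normalize to [0, 1] per image so the weighting is scale-free.
    peak = diff.flatten(1).max(dim=1).values.view(-1, 1, 1, 1)
    return diff / (peak + 1e-8)

def uncertainty_weighted_l1(render: torch.Tensor,
                            pseudo_label: torch.Tensor,
                            uncertainty: torch.Tensor,
                            lambda_reg: float = 5.0) -> torch.Tensor:
    """Down-weight the photometric loss where the pseudo-label is unreliable."""
    weights = 1.0 / (1.0 + lambda_reg * uncertainty)   # (B, 1, H, W)
    per_pixel = (render - pseudo_label).abs()          # (B, 3, H, W)
    return (weights * per_pixel).mean()

if __name__ == "__main__":
    # Toy example with random tensors standing in for real renders/pseudo-labels.
    a = torch.rand(1, 3, 64, 64)
    b = torch.rand(1, 3, 64, 64)
    label = torch.rand(1, 3, 64, 64)
    u = estimate_uncertainty_map(a, b)
    print(uncertainty_weighted_l1(a, label, u).item())

In this sketch the weighting 1 / (1 + lambda_reg * u) mirrors the role of the regularization weight λ studied in the ablation below: λ=0 disables the reweighting, while larger values suppress supervision from high-uncertainty pixels more aggressively.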

Visual Results

Gallery: each example pairs an input image with a rendered video of the 3D asset generated by RIGI.


Comparison Results

Qualitative comparison (left to right): reference image, TriplaneGaussian, LGM, DreamGaussian, V3D, Hi3D, and ours.


Ablation Results

Impact of Multiple Sampling

Panels (left to right): reference image, constant elevations, +dynamic elevations, +multiple frames, +progressive sampling.

Design of Uncertainty Estimation

Panels (left to right): reference image, w/o uncertainty, learnable uncertainty, ensemble uncertainty, and ours.

Impact of the Uncertainty Regularization Weight

Panels (left to right): reference image, λ=0, λ=1, λ=5, λ=10.