Image-to-3D generation aims to predict a geometrically and perceptually plausible 3D model from a single 2D image. Conventional approaches typically follow a cascaded pipeline: first generating multi-view projections from the single input image through view synthesis, then optimizing 3D geometry and appearance using these projections. However, geometric and photometric inconsistencies in the synthesized views destabilize the differentiable rendering optimization, resulting in topologically flawed meshes and perceptually unrealistic material properties, particularly near occluded areas and boundaries. These limitations stem from two critical sources of uncertainty: epistemic uncertainty, arising from incomplete viewpoint coverage, and aleatoric uncertainty, caused by noise and inconsistencies in the generated multi-view frames. To address these challenges, we propose an uncertainty-aware optimization framework that explicitly models and mitigates both uncertainty types. For epistemic uncertainty, we employ a multiple sampling strategy that dynamically varies camera elevations and progressively integrates diverse viewpoints into training, enhancing viewpoint coverage and stabilizing optimization. For aleatoric uncertainty, we estimate an uncertainty map from the discrepancies between two independently optimized Gaussian models. This map is incorporated into uncertainty-aware regularization, dynamically adjusting loss weights to suppress unreliable supervision. Furthermore, we present a theoretical analysis of uncertainty-aware optimization, deriving a probabilistic upper bound on the expected generation error that offers insight into its effectiveness. Extensive experiments demonstrate that our method significantly reduces artifacts and inconsistencies, leading to higher-quality 3D generation.
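The aleatoric uncertainty mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`uncertainty_map`, `uncertainty_weighted_l1`), the L1 photometric loss, and the max-normalization of the discrepancy map are assumptions made for clarity. The idea is that pixels where two independently optimized Gaussian models disagree receive low loss weight, suppressing unreliable pseudo-label supervision.

```python
import numpy as np

def uncertainty_map(render_a, render_b, eps=1e-6):
    # Pixel-wise discrepancy between renders from two independently
    # optimized Gaussian models, averaged over color channels and
    # normalized to [0, 1] as an aleatoric uncertainty estimate.
    diff = np.abs(render_a - render_b).mean(axis=-1)
    return diff / (diff.max() + eps)

def uncertainty_weighted_l1(render, pseudo_label, u_map):
    # Uncertainty-aware regularization (illustrative form): down-weight
    # pixels with high estimated uncertainty so that inconsistent
    # pseudo-label regions contribute less to the optimization.
    per_pixel = np.abs(render - pseudo_label).mean(axis=-1)
    weights = 1.0 - u_map
    return float((weights * per_pixel).mean())
```

In this sketch the weight is simply `1 - u`; other schedules (e.g. exponential down-weighting) would fit the same template.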
Overview of the RIGI pipeline, which takes a reference image as input and produces 3D assets as output. We adopt a two-stage approach: SV3D, a multi-view video diffusion model, first generates multiple videos spanning a wide range of viewpoints, and these dense, high-quality frames then serve as pseudo-labels for 3D asset optimization. Next, we introduce uncertainty-aware learning, estimating an uncertainty map by leveraging the stochasticity of two simultaneously optimized Gaussian models. Finally, we apply uncertainty-aware regularization to mitigate the impact of inconsistencies in the generated pseudo-labels, yielding high-quality and visually impressive 3D assets.
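The epistemic-uncertainty side of the pipeline, sampling varied camera elevations for the generated orbit videos and progressively integrating viewpoints into training, can be sketched as below. All names (`sample_elevations`, `progressive_view_schedule`), the elevation range, and the reference-view-first ordering are illustrative assumptions, not the paper's exact schedule.

```python
import math
import random

def sample_elevations(num_orbits, elev_range=(-20.0, 40.0), seed=0):
    # Draw a random camera elevation (degrees) for each generated orbit
    # video, so the combined videos improve viewpoint coverage.
    rng = random.Random(seed)
    return [rng.uniform(*elev_range) for _ in range(num_orbits)]

def progressive_view_schedule(num_views, num_stages):
    # Progressively enlarge the set of supervising viewpoints: early
    # stages use azimuths near the reference view (index 0), later
    # stages add views farther around the orbit.
    order = sorted(range(num_views), key=lambda i: min(i, num_views - i))
    stage_size = math.ceil(num_views / num_stages)
    return [order[: (s + 1) * stage_size] for s in range(num_stages)]
```

Each training stage would then sample pseudo-label frames only from the view indices active at that stage, stabilizing early optimization before distant (and more uncertain) viewpoints are introduced.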