Summary: IM-3D generates high-quality 3D assets from single images without SDS by fine-tuning video diffusion models.

Interactive Results

Explore reconstruction results (Gaussian Splats) below.


A diagram explaining the method in broad strokes, as described in the caption.
Our model starts from an input image (e.g., one generated by a T2I model). The image is fed into an image-to-video diffusion model, which generates a turntable-like video of the object. The video frames are then passed to 3D Gaussian Splatting, which directly reconstructs the 3D object using image-based losses for robustness. Optionally, renders of the reconstructed object are fed back to the video diffusion model, and the process is repeated for refinement.
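The control flow of this pipeline can be sketched as a short loop. This is only an illustrative outline, not the IM-3D implementation: the function `reconstruct_with_refinement`, its three callable arguments, and the `num_refinements` parameter are hypothetical names introduced here, and how the renders condition the video model on the second pass is left entirely to the caller. The toy stand-ins at the bottom only record the order of the steps so that the sketch runs without any model.

```python
from typing import Any, Callable, List, Optional

Frame = Any  # stand-in for an image or video frame
Asset = Any  # stand-in for a fitted 3D Gaussian Splatting model


def reconstruct_with_refinement(
    input_image: Frame,
    generate_video: Callable[[Frame, Optional[List[Frame]]], List[Frame]],
    fit_gaussians: Callable[[List[Frame]], Asset],
    render_views: Callable[[Asset, int], List[Frame]],
    num_refinements: int = 1,  # hypothetical knob; the caption only says "optionally"
) -> Asset:
    """Outline of the generate -> reconstruct -> (render -> regenerate) loop."""
    # 1. Image-to-video diffusion: produce a turntable-like video of the object.
    frames = generate_video(input_image, None)
    # 2. Fit the 3D representation directly to the generated frames.
    asset = fit_gaussians(frames)
    # 3. Optional refinement: render the object, feed the renders back to the
    #    video model, and reconstruct again from the new frames.
    for _ in range(num_refinements):
        renders = render_views(asset, len(frames))
        frames = generate_video(input_image, renders)
        asset = fit_gaussians(frames)
    return asset


# Toy stand-ins (assumptions for illustration only): they log each step
# instead of running a diffusion model or a Gaussian Splatting optimizer.
log: List[str] = []


def toy_generate_video(image, init_frames):
    log.append("video" if init_frames is None else "video+renders")
    return [image] * 4  # pretend the video has 4 frames


def toy_fit_gaussians(frames):
    log.append("fit")
    return {"num_training_views": len(frames)}


def toy_render_views(asset, num_views):
    log.append("render")
    return ["render"] * num_views


asset = reconstruct_with_refinement(
    "input.png", toy_generate_video, toy_fit_gaussians, toy_render_views,
    num_refinements=1,
)
print(log)    # order in which the pipeline steps ran
print(asset)
```

Passing the three stages in as callables keeps the sketch independent of any particular video model or splatting library; setting `num_refinements=0` gives the single-pass variant without the feedback step.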

Human Evaluation

A figure showing a human evaluation versus other methods.
We perform a human evaluation of IM-3D against state-of-the-art Image-to-3D and Text-to-3D methods. Human raters prefer IM-3D to all competitors in both generation quality and faithfulness, often by a large margin.