Our model starts from an input image (e.g., generated by a T2I model). The image is fed into an image-to-video diffusion model to generate a turntable-like video. The video is then passed to 3D Gaussian Splatting, which directly reconstructs the 3D object using image-based losses for robustness. Optionally, renders of the object are generated and fed back to the video diffusion model, and the process is repeated for refinement.
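The control flow described above can be sketched as a short loop. This is a minimal illustration only, not the authors' implementation: the function `image_to_3d`, its three callable parameters, and the `refinement_rounds` argument are hypothetical names introduced here. In particular, how the renders condition the video diffusion model is not specified in the text, so the sketch simply passes them as a second argument.

```python
from typing import Any, Callable, Optional, Sequence

Frames = Sequence[Any]  # placeholder type for a list of video frames or renders


def image_to_3d(
    image: Any,
    generate_video: Callable[[Any, Optional[Frames]], Frames],
    fit_gaussians: Callable[[Frames], Any],
    render_views: Callable[[Any], Frames],
    refinement_rounds: int = 0,
) -> Any:
    """Sketch of the image -> video -> 3D loop (all names are hypothetical).

    generate_video: stands in for the image-to-video diffusion model; its
        second argument is the optional set of renders fed back to it.
    fit_gaussians: stands in for 3D Gaussian Splatting fitted to the frames
        with image-based losses.
    render_views: stands in for rendering the current 3D object.
    """
    # First pass: no renders exist yet, so the video is generated from the image alone.
    frames = generate_video(image, None)
    model = fit_gaussians(frames)

    # Optional refinement: render the object, feed the renders back, and refit.
    for _ in range(refinement_rounds):
        renders = render_views(model)
        frames = generate_video(image, renders)
        model = fit_gaussians(frames)

    return model
```

With `refinement_rounds=0` the sketch reduces to a single generate-then-reconstruct pass. Each additional round adds one render, one video generation, and one reconstruction.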
Human Evaluation
We perform a human evaluation of IM-3D against state-of-the-art Image-to-3D and Text-to-3D methods. Human raters prefer IM-3D to all competitors in both generation quality and faithfulness, often by a large margin.