We introduce SimAvatar, a framework for generating simulation-ready clothed 3D human avatars from a text prompt. Current text-driven human avatar generation methods either model hair, clothing, and the human body as a single unified geometry or produce hair and garments that are difficult to adapt to existing simulation pipelines. The primary challenge lies in representing hair and garment geometry in a way that leverages established prior knowledge from foundational image diffusion models (e.g., Stable Diffusion) while remaining compatible with either physics-based or neural simulators. To address this challenge, we propose a two-stage framework that combines the flexibility of 3D Gaussians with simulation-ready hair strands and garment meshes. Specifically, we first employ three text-conditioned 3D generative models to generate a garment mesh, a body shape, and hair strands from the given text prompt. To leverage prior knowledge from foundational diffusion models, we attach 3D Gaussians to the body mesh, the garment mesh, and the hair strands, and learn the avatar appearance through optimization. To drive the avatar given a pose sequence, we first apply physics simulators to the garment meshes and hair strands. We then transfer the simulated motion onto the 3D Gaussians through mechanisms designed separately for each body part. As a result, our synthesized avatars have vivid textures and realistic dynamic motion. To the best of our knowledge, our method is the first to produce highly realistic, fully simulation-ready 3D avatars, surpassing the capabilities of current approaches.
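As an illustration of how motion on a simulated mesh can be transferred to attached 3D Gaussians, the following is a minimal sketch and not the SimAvatar implementation. It assumes, hypothetically, that each Gaussian center is bound to one mesh triangle by fixed barycentric coordinates, so that the centers follow the mesh whenever a simulator moves its vertices. The function names `bind_gaussians` and `gaussian_centers` are introduced here for illustration only; the abstract does not specify the actual binding or transfer mechanism, and Gaussian rotation, scale, and appearance are ignored.

```python
import numpy as np

def bind_gaussians(faces, per_face=1, seed=0):
    """Assign each Gaussian to a triangle with fixed barycentric weights.

    Hypothetical binding scheme: the weights are sampled once and reused
    for every simulated frame.
    """
    rng = np.random.default_rng(seed)
    face_ids = np.repeat(np.arange(len(faces)), per_face)
    # Uniform sampling inside a triangle via the square-root trick.
    u = rng.random((len(face_ids), 2))
    su = np.sqrt(u[:, 0])
    bary = np.stack([1.0 - su, su * (1.0 - u[:, 1]), su * u[:, 1]], axis=1)
    return face_ids, bary

def gaussian_centers(vertices, faces, face_ids, bary):
    """Recompute Gaussian centers from the current mesh vertex positions."""
    tri = vertices[faces[face_ids]]            # (G, 3 corners, 3 coords)
    return np.einsum("gk,gkd->gd", bary, tri)  # barycentric interpolation

# Toy "garment": a unit quad made of two triangles.
rest_vertices = np.array(
    [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float
)
faces = np.array([[0, 1, 2], [0, 2, 3]])

face_ids, bary = bind_gaussians(faces, per_face=4)
rest_centers = gaussian_centers(rest_vertices, faces, face_ids, bary)

# Stand-in for one simulator step: a rigid translation of the mesh.
simulated_vertices = rest_vertices + np.array([0.0, 0.0, 0.1])
moved_centers = gaussian_centers(simulated_vertices, faces, face_ids, bary)
```

Because the barycentric weights are fixed, any per-frame vertex positions produced by a simulator can be passed to `gaussian_centers` to obtain the updated Gaussian positions. Hair strands would need a different binding, for example along strand segments, which this sketch does not cover.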