OmniFaceRig

OmniFaceRig: Fully Automatic Inner-Mouth-Aware Face Rigging Across Diverse 3D Character Topologies

OmniFaceRig fully automates 3D face rigging by converting static meshes into rigs with inner-mouth geometry and FACS blendshapes, supporting humans and diverse animals. It uniquely generates teeth, gums, and tongue with collision-aware blendshape transfer, enabling animation-ready avatars without manual input.

talking-head
face-animation
blendshape
avatar
3d-avatar

Demos

These demos show OmniFaceRig’s automatic pipeline that transforms static 3D character meshes into facial rigs with detailed inner-mouth geometry. Look for accurate facial expressions, seamless teeth and tongue integration, and versatility across humans and diverse animals with no manual setup. The results highlight quick, ready-to-animate rigs across varied character types.

OmniFaceRig teaser showing fully automatic inner-mouth-aware face rigging results on diverse 3D characters.

Authors: Chao Wang, Guangyao Ma, John Doublestein, Junming Chen, Yiming Lin, Zhaoen Su, Xiaomin Luo, Shiyang Cheng, Jie Shen, Doug Roble, Dilin Wang, Yilei Li, Rakesh Ranjan

Categories: cs.GR, cs.CV

Published 2026-06-06 · Updated 2026-06-06

Abstract

Facial rigging - creating FACS-based blendshapes together with inner-mouth geometry (teeth, gums, and tongue) - remains a major bottleneck in 3D character production. Existing pipelines still require substantial designer effort, especially for manual landmark annotation, per-character template adjustment, and inner-mouth placement. We present OmniFaceRig, a fully automatic end-to-end pipeline that converts a static surface-only 3D character mesh, with no pre-modeled oral cavity, into an inner-mouth-aware FACS rig with up to 155 blendshapes, procedurally fitted teeth, gums, and tongue, and re-packed UV/texture. OmniFaceRig supports diverse topologies - humans, humanoids, long-muzzled animals (e.g., dogs, wolves, foxes), and short-muzzled animals (e.g., cats, bears, rabbits, tigers) - with no manual landmarks, no user-provided templates, and no per-asset setup. The pipeline combines hybrid VLM+CV riggability checking, multi-model face parsing, dense keypoint-driven template registration, procedural inner-mouth construction, and collision-aware blendshape transfer. For non-human characters, OmniFaceRig selects topology-specific face and inner-mouth templates and uses collision-aware inner-mouth fitting to reduce teeth-face intersections without exposing users to category-specific tuning. We also publicly release Omni-Bench, a freely available benchmark dataset of 1,000 biped 3D characters with FACS facial blendshapes and inner-mouth geometry, spanning humans, humanoids, cats, dogs, and other animals. Experiments show high final rigging success on screened Omni-Bench inputs, nearly complete face detection recall from the segmentation ensemble and reliable inner-mouth placement with low penetration. Together, OmniFaceRig provides an automatic path from static generated characters to animation-ready facial rigs across both human and non-human topologies.

Introduction

OmniFaceRig addresses a production bottleneck that sits between character generation and animation: converting a static surface-only 3D character mesh into a fully usable facial rig with FACS blendshapes and explicit inner-mouth geometry. The paper argues that existing 3D generation systems increasingly produce visually plausible character meshes, but those meshes typically lack the deformation-ready face topology, blendshape controls, and oral-cavity structures needed for animation. In production, the missing pieces are expensive because they normally require manual landmarking, template adjustment, and hand placement of teeth, gums, and tongue. OmniFaceRig’s goal is to remove that manual work entirely while supporting both human and non-human topologies.

The core claim is a fully automatic end-to-end pipeline that starts from a static mesh with no pre-modeled oral cavity and outputs a production-ready facial rig with up to 155 blendshapes, procedurally fitted inner-mouth geometry, and re-packed UV/texture. Unlike prior systems that focus only on the outer face surface, OmniFaceRig explicitly synthesizes teeth, gums, and tongue and keeps them collision-aware during expression transfer. The method is designed to generalize across humans, humanoids, long-muzzled animals, and short-muzzled animals with no manual landmarks, no user-provided template, and no per-asset tuning.

OmniFaceRig teaser figure: OmniFaceRig takes static surface-only 3D meshes with no pre-modeled oral cavity (top row) and fully automatically produces production-ready facial rigs with FACS blendshapes and generated inner-mouth geometry including teeth, gums, and tongue (bottom rows showing expressions and mouth details). The teaser highlights our support for diverse 3D character topologies (humans, humanoids, long-muzzled animals such as dogs, and short-muzzled animals such as bears) within one unified rigging pipeline. Omni-Bench is released as a freely available benchmark of rigged biped characters spanning humans and animals.

What is new relative to prior face rigging?

Cross-topology face rigging: the same pipeline is intended to work across humans, humanoids, and multiple animal face families.
Automatic oral-cavity creation: teeth, gums, and tongue are synthesized procedurally rather than copied from a hand-authored template.
Blendshape output: the final rig is a FACS-style blendshape set, not just a neutral mesh or a latent deformation representation.
Unattended operation: the system removes manual correspondence annotation and per-character template setup, which are the usual bottlenecks in rigging pipelines.
Benchmark contribution: the paper also releases Omni-Bench, a large rigged dataset meant to support evaluation of automatic facial rigging on both human and animal assets.

Omni-Bench Dataset

Omni-Bench is presented as a public benchmark of 1,000 rigged biped 3D characters. The assets originate from a text/image-to-3D generation pipeline based on AssetGen2, and the paper states that all released assets are selected from inputs that pass the initial riggability screen and are then processed by OmniFaceRig. Every asset is provided in biped T-pose and ships with its generation provenance (text prompt and 2D reference image) alongside the final mesh, FACS blendshapes, and inner-mouth geometry.

Dataset composition

Omni-Bench composition as described in the paper.
Subset	Count	Description
Humans + humanoids	500	Realistic humans across 13 occupations, plus stylized fantasy, sci-fi, and cyberpunk humanoids.
Animals	500	150 cats, 150 dogs, and 200 other rigging-amenable animals such as bears, tigers, lions, foxes, wolves, rabbits, and deer.
Total	1,000	Biped assets in T-pose, all rigged by OmniFaceRig.

The paper emphasizes that Omni-Bench is useful for more than a single rigging benchmark. Because each asset includes text prompt, intermediate 2D image, and final 3D mesh, it can also support broader text-to-3D and multimodal research. The authors also position Omni-Bench as a dataset that fills a gap in existing face datasets by combining FACS blendshapes, inner-mouth geometry, full-character assets, and animal coverage in one release.

Method Overview

OmniFaceRig is a two-stage pipeline. Stage 1 performs face template fitting: it renders the input mesh to a frontal view, detects and crops the face, parses facial regions with a multi-model segmentation ensemble, runs a hybrid VLM+CV riggability check, extracts 2D and 3D keypoints, and fits a topology-specific canonical face template to the input. Stage 2 constructs the final blendshape rig: it fuses the fitted face into the original mesh, generates and places inner-mouth geometry, rebuilds UVs and texture, and transfers FACS blendshapes from a canonical rigged template.

OmniFaceRig pipeline architecture. The pipeline consists of two stages. Stage 1 (Face Template Fitting) takes a static surface-only 3D mesh, performs face detection/cropping, face segmentation, VLM+CV riggability assessment, landmark extraction, and dense keypoint-driven rigid + non-rigid template registration, producing a fitted template mesh aligned to the input face. Stage 2 (Blendshape Rig Construction) takes the fitted template and constructs a production-ready FACS blendshape rig through face mesh fusion, generated teeth registration and texture baking, and FACS blendshape transfer, producing inner-mouth geometry (teeth, gums, and tongue) and re-packed UV/texture that are not present in the input.

Why the pipeline is structured this way

Stage 1 is perception-heavy: its purpose is to robustly localize facial structure on a wide variety of generated meshes.
Stage 2 is geometry-heavy: once a fitted template exists, the system can fuse it into the original mesh, synthesize missing oral structures, and transfer expressions.
Canonical templates reduce complexity: rather than training a fully general regressor, the method uses a small library of topology-specific templates and deformation fitting.
Inner-mouth handling is explicit: teeth, gums, and tongue are not approximated implicitly; they are built as separate geometry and then made collision-aware.

Riggability assessment

Before full processing, OmniFaceRig applies a riggability checker to filter out assets that are unlikely to succeed downstream. The checker is a hybrid of a Vision-Language Model and computer vision signals. It is not merely a binary classifier; it also chooses downstream configuration, including the face template variant and the inner-mouth archetype.

The paper describes three sources of evidence. First, the VLM handles semantic judgments such as character archetype, face covering status, mouth state, and teeth visibility. Second, segmentation masks provide geometric tests such as whether face/eye/mouth regions exist, whether they lie inside the face boundary, whether eye occlusion is excessive, and whether the eye-to-head ratio or lip-seal curvature looks plausible. Third, for human-like faces only, mouth landmarks are used but gated by a lip-evenness statistic $\mathrm{CV}_{\mathrm{lip}}$ so that unreliable landmarks do not poison the decision. The mouth decision is the only component that uses explicit fusion; for human-like characters it is a majority vote across VLM, segmentation, and landmarks when the landmark gate passes, while for non-human characters the landmark signal is ignored.
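
The mouth-decision fusion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the boolean vote encoding, and the gate threshold `cv_lip_max` are all hypothetical, and the sketch assumes that a lower $\mathrm{CV}_{\mathrm{lip}}$ means more reliable landmarks. When only two signals remain (non-human characters, or a failed landmark gate), it assumes both must agree, a conservative tie rule the source does not specify.

```python
from typing import Optional


def mouth_localizable(vlm_vote: bool,
                      seg_vote: bool,
                      landmark_vote: Optional[bool],
                      cv_lip: Optional[float],
                      human_like: bool,
                      cv_lip_max: float = 0.5) -> bool:
    """Fuse mouth evidence from VLM, segmentation, and (gated) landmarks.

    cv_lip_max is a hypothetical threshold, not a value from the paper.
    """
    votes = [vlm_vote, seg_vote]

    # Landmarks count only for human-like faces and only if the gate passes.
    gate_ok = (human_like
               and landmark_vote is not None
               and cv_lip is not None
               and cv_lip <= cv_lip_max)
    if gate_ok:
        votes.append(landmark_vote)

    # Three votes: strict majority (at least two agree).
    # Two votes: strict majority means both must agree (assumed tie rule).
    return sum(votes) * 2 > len(votes)
```

With this rule, a disagreeing VLM and segmentation pair is resolved by landmarks only when the landmarks are trusted; otherwise the asset is rejected for Q4.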

Per-question signal assignment

Signal source used by each riggability question.
Question	Checked aspect	Signal source
Q1	Face type (human / animal / other)	VLM
Q2	Number of visible faces	VLM
Q3	Number of visible eye orbits	VLM
Q4	Mouth present and localizable	Majority vote
Q5	Eye-region occlusion or obstruction	Segmentation mask
Q6	Eyewear covering eyes	VLM
Q7	Mouth state	VLM
Q8	Teeth visibility	VLM
Q9	Eye-to-head size ratio	Segmentation mask
Q10	Lip-seal curvature	Segmentation mask

The checker is conservative by design. The paper explicitly treats riggability as an input eligibility filter rather than a guarantee of final success: it rejects obvious invalid assets such as missing faces, severe eye or mouth occlusions, non-standard viewpoints, and unsupported facial topologies. Even a screened asset can still fail later because of segmentation errors, unstable template fitting, or implausible mouth localization. The validation metrics are therefore reported separately from final rigging success.

Riggability checker performance

Riggability assessment on a mixed validation set of approximately 500 assets with human ground-truth riggability labels.
Method	Accuracy	Recall	F1
VLM only	94.69%	94.76%	94.71%
CV segmentation filtering only	76.72%	94.87%	80.87%
VLM + CV (Ours)	>95%	>95%	>95%

The paper’s takeaway is that the VLM is strong at semantic understanding, while the CV side is better at rejecting geometric edge cases. Combining them gives the best overall checker.

Segmentation ensemble

For face parsing, OmniFaceRig uses a four-model ensemble that is explicitly designed to cover human and non-human topology variation. The ensemble runs face landmark detection, a base Sapiens parser, SAM 3, and fine-tuned Sapiens parsers in parallel, then selects the best output per facial region. The design is important because no single model is reliable across humans, stylized humanoids, and animals. Human face landmarks are precise on conventional faces, while the Sapiens-based models and SAM 3 cover stylized or non-human cases.

The authors also describe a Sapiens-based training recipe. Starting from Sapiens-1B, they first pretrain at $512 \times 512$ resolution and then briefly switch to $1024 \times 1024$ to sharpen facial boundary cues. They also perform mid-stage unsupervised adaptation on 20 million curated stylized character images. For fine-tuning, they freeze the ViT encoder, attach a lightweight three-stage deconvolutional decoder, and train two parser heads: a 41-class stylized parser for human-like characters and a 38-class humanoid parser for cartoon animals and strongly non-human characters.

Segmentation training details

Backbone: Sapiens-1B ViT encoder.
Training data: about 10,000 annotated frontal T-pose images covering humans, humanoids, felines, canines, ursines, and other stylized creatures.
Optimization: AdamW with learning rate $10^{-4}$, cosine decay, 500 warmup steps, batch size 16, and class-balanced cross-entropy.
Schedule: 50 epochs on 8× NVIDIA A100 GPUs, about 24 hours.
Augmentation: horizontal flip, color jitter, random rotation of $\pm 15^\circ$, and random scale of $0.8$ to $1.2$.
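
The optimization settings above imply a learning-rate curve along the lines of this sketch. Several details are assumptions rather than facts from the paper: the warmup is taken to be linear, the batch size of 16 is treated as the global batch, the dataset is rounded to exactly 10,000 images, and the cosine is assumed to decay to zero. The name `lr_at` is hypothetical.

```python
import math

BASE_LR = 1e-4                       # from the listed settings
WARMUP_STEPS = 500                   # from the listed settings
STEPS_PER_EPOCH = 10_000 // 16       # assumes ~10,000 images, global batch 16
TOTAL_STEPS = STEPS_PER_EPOCH * 50   # 50 epochs


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches BASE_LR on the last warmup step.
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

Under these assumptions the run has 625 steps per epoch and 31,250 steps in total, so warmup covers less than one epoch.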

Template registration

The fitting stage aligns a small library of quad-mesh templates to the input asset using keypoints extracted from the ensemble masks. OmniFaceRig maintains three topology-specific face templates: human, long-muzzle, and short-muzzle. The VLM chooses among them automatically based on the character’s topology. The paper’s design choices are intentionally tuned for robustness on stylized and animal faces rather than just conventional human heads.

A central design decision is the nose-landmark-free template. The authors argue that nose contours are too unreliable across animals and stylized characters: noses are often tiny, irregular, merged with the muzzle, or incorrectly segmented. Instead of forcing nose anchors, the optimization uses surrounding facial structure to infer the nose region indirectly. This avoids unstable correspondences and makes the method more robust to diverse anatomy.

Why nose anchors are unreliable. Across five representative assets the nose region exhibits dramatic variation in both detection quality and contour shape. From left to right: (1) a human with a clean, well-detected nose (a positive baseline); (2) a human where nose detection outright fails; (3) a stylized humanoid where detection returns a mask but the contour is oddly shaped; (4) an animal where detection succeeds but the mask has the wrong overall nose shape; and (5) an animal where detection technically succeeds but the nose mask merges with the mouth into an irregular contour. Rows (top to bottom): input asset render; final nose segmentation mask; asset render with the mask overlaid. Even when detection succeeds, the strongly varying nose contour means that uniformly-sampled boundary keypoints do not correspond across assets, motivating our nose-landmark-free template design.

The template design also separates the mesh used for fitting from the mesh used for animation quality. The paper notes that the fitting template and the animated topology are decoupled and linked by UV correspondence, so the fitting mesh can be optimized for landmark matching without sacrificing edge flow around eyes and mouth. In addition, the template is a minimum region fit: it hugs the convex hull of facial features instead of extending deep into neck or chin regions, which reduces merge artifacts.

Keypoint extraction is robustified by several fallbacks. If a border keypoint projects badly, the system snaps it to the nearest valid face-mask boundary and, if necessary, searches the nearest pixel on the actual mesh surface. If 2D-to-3D projection still fails, later keypoints can fall back to the resolved 3D locations of earlier anchors. Eye masks are also selected adaptively: landmark-derived eye masks are preferred on humans, segmentation-derived eye masks are preferred on non-humans, and the system falls back automatically when needed.
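
The first fallback, snapping a badly projected border keypoint to the nearest valid face-mask boundary, can be illustrated with a brute-force search. This is a sketch under stated assumptions: the mask is a small 2D grid of 0/1 values, a boundary pixel is a mask pixel with at least one 4-neighbor outside the mask, and the function name is hypothetical. The paper does not describe its search procedure.

```python
import math


def snap_to_mask_boundary(mask, point):
    """Return the boundary pixel (row, col) of `mask` nearest to `point`.

    Returns None if the mask has no foreground pixels.
    """
    h, w = len(mask), len(mask[0])

    def is_boundary(r, c):
        if not mask[r][c]:
            return False
        # A foreground pixel is on the boundary if any 4-neighbor is
        # background or lies outside the image.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if rr < 0 or rr >= h or cc < 0 or cc >= w or not mask[rr][cc]:
                return True
        return False

    best, best_d2 = None, math.inf
    pr, pc = point
    for r in range(h):
        for c in range(w):
            if is_boundary(r, c):
                d2 = (r - pr) ** 2 + (c - pc) ** 2
                if d2 < best_d2:
                    best, best_d2 = (r, c), d2
    return best
```

A production system would more likely use a distance transform or a spatial index than an exhaustive scan, but the result is the same.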

Registration objective

After a rigid alignment step, the template undergoes non-rigid optimization with a composite energy. The rigid stage minimizes

$$E_{\text{rigid}} = \sum_{i \in \mathcal{K}} \| s \mathbf{R}\mathbf{v}_i + \mathbf{t} - \mathbf{p}_i \|^2,$$

where $\mathcal{K}$ is the keypoint set, $\mathbf{R}$ is rotation, $s$ is scale, $\mathbf{t}$ is translation, $\mathbf{v}_i$ are template keypoints, and $\mathbf{p}_i$ are target asset keypoints.
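
An energy of this form has a well-known closed-form minimizer (Umeyama's similarity alignment). The sketch below shows that standard solution with NumPy. The summary does not say which solver the authors use, so treat this as one common way to minimize $E_{\text{rigid}}$ rather than the paper's method; the function name is hypothetical.

```python
import numpy as np


def fit_similarity(V, P):
    """Return (s, R, t) minimizing sum_i ||s R v_i + t - p_i||^2.

    V: (n, 3) template keypoints, P: (n, 3) target keypoints.
    """
    V = np.asarray(V, dtype=float)
    P = np.asarray(P, dtype=float)
    mu_v, mu_p = V.mean(axis=0), P.mean(axis=0)
    Vc, Pc = V - mu_v, P - mu_p

    # Cross-covariance between target and template keypoints.
    cov = Pc.T @ Vc / len(V)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt

    var_v = (Vc ** 2).sum() / len(V)       # template keypoint variance
    s = float(np.trace(np.diag(S) @ D) / var_v)
    t = mu_p - s * (R @ mu_v)
    return s, R, t
```

The reflection guard matters in practice: without it, nearly planar or noisy keypoint sets can yield a mirrored template.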

The non-rigid stage optimizes per-vertex offsets $\mathbf{d}_i$ under a weighted sum

$$E(\mathbf{D}) = \lambda_1 E_{\text{corr}} + \lambda_2 E_{\text{smooth}} + \lambda_3 E_{\text{edge}} + \lambda_4 E_{\text{tri}} + \lambda_5 E_{\text{flip}} + \lambda_6 E_{\text{reg}}.$$

The terms are:

Correspondence loss: pulls deformed template keypoints toward the target keypoints, using a robust penalty such as Huber loss.
Smoothness: encourages neighboring vertices to move coherently.
Edge-length preservation: keeps local geometry close to the template.
Triangle-shape preservation: reduces shearing and skinny triangles.
Flip penalty: discourages orientation reversal and foldovers.
Offset regularization: keeps the deformation from becoming unnecessarily large.
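
To make the weighted sum concrete, the sketch below evaluates four of the six terms (correspondence, smoothness, edge-length preservation, and offset regularization) for a given set of offsets. The triangle-shape and flip terms are omitted for brevity. The exact form of each term, the Huber threshold, and the weights are assumptions chosen for illustration; the summary does not give the paper's definitions or $\lambda$ values.

```python
import numpy as np


def huber(r, delta=1.0):
    """Huber penalty applied elementwise to non-negative residuals r."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))


def registration_energy(verts, offsets, edges, kp_idx, kp_target,
                        lambdas=(1.0, 0.1, 0.1, 0.01)):
    """Partial composite energy; lambdas are hypothetical weights."""
    deformed = verts + offsets
    i, j = edges[:, 0], edges[:, 1]

    # E_corr: robust distance between deformed keypoints and targets.
    residual = np.linalg.norm(deformed[kp_idx] - kp_target, axis=1)
    e_corr = huber(residual).sum()

    # E_smooth: vertices joined by an edge should move coherently.
    e_smooth = ((offsets[i] - offsets[j]) ** 2).sum()

    # E_edge: deformed edge lengths should stay close to the template's.
    len0 = np.linalg.norm(verts[i] - verts[j], axis=1)
    len1 = np.linalg.norm(deformed[i] - deformed[j], axis=1)
    e_edge = ((len1 - len0) ** 2).sum()

    # E_reg: keep offsets small overall.
    e_reg = (offsets ** 2).sum()

    l_corr, l_smooth, l_edge, l_reg = lambdas
    return float(l_corr * e_corr + l_smooth * e_smooth
                 + l_edge * e_edge + l_reg * e_reg)
```

A uniform translation leaves the smoothness and edge terms at zero, which is why the rigid step is solved first: the non-rigid terms then only pay for genuine shape change.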

The paper reports near-zero alignment error below $10^{-5}$ on normalized meshes, and emphasizes that the nose-free design plus the flip penalty eliminate crashes on animal assets in Omni-Bench.

Inner-mouth synthesis and blendshape construction

Once the face template is fitted, Stage 2 builds the final rig. The original input mesh does not include an oral cavity, so teeth, gums, and tongue must be synthesized rather than transferred. OmniFaceRig selects one of four inner-mouth archetypes from a small library: human, canine, monster, or flat. The VLM chooses the archetype in Stage 1 based on the rendered appearance of the character.

The selected teeth template is warped into the fitted mouth using radial basis function deformation and non-uniform scaling. Placement is refined with As-Rigid-As-Possible deformation and then further corrected with a signed distance field pass that pushes intersecting face regions outward along the SDF gradient. The gums and tongue are fit at the same time. This ARAP + SDF strategy is the paper’s main mechanism for reducing teeth-face intersections both at rest and under expression deformation.
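
The SDF correction step can be illustrated with a toy example in which an analytic sphere stands in for the teeth. This is only a sketch of the general idea of pushing penetrating face vertices outward along the SDF gradient: the real pipeline would need a signed distance field of the actual teeth mesh, and the function names and the `margin` parameter are hypothetical.

```python
import numpy as np


def sphere_sdf(points, center, radius):
    """Signed distance and unit gradient for a sphere (toy teeth stand-in)."""
    diff = points - center
    dist = np.linalg.norm(diff, axis=1)
    grad = diff / np.maximum(dist, 1e-12)[:, None]
    return dist - radius, grad


def push_out(face_verts, center, radius, margin=0.0):
    """Move face vertices with SDF < margin outward along the SDF gradient."""
    face_verts = np.asarray(face_verts, dtype=float)
    d, g = sphere_sdf(face_verts, np.asarray(center, dtype=float), radius)
    # Penetration depth is zero for vertices already outside the margin.
    depth = np.clip(margin - d, 0.0, None)
    return face_verts + depth[:, None] * g
```

Vertices outside the margin are left untouched, so the correction stays local to the contact region.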

After the geometry is in place, the UV layout is rebuilt and texture is re-sampled onto the new mesh. The face region is given roughly 10× more texel density than non-face regions, and the inner-mouth surfaces are placed on a separate texture channel. The final facial expressions are transferred from a canonical FACS template using sparse deformation transfer, followed by Point Deform and Delta Mush smoothing. The authors note that this preserves character identity while letting the same expression basis work across humans, humanoids, and animals.

Expression-specific post-processing rules applied after sparse deformation transfer.
Shape category	Refinement
Closed-eye shapes	Upper-eyelid vertices are snapped to the lower lid and locally relaxed for contact-aware closure.
Eye-gaze shapes	Eye-region vertices rotate around virtual eyeball centers estimated by Procrustes fit.
Jaw-related shapes	The lower-teeth subset moves coherently with jaw motion.
Collision-aware refinement	SDF-based teeth-face penetration checks are re-applied for each expression shape.

The method supports three output tiers: Core with 13 shapes, Additional with 46 shapes, and Full with 155 shapes. The paper frames these as configurable deployment tiers rather than separate models.

Experiments and Results

All rigging and evaluation experiments are run on a single NVIDIA A100 GPU, with segmentation model training performed offline on 8× A100s. The paper reports end-to-end latency of 20 to 30 seconds per asset including data I/O. Stage 1 takes about 8 to 10 seconds, and Stage 2 takes about 10 to 15 seconds. The authors describe this as a reduction of two to three orders of magnitude compared with manual sculpting workflows that can take hours or days.
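
As a rough check on that claim, take a 25-second midpoint for the pipeline and one hour and one day as illustrative manual-rigging times (all three figures are assumptions, not values from the paper):

$$\frac{3600\ \text{s}}{25\ \text{s}} = 144 \approx 10^{2.2}, \qquad \frac{86400\ \text{s}}{25\ \text{s}} = 3456 \approx 10^{3.5}.$$

Both ratios fall in or slightly above the stated range.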

Evaluation metrics

MAE: mean absolute per-vertex error between predicted and ground-truth expression meshes.
Q95: the 95th-percentile vertex error, used as a worst-case measure.
Penetration rate: percentage of penetrating vertices between inner-mouth components and the outer face.
Success rate: percentage of screened assets for which the full pipeline completes and passes final geometric quality checks.
Processing latency: end-to-end runtime in seconds.
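
The three geometric metrics can be computed along the following lines. This sketch assumes the per-vertex error is the Euclidean distance between corresponding vertices and that penetration is detected by a negative signed distance; the paper's exact definitions may differ, and the function names are hypothetical.

```python
import numpy as np


def vertex_errors(pred, gt):
    """Per-vertex Euclidean distance between two (n, 3) meshes."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return np.linalg.norm(diff, axis=1)


def mae(pred, gt):
    """Mean per-vertex error."""
    return float(vertex_errors(pred, gt).mean())


def q95(pred, gt):
    """95th-percentile per-vertex error (linear interpolation)."""
    return float(np.percentile(vertex_errors(pred, gt), 95))


def penetration_rate(signed_dist):
    """Percentage of vertices with negative signed distance."""
    sd = np.asarray(signed_dist, dtype=float)
    return 100.0 * float((sd < 0).mean())
```

Reporting Q95 next to MAE is useful because a rig can have a low mean error while still showing visible artifacts in a small region such as the lip corners.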

Qualitative OmniFaceRig results across diverse 3D characters from Omni-Bench. Each row is one character. The leftmost column is the input static surface-only mesh with no pre-modeled oral cavity; the remaining six columns are generated expression frames labeled at the top (e.g., surprise, smile, love, anger), most involving open-mouth deformations that expose the procedurally generated inner-mouth geometry (teeth, gums, tongue) synthesized by Stage 2. Rows span the four topology families targeted by OmniFaceRig (realistic and stylized humans, humanoid characters, long-muzzled animals such as dogs, wolves, and foxes, and short-muzzled animals such as cats, bears, and tigers), all produced by the same fully automatic pipeline. Three properties are visible across the grid: cross-topology generalization via the three-template + nose-landmark-free design; intersection-free inner-mouth geometry under strong expressions via the ARAP+SDF placement; and preserved character-specific identity from per-character sparse deformation transfer. Additional grids appear in the paper's appendix.

Comparison with prior methods on human/humanoid heads

Quantitative comparison with recent facial rigging methods on 200 human/humanoid heads. Baseline numbers (DT, NFR, RigAnyFace) are taken from RigAnyFace. All meshes are normalized to a unit sphere. DT requires manual correspondence annotations.
Method	MAE ↓ (mm)	Q95 ↓ (mm)	Penetration ↓ (%)
DT	2.93	8.41	--
NFR	2.77	7.21	--
RigAnyFace	1.01	2.94	0.17
OmniFaceRig (Ours)	0.85	2.50	0.05

On this 200-head evaluation, OmniFaceRig achieves the best accuracy and the lowest penetration rate. Relative to RigAnyFace, the paper reports about a 16% improvement in MAE and a 15% improvement in Q95, while penetration drops to 0.05%, a 3.4× improvement over RigAnyFace.
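
These relative figures follow directly from the table rows for RigAnyFace and OmniFaceRig:

$$\frac{1.01 - 0.85}{1.01} \approx 15.8\%, \qquad \frac{2.94 - 2.50}{2.94} \approx 15.0\%, \qquad \frac{0.17}{0.05} = 3.4.$$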

Omni-Bench comparison

Quantitative comparison on a 300-asset human + humanoid subset of Omni-Bench.
Method	MAE ↓ (mm)	Q95 ↓ (mm)	Penetration ↓ (%)	Success rate ↑ (%)
DT	3.12	9.05	--	82.4
NFR	2.95	8.11	--	85.1
OmniFaceRig (Ours)	0.92	2.71	0.08	99.0

The Omni-Bench evaluation is restricted to humans and humanoids because DT and NFR are human-topology methods. On this subset, OmniFaceRig retains sub-millimeter error, very low penetration, and a near-perfect final success rate. The paper explicitly notes that RigAnyFace is not re-evaluated here because its code is not publicly available. For the broader Omni-Bench set including animals, the authors report that OmniFaceRig achieves zero crashes on all animal assets.

Segmentation ablation

Contribution of each segmentation model to face detection recall on Omni-Bench.
Model config	Human	Cat	Dog	Overall
Face landmark only	92%	12%	18%	41%
Sapiens only	88%	35%	42%	55%
SAM 3 only	78%	71%	68%	72%
Fine-tuned Sapiens only	90%	89%	85%	88%
4-model ensemble (Ours)	100%	100%	97%	~99%

This ablation makes the logic of the ensemble clear. Face landmarks are very good on standard human faces, but collapse on animals. The base Sapiens parser is better than landmarks on non-human faces but still degrades due to its human-centric pretraining. SAM 3 provides the broadest single-model coverage, but at coarser boundaries. Fine-tuned Sapiens improves animal parsing substantially. The combined four-model ensemble is best because it mixes precision on humans with broad coverage on stylized and animal faces.

Nose-anchor ablation

The paper strongly motivates the nose-free template design with both qualitative and quantitative evidence. When nose anchors are included, the pipeline produces severely distorted nose-mouth regions on 78% of animal assets and crashes on 10%, driven by inaccurate or unstable nose segmentation and by the fact that uniform boundary sampling on tiny or irregular nose contours does not produce consistent correspondences. Removing nose anchors eliminates those artifacts and reduces the crash rate to 0% across all animal assets in Omni-Bench.

Latency

The full system runs in roughly 20 to 30 seconds per asset on a single A100, including I/O. That is far faster than manual rigging and fast enough for offline or batch generation, but the paper does not claim real-time interactivity and does not position the system as a real-time editing tool.

Limitations

Failure cases. From left to right: (1) a bird with a sharp beak and an unusual surface around the beak that segmentation cannot localize as a standard face; (2) a fish-like face with a very wide, elongated mouth that extends almost to the face boundary, on which segmentation tends to fill in the mouth region incorrectly; (3) an insect with oversized compound eyes located near the face boundary and an irregular mouth layout that disrupts landmark detection; (4) a seahorse-like animal with a strongly elongated or flattened snout (or a heavily occluded mouth) whose facial topology falls outside our template families. Rows (top to bottom): input render; final segmentation mask; asset render with the mask overlaid. In all four cases, segmentation either breaks down or the underlying face/mouth anatomy cannot be reliably fit by any of our human, long-muzzle, or short-muzzle templates.

The limitation section is explicit: OmniFaceRig is robust within its supported topology families, but it is not a universal facial parameterization. The current system targets visible-faced characters whose facial topology falls within the human, long-muzzle, and short-muzzle template families. Birds with sharp beaks, fish-like faces, insects with oversized compound eyes, and seahorse-like animals with unusual snouts are outside the system’s design envelope and may be rejected by the riggability checker or fail later even if segmentation succeeds.

The paper also notes that expanding coverage will require more templates, more segmentation labels, and new riggability criteria. Even for supported families, the pipeline is best suited to offline or batch processing rather than real-time editing.

Takeaway

The main technical contribution of OmniFaceRig is not a single learned model but a carefully engineered automatic rigging system that combines semantic filtering, multi-model parsing, topology-aware template fitting, explicit oral-cavity synthesis, and collision-aware blendshape transfer. The paper’s result is a practical path from generated surface-only characters to animation-ready facial rigs across both human and non-human topologies, backed by a new benchmark and a set of ablations that isolate why each design choice matters.
