Akapulu Labs Research

AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization

AvatarMix is a compositional method for 3D avatar outfit personalization that preserves both identity and garment quality by directly combining head and body from two Gaussian avatars. It uses a two-tier diffusion refinement and mesh retargeting to ensure seamless joins and adapt garments to diverse body shapes.

  • 3d-avatar
  • avatar
  • full-body
  • gaussian-splatting
  • face-reconstruction

Demos

AvatarMix demos highlight identity-preserving outfit personalization by combining a user's face, body shape, and skin tone with a model's clothed 3D Gaussian avatar. Look for seamless joins, preserved garment details, natural body retargeting, and free-viewpoint consistency. Comparisons show improved texture fidelity and fewer artifacts versus prior methods.

Authors: Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

Categories: cs.CV, cs.GR

Comment: CVPR 2026 Findings. 16 pages, including supplementary material

Published 2026-06-02 · Updated 2026-06-02

Abstract

Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user's body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user's physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: https://larsph.github.io/avatarmix/


1. Problem Setting and Core Idea

AvatarMix addresses 3D avatar identity transfer for outfit personalization: given two high-fidelity 3D Gaussian avatars, one representing the user and one representing the model, the goal is to transfer the user’s identity cues—face, head/neck region, body shape, and skin tone—onto the model’s outfit without sacrificing garment fidelity or introducing cross-layer intersection artifacts. The paper positions this as a distinct task from 2D garment-image try-on and from layer-based 3D clothing transfer. Existing approaches either lift 2D edits into 3D and therefore inherit multi-view inconsistency and appearance degradation, or explicitly separate body and clothing layers and then struggle with mesh intersections and exposed-skin rendering when coverage differs.

The central design choice in AvatarMix is compositional rather than generative: instead of synthesizing an edited avatar from scratch, it directly composes the final result from two existing 3D Gaussian avatars. The user avatar contributes the head, neck, body shape, and skin tone; the model avatar contributes the clothed body and garment geometry. This preserves the original 3D consistency and high-frequency detail of both sources by construction, while shifting the remaining problem to two targeted issues: making the head–body join seamless and adapting the model body to the user’s physique.

Overview of AvatarMix. Given multi-view images of a user and a model, we first perform Mesh-Based Avatar Reconstruction. We then perform Cross-Avatar Geometric Composition by aligning the user’s head and neck to the model’s pose (Head & Neck Alignment) and reshaping the model’s clothed body via our GSReshape module (Body Alignment & Reshaping) so that the body geometry matches the user’s physique, yielding a composite mesh-based Gaussian avatar. Finally, our two-tier diffusion refinement (SeamFix for localized head–neck seams and optional FullbodyFix for full-body garment/skin artifacts) operates on rendered views, followed by 3D Gaussian fine-tuning, to produce the final result.

The method is organized into three stages: (1) mesh-based avatar reconstruction with semantic segmentation; (2) cross-avatar geometric composition, including head/neck alignment and body reshaping; and (3) diffusion refinement on rendered images, using a localized seam fixer and an optional whole-body restoration module. The overall aim is to preserve the model’s outfit quality, the user’s identity, and multi-view consistency simultaneously.

2. Representation and Reconstruction

AvatarMix requires high-fidelity source avatars with explicit geometry, so the paper reconstructs each subject as a mesh-based Gaussian avatar. Given multi-view images of the user and the model, NeuS2 is used to reconstruct a mesh for each avatar, with vertex positions $\mathcal{M} \in \mathbb{R}^{N_v \times 3}$, where $N_v$ is the number of vertices. The authors specifically choose a mesh representation because the outfit personalization stage needs a geometric substrate that can be retargeted robustly across diverse body shapes, including loose clothing and non-watertight surfaces.

On top of the mesh, they place 3D Gaussians using the SplattingAvatar formulation: each mesh vertex $\mathbf{v}_i$ is associated with a Gaussian whose position is constrained to that vertex, along with covariance, color, and opacity. This keeps the representation photorealistic while preserving an explicit structural link to the mesh, which later enables retargeting and editing.

For semantic decomposition, the paper uses a modified 4D-Dress pipeline to segment vertices into head, torso skin, left/right arm skin, left/right leg skin, and clothing. The modification removes SAM-based voting to improve face and hand separation, especially to avoid over-aggregation of skin regions. The pipeline also assumes fitted SMPL-X parameters for each subject, which act as a body model and pose guide during composition.

3. Cross-Avatar Geometric Composition

3.1 Head and Neck Alignment

The user’s head and neck are transferred to the model pose rather than synthesized. The paper first transfers linear blend skinning weights from the user’s SMPL-X model to the high-resolution user mesh via nearest-neighbor mapping, then reposes the mesh to match the model’s pose. The head and neck vertices are extracted using the semantic segmentation. Including the neck is deliberate: it preserves more identity information, and any mismatch at the boundary is later handled by diffusion refinement.
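The weight transfer and reposing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the nearest-neighbor search is brute force, and `joint_transforms` is assumed to already hold one 4x4 matrix per joint that maps the user's source pose to the model's target pose.

```python
import numpy as np

def transfer_skinning_weights(smplx_vertices, smplx_weights, mesh_vertices):
    """Copy skinning weights to each mesh vertex from its nearest SMPL-X vertex."""
    # Pairwise squared distances, shape (N_mesh, N_smplx).
    # Brute force for clarity; a spatial index would scale better.
    diff = mesh_vertices[:, None, :] - smplx_vertices[None, :, :]
    nearest = np.argmin(np.sum(diff * diff, axis=-1), axis=1)
    return smplx_weights[nearest]

def linear_blend_skinning(vertices, weights, joint_transforms):
    """Repose vertices with per-joint 4x4 transforms blended by per-vertex weights."""
    ones = np.ones((len(vertices), 1))
    homogeneous = np.concatenate([vertices, ones], axis=1)          # (N, 4)
    blended = np.einsum("nj,jab->nab", weights, joint_transforms)   # (N, 4, 4)
    posed = np.einsum("nab,nb->na", blended, homogeneous)           # (N, 4)
    return posed[:, :3]
```

After reposing, the head and neck vertices would be selected with the semantic labels from the segmentation step.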

3.2 GSReshape: Body Reshaping via Mesh Retargeting

The most technically involved component is GSReshape, which adapts a garment retargeting method originally designed for fitting clothes to a body shape. In AvatarMix, the method is repurposed to reshape the model’s clothed body mesh, including both clothing and exposed skin, so that it matches the user’s body shape represented by the user’s SMPL-X mesh. Because the retargeting is performed in explicit mesh space, the deformation can be transferred back to the attached Gaussians and thus to the rendered avatar.

The paper highlights three implementation challenges and corresponding fixes:

  • Hand-aware skin tightness. The original retargeting objective can overconstrain the hands. High fit weights on hand skin can push the mesh unnaturally and distort attached Gaussians, while low fit weights can produce a glove-like mismatch. AvatarMix resolves this by removing hand geometry from the SMPL-X mesh during retargeting, which prevents the SDF-based barrier from inflating the hands, and by using low fit weights on hand vertices. The method intentionally sacrifices explicit hand-shape adaptation to preserve visual cleanliness; the authors acknowledge this as a limitation.
  • Intersection-free initialization. The retargeting method requires an initialization in which the clothed mesh and skeleton do not intersect. Because clothed avatars intersect the skeleton more easily than isolated garments, the paper optimizes the SMPL skeleton vertices with an as-rigid-as-possible deformation so that they remain inside the clothed avatar while preserving bone rigidity.
  • Computational efficiency. The optimization is carried out on simplified meshes, and the resulting deformation is transferred to the original high-resolution mesh by nearest-surface projection. The Gaussians move with the mesh deformation.
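The low-to-high-resolution transfer in the last item can be illustrated as follows. This sketch makes a simplifying assumption: it uses nearest-vertex lookup in place of the nearest-surface projection the paper describes, and the function name is hypothetical. Because each Gaussian is constrained to a mesh vertex, the updated vertex positions double as the updated Gaussian centers.

```python
import numpy as np

def transfer_offsets(low_before, low_after, high_vertices):
    """Move high-resolution vertices by the offset of their nearest low-resolution vertex.

    Simplification: nearest vertex instead of nearest surface point.
    """
    offsets = low_after - low_before                       # (N_low, 3) retargeting offsets
    diff = high_vertices[:, None, :] - low_before[None, :, :]
    nearest = np.argmin(np.sum(diff * diff, axis=-1), axis=1)
    return high_vertices + offsets[nearest]

# Gaussians bound to vertices follow the deformed mesh directly.
low_before = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
low_after = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0]])
high = np.array([[0.1, 0.0, 0.0], [0.8, 0.0, 0.0], [1.1, 0.0, 0.0]])
gaussian_centers = transfer_offsets(low_before, low_after, high)
```

The brute-force distance matrix is only practical for small meshes; a real pipeline would likely use a spatial acceleration structure.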

The supplementary material gives the body-reshaping objective in more detail. The skeleton optimization combines three terms: an inside penalty that keeps bone samples inside the clothed mesh, as measured by its signed distance field (SDF), a bone-length regularizer, and a root anchor term. In the notation of the paper, if $X = \{x_k\}_{k=1}^{N_s}$ denotes the skeleton vertices and $\phi(\cdot)$ is the SDF of the clothed mesh, then the optimization uses

$$ E_{\text{pre}}(X) = w_{\text{inside}} E_{\text{inside}}(X) + w_{\text{len}} E_{\text{length}}(X) + w_{\text{anch}} E_{\text{anchor}}(X), $$

with paper-reported weights $w_{\text{inside}} = 50.0$, $w_{\text{len}} = 5.0$, and $w_{\text{anch}} = 10.0$, and a margin $\delta = 0.1$ in the inside penalty.
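To make the structure of $E_{\text{pre}}$ concrete, the sketch below evaluates it for a toy skeleton. Only the three weights and the margin come from the text above. The form of each term is an assumption made for illustration: a squared hinge on $\phi(x_k) + \delta$ with the SDF taken as negative inside, a mean squared deviation from rest bone lengths, and a squared distance of the root to a fixed target. The sphere SDF and the names `bones`, `rest_lengths`, and `root_target` are hypothetical stand-ins.

```python
import numpy as np

# Weights and margin as reported in the text.
W_INSIDE, W_LEN, W_ANCH, DELTA = 50.0, 5.0, 10.0, 0.1

def sphere_sdf(points, radius=1.0):
    """Toy SDF standing in for the clothed mesh: negative inside, positive outside."""
    return np.linalg.norm(points, axis=-1) - radius

def e_pre(x, bones, rest_lengths, root_index, root_target, sdf=sphere_sdf):
    """Weighted sum of inside, bone-length, and anchor terms (assumed forms)."""
    # Inside penalty: zero once a sample is at least DELTA inside the surface.
    e_inside = np.mean(np.maximum(0.0, sdf(x) + DELTA) ** 2)
    # Bone-length regularizer: penalize deviation from rest lengths.
    lengths = np.linalg.norm(x[bones[:, 0]] - x[bones[:, 1]], axis=-1)
    e_length = np.mean((lengths - rest_lengths) ** 2)
    # Root anchor: keep the root joint near its original position.
    e_anchor = np.sum((x[root_index] - root_target) ** 2)
    return W_INSIDE * e_inside + W_LEN * e_length + W_ANCH * e_anchor

bones = np.array([[0, 1]])
rest_lengths = np.array([0.5])
root_target = np.zeros(3)
inside_pose = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
outside_pose = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
energy_inside = e_pre(inside_pose, bones, rest_lengths, 0, root_target)
energy_outside = e_pre(outside_pose, bones, rest_lengths, 0, root_target)
```

With these assumed forms, a skeleton that stays inside the surface with rest-length bones has zero energy, while pushing a joint outside is penalized by both the inside and length terms.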

Once reshaping is complete, the user’s aligned head and neck replace the model’s corresponding region, and the method also performs a global skin-tone transfer in Lab color space. The skin-tone transfer computes opacity-weighted means and variances of the user face and model skin Gaussians, then applies a channel-wise affine transform so the model body’s skin color matches the user’s while largely preserving shading and high-frequency detail.
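The skin-tone transfer amounts to matching opacity-weighted first- and second-order statistics per Lab channel. A minimal sketch follows, assuming the Gaussian colors have already been converted to Lab and are given as `(N, 3)` arrays; the function names and the small `eps` guard are illustrative additions, not details from the paper.

```python
import numpy as np

def weighted_mean_std(lab, opacity):
    """Opacity-weighted per-channel mean and standard deviation."""
    w = opacity / np.sum(opacity)
    mean = np.sum(lab * w[:, None], axis=0)
    var = np.sum(w[:, None] * (lab - mean) ** 2, axis=0)
    return mean, np.sqrt(var)

def transfer_skin_tone(model_lab, model_opacity, user_lab, user_opacity, eps=1e-6):
    """Channel-wise affine map so model skin statistics match the user's face."""
    mu_m, sd_m = weighted_mean_std(model_lab, model_opacity)
    mu_u, sd_u = weighted_mean_std(user_lab, user_opacity)
    scale = sd_u / (sd_m + eps)   # eps avoids division by zero on flat channels
    return (model_lab - mu_m) * scale + mu_u
```

Because the map is a single scale and shift per channel, relative variations within the model's skin (shading, fine texture) are rescaled rather than replaced, which is consistent with the detail-preserving behavior described above.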

The result is a composite mesh-based Gaussian avatar that combines the user’s identity with the model’s outfit while retaining a consistent 3D representation.

GSReshape pipeline overview. From left to right: starting from the model's low-resolution clothed mesh (top left) and SMPL-X mesh (bottom left), we project the SMPL mesh to a skeleton, inflating the SMPL mesh while jointly optimizing the clothed mesh, following the retargeting method of Huang et al. After retargeting, we compute vertex offsets between the input and retargeted clothed meshes, and transfer these offsets to the original high-resolution clothed mesh via nearest-surface projection. The Gaussians defined on the high-resolution mesh are updated as well.
Hand-aware skin tightness examples. First row: a high fit weight produces Gaussian artifacts (left) versus our hand-shape-preserving method (right). Second row: a low fit weight creates glove-like hands (left) versus our approach (right). Our semantic weighting strategy achieves a better balance between visual fidelity and robustness.

4. Diffusion Refinement: SeamFix and FullbodyFix

Cross-avatar composition is geometrically valid, but the paper emphasizes two visual failure modes that remain after composition and retargeting: seam artifacts at the head–neck boundary and appearance degradation in the clothed body after reshaping. AvatarMix addresses these with a two-tier diffusion refinement strategy operating on rendered images from the already 3D-consistent Gaussian avatar. This is an important design choice: the refiner does not edit raw 2D inputs before reconstruction, which reduces the opportunity for multi-view inconsistency.

4.1 SeamFix

SeamFix is a localized diffusion module that targets only the hair and neck region. Its training data are generated through a double-swapping procedure. Starting from two avatars $A$ and $B$, the pipeline first composes $A \rightarrow B$, then reverses the process with $B \rightarrow A$. This creates realistic seam artifacts without manual annotation. Because the two composition operations return the geometry to alignment with the ground truth, the rendered double-swapped head and neck can serve as noisy inputs while the original avatar renderings provide supervision.

The paper also augments training by using 2D segmentation masks extracted from the second swap rendering. These masks may have missing neck pixels, which helps simulate real segmentation failures. The refinement crop includes the head and neck, with a dilated boundary to provide collar context. At inference, the refined crop is pasted back into the original image using feathered blending.
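The paste-back step can be sketched as an alpha blend whose weight falls off toward the crop border. The paper does not specify the feathering profile, so the linear ramp, the `feather` width, and the function names below are assumptions chosen for illustration.

```python
import numpy as np

def feather_mask(height, width, feather):
    """Weight 1 in the crop interior, ramping linearly to 0 at the crop border."""
    ys = np.arange(height)
    xs = np.arange(width)
    dy = np.minimum(ys, height - 1 - ys)[:, None]   # distance to top/bottom edge
    dx = np.minimum(xs, width - 1 - xs)[None, :]    # distance to left/right edge
    dist = np.minimum(dy, dx).astype(np.float64)
    return np.clip(dist / float(feather), 0.0, 1.0)

def paste_back(image, refined_crop, top, left, feather=16):
    """Blend a refined crop into the full render at (top, left)."""
    h, w = refined_crop.shape[:2]
    alpha = feather_mask(h, w, feather)[..., None]
    out = image.astype(np.float64).copy()
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * refined_crop + (1.0 - alpha) * region
    return out
```

The ramp keeps the original render untouched at the crop border, so the refined region blends in without a visible rectangular edge.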

4.2 FullbodyFix

FullbodyFix is an optional full-body restoration module used when body reshaping visibly degrades the clothed body. It is trained on full-body renders from double-swapped avatars, which contain a mixture of garment and skin artifacts induced by repeated composition. Unlike SeamFix, it restores the full human region rather than only the seam. The paper notes that FullbodyFix is applied manually in the current implementation when visual inspection suggests it is needed; no automatic trigger is used.

4.3 Backbone and Training Design

Both SeamFix and FullbodyFix are implemented on top of the pretrained Difix3D+ backbone. The paper freezes the original Difix LoRA weights and adds new trainable LoRA adapters to the UNet and VAE decoder. The two refinement modules share this structure but use different adapter capacities: SeamFix uses rank-8 adapters, while FullbodyFix uses rank-16 adapters. The supplementary material further states that rank-4 LoRA adapters are attached to the VAE decoder and that the VAE skip connections are fine-tuned in the Difix3D style.
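For readers unfamiliar with LoRA, the sketch below shows a single low-rank adapter on one frozen linear layer and how the rank affects the trainable parameter count. It is a generic NumPy illustration, not the Difix3D+ code: the class name, the 320-dimensional layer, the initialization, and the `alpha` scaling are assumptions, and the stacking on top of the frozen Difix LoRA weights is not modeled.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = x W^T + s * x A^T B^T."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                     # frozen base weight
        self.down = rng.normal(0.0, 0.01, size=(rank, in_dim))  # A, trainable
        self.up = np.zeros((out_dim, rank))                      # B, zero-init: adapter starts as a no-op
        self.scale = (rank if alpha is None else alpha) / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.down.T) @ self.up.T

    def num_trainable(self):
        return self.down.size + self.up.size

base = np.random.default_rng(1).normal(size=(320, 320))  # hypothetical layer size
seam_layer = LoRALinear(base, rank=8)       # rank used by SeamFix
fullbody_layer = LoRALinear(base, rank=16)  # rank used by FullbodyFix
```

Doubling the rank doubles the adapter's trainable parameters for a given layer, which gives the full-body module more capacity than the localized seam module.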

Training strategy for SeamFix and FullbodyFix. Top: starting from two avatars $A$ and $B$, we deliberately introduce segmentation errors by using the original 4D-Dress with SAM voting and perform a first head-swap $A \rightarrow B$, which produces Gaussian artifacts at the head, neck, and hands. After re-segmentation and a second head-swap $B \rightarrow A$, we obtain double-swapped avatars that are pixel-aligned with the ground-truth avatars but exhibit diverse composition-induced artifacts. Bottom: using the ground-truth renderings as supervision, SeamFix is trained on portrait crops where the artifact neck from the double-swapped avatar is combined with the ground-truth face and body under 2D masks, and FullbodyFix is trained on full-body renders. Both branches fine-tune LoRA adapters of a Difix-based diffusion backbone to correct artifacts from cross-avatar composition.

In the supplementary implementation details, SeamFix is trained for 10 epochs and FullbodyFix for 5 epochs on approximately 19k multi-view double-swapped samples generated from the THUman2.0 training subjects, with batch size 1. SeamFix operates on a cropped square head-and-neck region resized to $512 \times 512$, and FullbodyFix uses a tight body crop resized to $488 \times 896$. Training is done on a single NVIDIA RTX A6000 Ada GPU, taking roughly 16 hours for SeamFix and 28 hours for FullbodyFix.

Ablation of diffusion refinement and GSReshape. Left: rows show compositions without SeamFix, with SeamFix, without FullbodyFix, and with FullbodyFix for three user–model pairs, illustrating how SeamFix cleans head–neck seams and FullbodyFix restores garment appearance while preserving face and outfit details. Right: for two subjects, we compare the reference model body, composition without GSReshape, and composition with GSReshape; heads are cropped in the latter two columns so that differences in body shape and garment fit are easier to see.

5. Experimental Setup

5.1 Dataset and Pair Construction

Evaluation is performed on the THUman2.0 dataset, which contains 526 reconstructed clothed human subjects with diverse body shapes, clothing styles, and poses. Following the split used by VTON360, the paper uses 110 subjects as the test set and the remaining subjects for training the reconstruction and diffusion networks. For the 110 test subjects, the authors create user–model pairs by random identity/garment sampling and render 36 viewpoints per pair. AvatarMix and VTON360 are evaluated on all test pairs and all 36 views; TIP-Editor is evaluated on a subset of pairs under the same view protocol.
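The pair construction can be pictured with a short sketch. The paper states only that pairs are sampled randomly from the 110 test subjects and rendered from 36 viewpoints; the number of pairs, the requirement that user and model differ, and the evenly spaced azimuths below are assumptions made for illustration.

```python
import random

def sample_pairs(subject_ids, num_pairs, seed=0):
    """Draw (user, model) pairs of distinct subjects at random."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        user, model = rng.sample(subject_ids, 2)  # two different subjects
        pairs.append((user, model))
    return pairs

def azimuths(num_views=36):
    """Evenly spaced camera azimuths in degrees (assumed layout)."""
    return [360.0 * i / num_views for i in range(num_views)]

test_subjects = list(range(110))                   # placeholder subject IDs
pairs = sample_pairs(test_subjects, num_pairs=20)  # num_pairs is hypothetical
views = azimuths()
```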

5.2 Baselines

The paper compares against two recent baselines that address related but not identical tasks: VTON360 and TIP-Editor. VTON360 performs 3D virtual try-on by applying upper garments to target avatars via per-view 2D generation followed by 3D lifting. TIP-Editor is a localized 3D Gaussian splatting editing method that can be configured for head replacement. AvatarMix differs in that it performs explicit cross-avatar identity transfer with body-shape adaptation and seam/appearance refinement.

5.3 Metrics

Three quantitative metrics are reported on per-view edited images, plus a user study. Editing Target DINO measures preservation of the edited region using DINO similarity against the relevant reference: the upper garment for VTON360 and the clothed body for AvatarMix. Head and Neck DINO measures facial identity and seam quality using a head-and-neck mask against the ground-truth user avatar. Warping-based RMSE measures multi-view consistency by warping neighboring views into each other using dense correspondences and computing image-space RMSE. Lower RMSE indicates better view consistency. The paper explains that the CLIP Direction Consistency Score used by VTON360 is not well matched to the identity-transfer setting because pose and body changes can occur, so the authors choose a more direct image-space metric.
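The two kinds of measurement can be sketched as follows, assuming the expensive parts are computed elsewhere: the DINO embeddings of the masked regions are given as vectors, and the neighboring view has already been warped into the target view with a validity mask from the dense correspondences. The function names are hypothetical, and the paper's exact masking and aggregation may differ.

```python
import numpy as np

def embedding_similarity(feat_a, feat_b):
    """Cosine similarity between two feature vectors (e.g. DINO embeddings)."""
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return float(np.dot(a, b))

def warping_rmse(warped_neighbor, target, valid):
    """Image-space RMSE over pixels with a valid correspondence.

    warped_neighbor, target: (H, W, 3) float images in [0, 1].
    valid: (H, W) boolean mask of pixels covered by the warp.
    """
    diff = (warped_neighbor - target)[valid]   # (K, 3) differences at valid pixels
    return float(np.sqrt(np.mean(diff ** 2)))
```

Restricting the RMSE to valid pixels matters because occluded or unmatched regions have no meaningful correspondence and would otherwise dominate the error.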

The user study involves 23 participants and asks forced-choice questions over 15 identity-transfer cases per participant, covering overall realism, view consistency, and facial/seam quality.

6. Quantitative Results

On THUman2.0, AvatarMix reports the best result on every metric it is evaluated on. The numbers below are taken directly from the paper’s main quantitative table.

| Method | Editing Target DINO $\uparrow$ | Head+Neck DINO $\uparrow$ | Warping RMSE $\downarrow$ | Vote Overall $\uparrow$ | Vote Consistency $\uparrow$ | Vote Facial $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| VTON360 | 0.633 | 0.786 | 0.0276 | 8.70% | 10.43% | 7.83% |
| TIP-Editor | N/A | 0.356 | 0.0388 | 2.61% | 2.61% | 0% |
| AvatarMix | 0.883 | 0.818 | 0.0175 | 88.69% | 86.96% | 92.17% |

The paper’s interpretation is straightforward: AvatarMix preserves the edited region better than the baselines, preserves head/neck identity better than the baselines, and yields the lowest warping RMSE, which indicates stronger multi-view consistency. The user study mirrors the quantitative results, with AvatarMix receiving overwhelming preference for overall realism, consistency, and facial quality.
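As a sanity check on the user-study numbers, the reported percentages can be converted back to vote counts. This assumes each of the 23 participants answered each of the three questions for all 15 cases, giving 345 responses per question; the paper does not state this breakdown explicitly, so the counts below are an inference rather than reported data.

```python
participants, cases = 23, 15
total = participants * cases  # responses per question under the stated assumption

# Reported shares (%) for overall realism, view consistency, facial/seam quality.
shares = {
    "VTON360": [8.70, 10.43, 7.83],
    "TIP-Editor": [2.61, 2.61, 0.0],
    "AvatarMix": [88.69, 86.96, 92.17],
}

# Round each share to the nearest whole number of votes.
counts = {
    method: [round(p / 100.0 * total) for p in percents]
    for method, percents in shares.items()
}

# Votes per question summed over the three methods.
column_totals = [sum(counts[m][q] for m in counts) for q in range(3)]
```

Under this assumption the implied counts for each question sum exactly to the number of responses, so the reported percentages are internally consistent.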

Qualitative comparison with TIP-Editor and VTON360. For each user–model pair, we show the input user and model images (front/back under two lighting conditions), followed by three-view outputs of TIP-Editor, VTON360, and AvatarMix. Zoomed insets highlight faces and garment regions, and red dashed boxes mark typical failure cases of existing methods, including view inconsistency, unnatural garment wrinkles, and degraded hands. AvatarMix better preserves facial identity, garment texture, and seam quality while avoiding these artifacts.

The qualitative comparisons emphasize three recurring strengths: the user’s facial identity is preserved, SeamFix removes visible neck discontinuities, and the outfit remains close to the source model garment. The comparison also shows that VTON360 can introduce view inconsistency and garment artifacts, while TIP-Editor can preserve only the head region and does not solve outfit personalization in the same way.

Additional comparisons with THUman2.0. We compare AvatarMix with baselines on more user-model pairs, demonstrating superior preservation of identity and outfit across diverse views.

7. Ablation Findings

The ablation study isolates the effect of the two refinement modules and GSReshape. Without SeamFix, the head–neck join exhibits visible artifacts; adding SeamFix cleans the seam while leaving the unaffected regions intact. Without FullbodyFix, some examples keep the garment appearance degradation introduced by body reshaping; FullbodyFix restores the outfit more faithfully. On the geometric side, removing GSReshape leaves the model body shape largely unchanged and can lead to poor fit for the user physique, whereas GSReshape adapts the garment and body consistently, producing smoother alignment to the user’s shape while preserving clothing details.

The supplementary hand examples further support the claim that the semantic weighting strategy in GSReshape strikes a better balance than either extreme: overly strong fit weights can cause Gaussian artifacts, while overly weak weights produce glove-like hands. The paper’s chosen design reduces these failures but does not fully model hand-shape adaptation.

Additional ablation on GSReshape. We visualize the effect of our body reshaping module by comparing the model avatars without GSReshape versus with GSReshape. As the "with GSReshape" results show, the garment adapts smoothly to the user's body shape while preserving details after body reshaping.

8. Limitations and Future Work

The paper is explicit about two limitations. First, GSReshape does not explicitly model detailed hand-shape adaptation, so extreme body-shape differences can still create mismatches around the hands. Second, very loose garments or highly complex accessories may remain difficult for the retargeting stage and can lead to wrinkles or folds that differ from the original model outfit.

The authors suggest two future directions: evaluating on datasets beyond THUman2.0 to test generalization across broader clothing styles, and extending the framework to avatar reposing from reconstructed 3D Gaussians so that users can control pose after personalization.

9. Takeaway

AvatarMix’s main contribution is a practical paradigm shift: it treats outfit personalization as a composition problem over high-fidelity 3D Gaussian avatars rather than as 2D garment synthesis or layered 3D garment modeling. By combining explicit geometric composition, mesh-based body reshaping, and diffusion-based seam/full-body refinement, the method achieves stronger identity preservation, better garment fidelity, and better multi-view consistency than the reported baselines on THUman2.0.