Akapulu Labs Research

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

HumanNOVA is a photorealistic, universal, and rapid method for creating 3D human avatars from a single image without test-time optimization. It uses large-scale synthetic and real training data plus token-conditioned feed-forward modeling, enabling fast and robust 3D human reconstructions in diverse conditions.

  • 3d-avatar
  • full-body
  • avatar
  • multimodal
  • realtime
  • one-shot

Demos

These demos showcase HumanNOVA's rapid, photorealistic 3D human avatar modeling from a single image without requiring test-time tuning. Watch for the detailed texture quality, accurate geometry, and robustness to diverse poses and viewpoints, demonstrating superior synthesis from the feed-forward, token-conditioned framework enabled by large-scale training data.

Authors: Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang, Jonathan C. Liu, Zhiwen Fan, Kai Wang, Zhangyang Wang, Georgios Pavlakos

Categories: cs.CV

Comment: CVPR 2026 Highlight

Published 2026-06-01 · Updated 2026-06-01

Abstract

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page at https://HumanNOVA.github.io .


1. Problem Setting and Main Idea

HumanNOVA addresses single-image 3D human avatar reconstruction, with the explicit goals of being photorealistic, universal, and rapid. The task is ill-posed because an image only reveals one side of the person, yet the model must infer hidden geometry, clothing, and texture. The paper argues that prior human-avatar methods often rely on per-instance optimization, diffusion-based hallucination, or multi-stage pipelines that are too slow for interactive use. HumanNOVA instead follows the feed-forward large reconstruction model paradigm and aims to recover a complete avatar in less than one second, with no test-time optimization.

Figure 1. Photorealistic, universal and rapid 3D human avatar modeling from a single image by the proposed approach, HumanNOVA. It benefits from both our generated large-scale data and feed-forward model design. Our data generation pipeline expands training data by 20 times (top-left for visualization). With this data, HumanNOVA achieves superior performance while maintaining rapid inference among existing methods (top-right). Once trained, it is universal without the need for test-time fine-tuning or adaptation. Qualitative results show that HumanNOVA produces more precise photorealistic reconstructions compared to the state-of-the-art SiTH method.

The core observation is that scaling data and injecting human-specific priors can move large reconstruction models from general objects to humans. HumanNOVA combines an input RGB image with an estimated simplified human mesh, specifically an SMPL mesh predicted by an off-the-shelf estimator, and maps these conditions to a triplane representation that is decoded into a 3D avatar.

2. Why This Problem Is Hard

The paper highlights two bottlenecks for human-oriented large reconstruction models. First, high-quality 3D human training data is scarce relative to general 3D object datasets such as Objaverse. Second, generic large reconstruction architectures are not built with human priors, even though humans have strong structure, pose, and clothing regularities that should be exploited. HumanNOVA therefore treats both data and architecture as necessary ingredients.

The intended use cases include virtual reality, telepresence, and human-computer interaction, where fast reconstruction and realistic rendering matter. The authors also explicitly note the task is difficult under in-the-wild inputs, unusual viewpoints, occlusions, and challenging garments such as dresses and overalls.

3. Overview of the HumanNOVA Pipeline

HumanNOVA follows a single feed-forward reconstruction pipeline with three major stages: multi-modal encoding, 2D-to-3D mapping, and rendering. The model consumes an RGB image $I \in \mathbb{R}^{H \times W \times 3}$ and an estimated SMPL mesh. The image is tokenized with DINOv2 into visual tokens, and the mesh is tokenized with PTv3 into mesh tokens. These token sets are fused by a mapping network into a triplane $\mathbf{T} \in \mathbb{R}^{3hw \times d}$, from which images can be rendered by standard ray marching at any target viewpoint.

Network architecture. Given a real-world input image, we first estimate its corresponding simplified human mesh. Image and mesh are fed into the multi-modal encoder to extract features which are utilized as the condition for the following mapping network. After that, a Transformer-based mapping network directly maps the features to the 3D triplane representation. From this triplane representation, our framework can render the 2D image given a camera viewpoint.

3.1 Multi-modal tokenization

The image encoder is DINOv2, which converts the input image into feature tokens $\mathbf{f}_i \in \mathbb{R}^{N_i \times d}$. The mesh prior comes from an estimated SMPL body surface, tokenized by PTv3 into mesh tokens $\mathbf{f}_m \in \mathbb{R}^{N_m \times d}$. The mesh is described as a coarse but robust human shape and pose prior rather than detailed geometry or appearance.

The paper emphasizes that this prior is helpful but not brittle: it improves structural reconstruction, yet the system can still lean on appearance cues when the mesh estimate is imperfect.

3.2 Mapping tokens to a triplane

The mapping module is based on PointInfinity-style transformer blocks and is inspired by SF3D. It updates a learnable triplane token set through cross-attention between the condition tokens and the triplane tokens. In each block, the paper describes a three-step fusion/refinement process:

$$ \mathbf{L}^l = \operatorname{CrossAttn}(\text{q} = \mathbf{f}_i \| \mathbf{f}_m,\, \text{kv} = \mathbf{T}^l), $$
$$ \mathbf{L}^l = \operatorname{CrossAttn}(\text{q} = \mathbf{L}^l,\, \text{kv} = \mathbf{f}_i \| \mathbf{f}_m), $$
$$ \mathbf{T}^{l+1} = \operatorname{CrossAttn}(\text{q} = \mathbf{T}^l,\, \text{kv} = \mathbf{L}^l). $$

Conceptually, the condition tokens query the current triplane, the resulting latent is refined by querying the conditions again, and then the triplane is updated from that refined latent. This design lets the model directly lift 2D appearance and coarse body structure into a 3D latent volume.
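The three steps above can be traced with a minimal single-head attention sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: learned projections, multiple heads, residual connections, and normalization are all omitted, and the function names and toy token counts are hypothetical. The point is the shape bookkeeping: the latent keeps the condition-token count, and the updated triplane keeps the triplane-token count.

```python
import math
import random


def cross_attn(q, kv):
    """Single-head scaled dot-product attention without learned projections.

    q:  list of N_q query vectors, each of length d
    kv: list of N_kv vectors used as both keys and values
    Returns N_q vectors of length d.
    """
    d = len(q[0])
    out = []
    for query in q:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in kv]
        # Numerically stable softmax over the keys.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the values.
        out.append([sum(w * v[j] for w, v in zip(weights, kv)) for j in range(d)])
    return out


def mapping_block(f_i, f_m, triplane):
    """One fusion/refinement step following the three equations above."""
    cond = f_i + f_m                      # f_i || f_m, concatenated along the token axis
    latent = cross_attn(cond, triplane)   # conditions query the current triplane
    latent = cross_attn(latent, cond)     # latent is refined against the conditions
    return cross_attn(triplane, latent)   # triplane is updated from the latent


def random_tokens(n, d, rng):
    """Hypothetical helper: n random tokens of width d."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d = 8                                         # toy feature width
f_i = random_tokens(5, d, rng)                # N_i = 5 image tokens
f_m = random_tokens(3, d, rng)                # N_m = 3 mesh tokens
triplane = random_tokens(3 * 2 * 2, d, rng)   # 3hw tokens with h = w = 2

updated = mapping_block(f_i, f_m, triplane)
print(len(updated), len(updated[0]))  # prints: 12 8
```

Because the output of each attention call has as many rows as its query set, the number of triplane tokens is fixed by the learnable token set and does not depend on how many image or mesh tokens the encoders produce.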

3.3 Triplane rendering

After mapping, the triplane is rendered with the standard ray-marching procedure used by large reconstruction models. Given a target camera viewpoint $\Phi$, the renderer produces an image $\hat{I}_\Phi = \pi(\mathbf{T}, \Phi)$. The paper uses a NeRF-style MLP decoder for volumetric rendering, with 10 layers, width 60, SiLU activations, and 128 samples per ray.
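For readers unfamiliar with ray marching, the compositing step can be sketched with the standard NeRF quadrature: each sample gets an opacity from its density, and colors are blended front to back by the remaining transmittance. This is a generic sketch, assuming densities and colors have already been decoded for every sample; the triplane lookup and the MLP decoder are omitted, and the ray bounds are hypothetical values. Only the count of 128 samples per ray comes from the text.

```python
import math


def composite_ray(densities, colors, deltas):
    """Standard NeRF-style quadrature along one ray.

    densities: non-negative density (sigma) per sample
    colors:    (r, g, b) per sample
    deltas:    spacing between consecutive samples
    Returns (rgb, accumulated_opacity).
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    opacity = 0.0
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution to the pixel
        for c in range(3):
            rgb[c] += weight * color[c]
        opacity += weight
        transmittance *= 1.0 - alpha             # light left for later samples
    return rgb, opacity


num_samples = 128        # samples per ray, as reported in the text
near, far = 0.5, 2.5     # hypothetical ray bounds
deltas = [(far - near) / num_samples] * num_samples
red = [(1.0, 0.0, 0.0)] * num_samples

# Empty space: zero density everywhere, so nothing is accumulated.
empty_rgb, empty_opacity = composite_ray([0.0] * num_samples, red, deltas)

# A dense red slab in the middle of the ray saturates the pixel.
slab = [50.0 if 48 <= i < 80 else 0.0 for i in range(num_samples)]
slab_rgb, slab_opacity = composite_ray(slab, red, deltas)
```

The accumulated opacity returned here is the kind of quantity a mask loss can compare against a foreground mask.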

3.4 Training losses

HumanNOVA is trained with a weighted sum of RGB reconstruction loss, mask loss, and LPIPS loss:

$$ \mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathcal{L}_r^n + \lambda_m \mathcal{L}_m^n + \lambda_p \mathcal{L}_p^n \right), $$

where $N$ is the number of rendered supervision views. The paper sets $\lambda_m = 0.5$ and $\lambda_p = 0.5$. The mask loss enforces consistency between the accumulated density and the foreground mask, while RGB and LPIPS encourage accurate appearance and perceptual fidelity. To reduce memory usage, the authors compute the losses on foreground-biased image patches rather than full images.
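As a small worked example of the objective above, the sketch below combines per-view loss terms with the reported weights $\lambda_m = 0.5$ and $\lambda_p = 0.5$. The per-view numbers are made up for illustration, and the function name is hypothetical; how each individual term is computed is not reproduced here.

```python
def total_loss(rgb_losses, mask_losses, lpips_losses, lambda_m=0.5, lambda_p=0.5):
    """Average over views of L_r + lambda_m * L_m + lambda_p * L_p."""
    n = len(rgb_losses)
    if n == 0 or not (n == len(mask_losses) == len(lpips_losses)):
        raise ValueError("expected the same non-zero number of views for every term")
    per_view = [
        l_r + lambda_m * l_m + lambda_p * l_p
        for l_r, l_m, l_p in zip(rgb_losses, mask_losses, lpips_losses)
    ]
    return sum(per_view) / n


# Hypothetical per-view loss values for N = 4 supervision views.
rgb = [0.020, 0.030, 0.025, 0.025]
mask = [0.010, 0.010, 0.020, 0.000]
lpips = [0.200, 0.180, 0.220, 0.200]

loss = total_loss(rgb, mask, lpips)
print(round(loss, 4))  # prints: 0.13
```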

4. Data Generation: The Main Enabler

The paper’s main technical contribution is not just the model, but the data pipeline used to make human reconstruction feasible at LRM scale. The authors generate a training set of about 100k assets in total, combining synthetic and real-world sources. This is reported as roughly 20 times larger than the combined size of the existing human datasets they compare against.

4.1 Synthetic data from rigged assets

For synthetic data, the paper uses rigged human assets and animates them with poses sampled from AMASS. In the supplementary material, the authors specify that they use all 1,000 publicly released SynBody characters, spanning diverse body shapes, skin tones, and about 68 clothing templates, including dresses, T-shirts, coats, and pants. The generated synthetic subset contains 78k assets. On average, about 26 views are rendered per asset from camera positions randomly distributed on a sphere, with azimuth in $[0^\circ, 360^\circ]$ and elevation in $[-45^\circ, 60^\circ]$.

The synthetic pipeline is straightforward: sample SMPL-X parameters from AMASS, animate the rigged asset, re-center the animated human, and render multiple canonical views. This strategy produces broad pose variation and helps the model learn diverse clothing configurations while retaining geometry consistency.
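The camera sampling described above can be sketched as follows. The azimuth and elevation ranges and the figure of about 26 views per asset come from the text; uniform sampling in the two angles, the camera radius, and the Y-up axis convention are assumptions of this sketch, since the summary does not specify them.

```python
import math
import random


def sample_camera(rng, radius=2.0):
    """Sample one camera position on a sphere around the re-centered subject.

    Azimuth is drawn from [0, 360] degrees and elevation from [-45, 60]
    degrees. Uniform angle sampling and the radius are assumptions.
    """
    azimuth = rng.uniform(0.0, 360.0)
    elevation = rng.uniform(-45.0, 60.0)
    az, el = math.radians(azimuth), math.radians(elevation)
    # Y-up convention (an assumption): elevation lifts the camera along +Y.
    position = (
        radius * math.cos(el) * math.sin(az),
        radius * math.sin(el),
        radius * math.cos(el) * math.cos(az),
    )
    return azimuth, elevation, position


rng = random.Random(0)
views = [sample_camera(rng) for _ in range(26)]  # about 26 views per asset
```

Each returned position would then be paired with a look-at rotation toward the subject's center before rendering.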

4.2 Real-world data from multi-camera capture

The real-world subset is built from multi-camera human capture datasets such as DNA-Rendering and MVHumanNet. The authors fit a 3D Gaussian Splatting representation to the captured subject, initializing one Gaussian per mesh vertex using the subject’s SMPL-X mesh. The optimization is driven by a photometric loss over the captured views, with adaptive density control to improve convergence and coverage. After fitting, they re-render the fitted subject from canonical viewpoints to obtain additional supervision images.

In the supplementary material, they report that this subset contains 22k assets and that the Gaussian optimization runs for only 4,000 iterations, with densification performed between iterations 400 and 1,500. They also report a quantitative self-check on the re-rendered fitted data, with average PSNR / SSIM / LPIPS of $36.23 / 0.9881 / 16.57$, suggesting that the fitted data remains high quality.

A notable implementation detail is that the SMPL-X-based initialization is far better than standard COLMAP initialization for this setting. On a 10-sample test, the paper reports PSNR / SSIM / LPIPS of $36.38 / 0.9886 / 16.36$ for the proposed initialization versus $16.49 / 0.9025 / 68.79$ for COLMAP. The authors also state that the improved initialization reduces optimization time from about 40 minutes to about 4 minutes.
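The reported fitting schedule can be written down as a small configuration sketch. The iteration counts come from the text; the densification interval is a hypothetical value, because the summary gives only the window, and the vertex count assumes the standard SMPL-X template mesh.

```python
TOTAL_ITERATIONS = 4000      # reported length of the Gaussian optimization
DENSIFY_START = 400          # reported first iteration of densification
DENSIFY_END = 1500           # reported last iteration of densification
DENSIFY_EVERY = 100          # hypothetical interval; not stated in the text
SMPLX_NUM_VERTICES = 10475   # vertex count of the standard SMPL-X template


def should_densify(iteration):
    """True on iterations where adaptive density control would run."""
    in_window = DENSIFY_START <= iteration <= DENSIFY_END
    return in_window and (iteration - DENSIFY_START) % DENSIFY_EVERY == 0


initial_gaussians = SMPLX_NUM_VERTICES  # one Gaussian per mesh vertex
densify_steps = [it for it in range(1, TOTAL_ITERATIONS + 1) if should_densify(it)]
print(initial_gaussians, len(densify_steps), densify_steps[0], densify_steps[-1])
# prints: 10475 12 400 1500
```

Starting from one Gaussian per body-surface vertex places the initial primitives on the subject, which is consistent with the reported gap between this initialization and COLMAP.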

Visualization of our training data. (Best viewed in color.) The first three rows correspond to real-world generated data, while the remaining rows are generated synthetic data.

4.3 Why both synthetic and real data matter

The paper repeatedly shows that the two data sources are complementary. Synthetic assets contribute pose diversity and large scale, while real-world captures provide realistic appearance statistics and details that are hard to synthesize. Ablations show that removing either component hurts performance, and increasing the proportion of generated data improves results consistently.

5. Training Setup and Evaluation Protocol

HumanNOVA is implemented in PyTorch and trained on 64 NVIDIA H100 GPUs using AdamW with learning rate $6 \times 10^{-4}$ and batch size 64. The triplane spatial size is 96, and the input resolution is $512 \times 512$. During training, the model supervises 4 rendered views per instance. Patches of size $180 \times 180$ are used for loss computation, selected according to foreground coverage so that supervision focuses on the human body rather than background.
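One way to realize foreground-biased patch selection is sketched below. The exact selection rule is not given in this summary, so the strategy here (draw a few random candidate patches and keep the one with the highest foreground coverage) is an assumption, as are all function names. A summed-area table makes each coverage query constant-time.

```python
import random


def integral_image(mask):
    """Summed-area table with a zero border: sat[y][x] sums mask rows < y, cols < x."""
    h, w = len(mask), len(mask[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += mask[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def coverage(sat, top, left, size):
    """Fraction of foreground pixels in the size x size patch at (top, left)."""
    bottom, right = top + size, left + size
    total = sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]
    return total / (size * size)


def pick_patch(mask, size, rng, num_candidates=16):
    """Return (top, left) of the random candidate with the highest coverage."""
    h, w = len(mask), len(mask[0])
    sat = integral_image(mask)
    candidates = [
        (rng.randint(0, h - size), rng.randint(0, w - size))
        for _ in range(num_candidates)
    ]
    return max(candidates, key=lambda tl: coverage(sat, tl[0], tl[1], size))


# Toy 64 x 64 mask with a 20-pixel-wide vertical "body" strip.
# A patch size of 24 stands in for the 180 x 180 patches used at 512 x 512.
mask = [[1 if 22 <= x < 42 else 0 for x in range(64)] for _ in range(64)]
top, left = pick_patch(mask, size=24, rng=random.Random(0))
```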

For the human-scan datasets THuman2, CustomHuman, and 2K2K, the paper follows a unified preprocessing protocol: each mesh is placed in a canonical camera setup and rendered into 36 multiview images at 10-degree intervals around a full horizontal circle. Importantly, supervision uses rendered images rather than mesh geometry directly.
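The turntable layout above is easy to enumerate: 36 views at 10-degree steps cover the full circle. In this sketch the camera radius, height, and Y-up convention are hypothetical; the text fixes only the number of views and the angular spacing.

```python
import math


def turntable_cameras(num_views=36, radius=2.0, height=0.0):
    """Camera positions on a horizontal circle at equal azimuth steps."""
    step = 360.0 / num_views
    cameras = []
    for k in range(num_views):
        azimuth = k * step
        a = math.radians(azimuth)
        position = (radius * math.sin(a), height, radius * math.cos(a))
        cameras.append((azimuth, position))
    return cameras


cameras = turntable_cameras()
azimuths = [azimuth for azimuth, _ in cameras]
print(len(azimuths), azimuths[1] - azimuths[0], azimuths[-1])  # prints: 36 10.0 350.0
```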

Evaluation is performed on CustomHuman, THuman2, and 2K2K. The main image metrics are PSNR, SSIM, and LPIPS. For geometry, the paper reports Chamfer Distance (CD), Normal Consistency (NC), and F-Score. For fair comparison with mesh-based methods, the authors align the predicted meshes to the ground truth using scale alignment and ICP before rendering and geometry evaluation.
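For reference, the sketch below shows common textbook definitions of three of these metrics on toy data. Conventions vary between papers (squared versus unsquared distances, one-directional versus symmetric Chamfer, the F-Score threshold, and point sampling density), and this summary does not state which ones the authors use, so the definitions and all values here are assumptions for illustration.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbor in dst (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: sum of the two mean nearest-neighbor distances."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    return sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt)


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = sum(d <= threshold for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= threshold for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy point sets: two predicted points are close to the surface, one is far off.
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred_points = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]

score = f_score(pred_points, gt_points, threshold=0.2)
distance = chamfer_distance(pred_points, gt_points)
```

In this toy case two of three points match in each direction, so precision and recall are both 2/3 and the F-Score is 2/3 as well.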

6. Main Quantitative Results

The paper compares HumanNOVA against both human-specific methods and general 3D reconstruction methods. The baselines include Real3D, SF3D, Trellis, Hunyuan2, PaMIR, SiFU, and SiTH for rendering quality, and additional mesh/shape methods for geometry evaluation. The central result is that HumanNOVA is consistently best on all reported benchmarks, under both frontal-view and side-view inputs.

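Main rendering-quality comparison on CustomHuman, THuman2, and 2K2K.

| Method | CustomHuman PSNR | CustomHuman SSIM | CustomHuman LPIPS | THuman2 PSNR | THuman2 SSIM | THuman2 LPIPS | 2K2K PSNR | 2K2K SSIM | 2K2K LPIPS |
|---|---|---|---|---|---|---|---|---|---|
| Real3D | 17.13 | 0.8990 | 95.12 | 19.14 | 0.9094 | 87.68 | 18.06 | 0.9020 | 81.78 |
| SF3D | 19.46 | 0.9113 | 66.09 | 22.28 | 0.9287 | 57.20 | 20.47 | 0.9142 | 58.14 |
| Trellis | 18.59 | 0.9123 | 74.98 | 20.77 | 0.9218 | 65.67 | 19.21 | 0.9140 | 68.25 |
| Hunyuan2 | 19.42 | 0.9094 | 74.34 | 21.44 | 0.9257 | 66.19 | 19.87 | 0.9145 | 65.62 |
| PaMIR | 18.15 | 0.9070 | 88.12 | 21.03 | 0.9229 | 70.91 | 18.89 | 0.9113 | 73.90 |
| SiFU | 17.94 | 0.9091 | 85.75 | 19.44 | 0.9157 | 79.62 | 16.82 | 0.9039 | 87.51 |
| SiTH | 19.13 | 0.9173 | 72.94 | 20.92 | 0.9231 | 66.90 | 18.49 | 0.9095 | 73.55 |
| HumanNOVA | 22.29 | 0.9360 | 42.42 | 23.96 | 0.9382 | 42.13 | 22.65 | 0.9336 | 41.72 |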

The paper highlights that, relative to SiTH, the strongest human-specific baseline in the table, HumanNOVA reduces LPIPS by 41.8% on CustomHuman, 37.0% on THuman2, and 43.3% on 2K2K in the frontal-view setting. Among the general-purpose methods, SF3D reaches lower LPIPS than SiTH, and HumanNOVA still improves on it on all three benchmarks. The authors emphasize that the gains come from both stronger human priors and the much larger training dataset.
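The quoted percentages can be reproduced directly from the frontal-view LPIPS values in the rendering-quality table. The helper name below is hypothetical; the numbers are copied from the SiTH and HumanNOVA rows.

```python
def relative_reduction(baseline, ours):
    """Relative reduction in percent for a lower-is-better metric."""
    return 100.0 * (baseline - ours) / baseline


# Frontal-view LPIPS values copied from the rendering-quality table.
lpips = {
    "CustomHuman": {"SiTH": 72.94, "HumanNOVA": 42.42},
    "THuman2": {"SiTH": 66.90, "HumanNOVA": 42.13},
    "2K2K": {"SiTH": 73.55, "HumanNOVA": 41.72},
}

gains = {
    name: round(relative_reduction(row["SiTH"], row["HumanNOVA"]), 1)
    for name, row in lpips.items()
}
print(gains)  # prints: {'CustomHuman': 41.8, 'THuman2': 37.0, '2K2K': 43.3}
```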

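Additional geometry-quality comparison on CustomHuman, THuman2, and 2K2K.

| Method | CustomHuman CD | CustomHuman NC | CustomHuman F-Score | THuman2 CD | THuman2 NC | THuman2 F-Score | 2K2K CD | 2K2K NC | 2K2K F-Score |
|---|---|---|---|---|---|---|---|---|---|
| SF3D | 1.738/2.040 | 0.847 | 39.585 | 1.441/1.745 | 0.833 | 43.820 | 1.204/1.412 | 0.829 | 50.900 |
| Trellis | 2.125/2.175 | 0.801 | 32.846 | 1.799/1.832 | 0.796 | 37.939 | 1.446/1.359 | 0.805 | 48.826 |
| Hunyuan2 | 1.799/1.762 | 0.837 | 38.365 | 1.562/1.541 | 0.808 | 43.868 | 1.237/1.217 | 0.829 | 53.946 |
| ICON | 2.468/2.915 | 0.779 | 27.731 | 2.568/3.168 | 0.752 | 26.453 | 2.211/3.331 | 0.728 | 28.805 |
| ECON | 2.160/2.813 | 0.804 | 33.429 | 2.240/3.931 | 0.763 | 31.294 | 2.066/6.232 | 0.732 | 32.927 |
| SiFU | 2.440/3.203 | 0.789 | 27.553 | 2.509/3.778 | 0.760 | 27.487 | 2.136/5.331 | 0.732 | 29.823 |
| SiTH | 1.792/2.215 | 0.826 | 36.822 | 1.741/2.082 | 0.805 | 39.666 | 1.518/1.896 | 0.798 | 42.859 |
| HumanNOVA | 1.062/1.102 | 0.867 | 61.379 | 1.027/1.098 | 0.840 | 61.939 | 1.045/1.110 | 0.836 | 60.673 |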

The geometry results are especially strong. The paper reports that HumanNOVA attains the best CD, NC, and F-Score on all three benchmarks, and on side-view inputs it achieves a 94.3% relative F-Score gain over SiTH on CustomHuman.

7. Ablation Studies and What They Show

The ablations are useful because they separate the contributions of the data pipeline from the model design. Unless otherwise stated, these ablations are run for 70% of the main training iterations and evaluated on CustomHuman using frontal-view input.

7.1 Data ablations

Ablation on the generated data type and scale on CustomHuman.

| Setting | PSNR | SSIM | LPIPS |
|---|---|---|---|
| w/o gen-data (assets) | 21.84 | 0.9333 | 46.51 |
| w/o gen-data (multi-cam) | 21.76 | 0.9326 | 47.83 |
| 25% generated data | 21.98 | 0.9313 | 50.14 |
| 50% generated data | 22.02 | 0.9338 | 47.03 |
| HumanNOVA | 22.07 | 0.9344 | 45.18 |

These experiments show that both synthetic assets and multi-camera real data are important. More generated data consistently improves results, and the synthetic component is particularly helpful for pose diversity, while the real component helps photorealism.

The paper also verifies the utility of the generated dataset by fine-tuning Real3D on it. Across CustomHuman, THuman2, and 2K2K, Real3D improves substantially both for frontal and side-view input. This supports the claim that the dataset itself is a meaningful contribution, not just a training trick for the proposed architecture.

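Effectiveness of the generated data when used to fine-tune Real3D.

| Method | CustomHuman PSNR | CustomHuman SSIM | CustomHuman LPIPS | THuman2 PSNR | THuman2 SSIM | THuman2 LPIPS | 2K2K PSNR | 2K2K SSIM | 2K2K LPIPS |
|---|---|---|---|---|---|---|---|---|---|
| Real3D | 17.13 | 0.8990 | 95.12 | 19.14 | 0.9094 | 87.68 | 18.06 | 0.9020 | 81.78 |
| Real3D + our data | 20.97 | 0.9268 | 58.54 | 23.10 | 0.9325 | 55.30 | 20.91 | 0.9202 | 58.22 |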

7.2 Model ablations

Ablation on model settings on CustomHuman.

| Setting | PSNR | SSIM | LPIPS |
|---|---|---|---|
| w/o mesh prior | 21.89 | 0.9334 | 46.26 |
| small triplane size (32) | 21.78 | 0.9323 | 48.33 |
| HumanNOVA | 22.07 | 0.9344 | 45.18 |

The mesh prior improves LPIPS by about 2.3% relative, showing that coarse body structure remains valuable even with large-scale training. Reducing the triplane spatial size from 96 to 32 causes a noticeable drop, indicating that human reconstruction benefits from a higher-capacity 3D latent than what is often sufficient for general object LRMs.
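Both numbers in this paragraph can be checked with a few lines of arithmetic. The token count assumes three square planes with one token per cell, that is $3hw$ with $h = w$ equal to the reported spatial size; the actual tokenization may differ. The LPIPS values are copied from the model-ablation table.

```python
def triplane_tokens(spatial_size):
    """Number of triplane tokens, assuming three square planes (3 * h * w, h = w)."""
    return 3 * spatial_size * spatial_size


full, small = triplane_tokens(96), triplane_tokens(32)
print(full, small, full // small)  # prints: 27648 3072 9

# LPIPS from the model-ablation table: without and with the mesh prior.
without_prior, with_prior = 46.26, 45.18
mesh_gain = 100.0 * (without_prior - with_prior) / without_prior
print(round(mesh_gain, 1))  # prints: 2.3
```

Under that assumption, shrinking the spatial size from 96 to 32 leaves the model with one ninth of the triplane tokens, which is one way to read the capacity argument above.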

Additional ablations from the supplementary material.

| Setting | PSNR | SSIM | LPIPS |
|---|---|---|---|
| Visual encoder: DINOv2 -> Sapiens | 21.98 | 0.9327 | 46.52 |
| Fusion layers: 4 -> 2 | 21.42 | 0.9301 | 50.65 |
| Supervision views: 4 -> 2 | 22.07 | 0.9344 | 45.18 |

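The most important takeaway from these extra ablations is that sufficient cross-modal fusion depth matters. Halving the number of fusion layers from 4 to 2 hurts performance the most, especially LPIPS (50.65 versus 45.18 for the default setting), while swapping DINOv2 for Sapiens causes a smaller drop. The paper also argues for richer multiview supervision during training. Note, however, that the supervision-views row above is identical to the default HumanNOVA ablation result (22.07 / 0.9344 / 45.18), so the table as given does not quantify that effect.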

Visual results on the effectiveness of the SMPL prior.
Visual results on the robustness of HumanNOVA under inaccurate SMPL estimates.

8. Qualitative Findings and Failure Modes

The qualitative comparisons reinforce the quantitative trends. HumanNOVA reconstructs sharper clothing boundaries, more plausible body structure, and more stable appearance across viewpoints than the baselines. This is shown both on benchmark images and on in-the-wild images with more challenging backgrounds and poses.

Qualitative comparison with state-of-the-art methods, including input from benchmarks (top) and in-the-wild images (bottom). The reconstructed human by our method shows superior structure and appearance. (Best viewed in color.)
Qualitative evaluation of our approach with in-the-wild images as input. We also show some typical failure cases (bottom), e.g., inferring the plausible back texture of challenging clothes like dresses and overalls. (Best viewed in color.)

The failure cases are informative: the model can struggle to infer plausible back-side textures for especially ambiguous garments, such as dresses and overalls. The supplementary material also notes that extremely inaccurate SMPL estimates can still cause failures, even though the system is generally robust and tends to prioritize appearance cues when the mesh prior is noisy.

9. Supplementary Analysis

The supplementary section adds three useful points. First, it compares HumanNOVA with animation-based methods such as SHERF and LHM, which target a related but different setting: animatable avatar reconstruction that depends heavily on accurate pose alignment. HumanNOVA's setting is described as more challenging because the model must work from only a single image and an off-the-shelf SMPL estimate. On the reported comparison, HumanNOVA substantially outperforms both methods, with PSNR / SSIM / LPIPS of $22.29 / 0.9360 / 42.42$ versus SHERF’s $16.83 / 0.9037 / 87.99$ and LHM’s $17.75 / 0.9083 / 76.85$.

Second, a leave-one-out experiment on CustomHuman shows that the model generalizes well to unseen settings. The leave-one-out result is $21.99 / 0.9344 / 44.22$, close to the full model’s $22.29 / 0.9360 / 42.42$.

Third, the authors discuss the Janus problem in their video results and argue that HumanNOVA is less affected because it models the full 3D human directly rather than separately generating front and back views and merging them heuristically.

10. Limitations, Broader Impact, and Practical Takeaways

The paper is candid about remaining limitations. HumanNOVA still struggles when the input image contains severe occlusion, when the back side of the clothing is highly ambiguous, or when the estimated SMPL mesh is completely wrong. The authors also note that extending the method to human-human or human-object interactions is a promising direction for future work.

The supplementary broader-impact discussion flags misuse concerns. Because the method lowers the barrier to producing realistic 3D human reconstructions, it could potentially be used for deceptive content, harassment, or privacy violations. The paper therefore calls for both technical safeguards and regulatory attention.

The practical takeaway is that the paper’s gains are driven by a clear recipe: scale human data aggressively, combine synthetic pose diversity with real capture realism, and adapt a large reconstruction backbone with a human-specific mesh prior and a sufficiently expressive triplane. Within that design space, HumanNOVA is presented as a strong proof that feed-forward LRM-style models can work well for humans, not just for generic objects.

11. Summary of Contributions

  • A feed-forward single-image human avatar model that reconstructs photorealistic 3D humans in under one second, with no test-time optimization.
  • A scalable data pipeline that combines synthetic rigged assets animated with AMASS poses and real multi-camera captures fitted with 3D Gaussian Splatting.
  • A token-conditioned transformer architecture that fuses DINOv2 image tokens and PTv3 mesh tokens into a triplane representation.
  • Strong quantitative improvements over prior human-specific and general reconstruction methods on rendering and geometry metrics across CustomHuman, THuman2, and 2K2K.
  • Ablations showing that both data sources, the mesh prior, sufficient triplane capacity, and adequate fusion depth are important for the final quality.