One-shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing

1 School of Computer Science and Technology, Guangdong University of Technology; 2 University of Oxford; 3 College of Computing and Data Science, Nanyang Technological University
yuzhu.ji a/t gdut.edu.cn

Broad view of our approach and supplementary video. We propose a unified framework for human motion transfer that achieves good fidelity in 2D appearance transfer by estimating 2D motion flow, while establishing pose accuracy through 2.5D geometric reasoning. We highlight our strengths for motion transfer in three aspects: 1) Recovering correct geometry and details. In the results from TED-Talks, the arm and hand poses of the source person are correctly transferred. 2) Handling back-to-front flips and heavy self-occlusion. Our model better handles front-to-back view motion transfer on the TaiChiHD dataset, producing correct geometry despite substantial self-occlusion. 3) Preserving the stability and consistency of the animated video. The examples from the iPER dataset show driving sequences with turning-around motion and large variation between source and driving images.

Abstract

Human motion transfer aims at animating a static source image with a driving video. While recent advances in one-shot human motion transfer have led to significant improvements in results, it remains challenging for methods based on 2D body landmarks, skeletons, and semantic masks to accurately capture correspondences between source and driving poses, due to the large variation in motion and articulation complexity. In addition, the limited accuracy and precision of DensePose degrade image quality for neural-rendering-based methods. To address these limitations, and considering the importance of both appearance and geometry for motion transfer, we propose a unified framework that combines multi-scale feature warping and neural texture mapping to recover better 2D appearance and 2.5D geometry, partly by exploiting the information from DensePose while adapting to its inherent limited accuracy. Our model takes advantage of multiple modalities by jointly training and fusing them, which allows it to learn robust neural texture features that cope with geometric errors, as well as multi-scale dense motion flows that better preserve appearance. Experimental results on full- and half-body video datasets demonstrate that our model generalizes well and achieves competitive results, and that it is particularly effective in handling challenging cases such as those with substantial self-occlusions.
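To make the idea of flow-guided multi-scale feature warping from the abstract concrete, the sketch below shows one common way such warping is implemented in PyTorch with F.grid_sample. It is an illustrative example only, not the released code: the function names and the convention that the flow holds per-pixel offsets in normalized coordinates are our assumptions.

import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a (B, C, H, W) feature map with a dense flow field.

    The flow is assumed to hold per-pixel offsets in normalized [-1, 1]
    coordinates, shaped (B, 2, H, W) with (dx, dy) channel order.
    """
    b, _, h, w = feat.shape
    # Identity sampling grid in normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Displace the grid by the predicted offsets and resample the features.
    offsets = flow.permute(0, 2, 3, 1)
    return F.grid_sample(feat, grid + offsets, align_corners=True)

def warp_pyramid(feats, flow):
    """Apply the same normalized flow to encoder features at several scales."""
    warped = []
    for f in feats:  # e.g. features at 1/4, 1/8 and 1/16 resolution
        scaled = F.interpolate(flow, size=f.shape[-2:], mode="bilinear",
                               align_corners=True)
        warped.append(warp_features(f, scaled))
    return warped

Expressing the flow as normalized offsets keeps the same field valid at every feature scale, so only spatial resampling is needed when moving across the pyramid.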

Pipeline

Overview of our proposed pipeline: The MotionNet (a) produces the dense motion flow and translation signals for the appearance and geometry translation branches. We combine multi-scale feature warping (b) and neural texture mapping (c) into a unified framework for motion transfer. The translated appearance and geometry features are integrated by BlenderNet (d) for image refinement.
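The caption above names MotionNet, the appearance and geometry translation branches, and BlenderNet. The skeleton below shows one plausible way these modules could be wired together in PyTorch, reusing the warp_features / warp_pyramid helpers sketched earlier. The DensePose-style UV sampling of a learnable neural texture, the texture shape, and all module interfaces are our assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionTransferPipeline(nn.Module):
    """Skeleton of the four-module layout in the figure; internals are placeholders."""

    def __init__(self, motion_net, appearance_encoder, blender_net,
                 texture_channels=16, texture_size=256):
        super().__init__()
        self.motion_net = motion_net                  # (a) dense motion flow prediction
        self.appearance_encoder = appearance_encoder  # multi-scale source features
        # (c) learnable neural texture, sampled with a DensePose-style UV map
        self.neural_texture = nn.Parameter(
            torch.randn(1, texture_channels, texture_size, texture_size) * 0.01)
        self.blender_net = blender_net                # (d) fuses and refines the output

    def forward(self, source_img, driving_pose, driving_uv):
        # (a) Dense motion flow from the source image and the driving pose signal.
        flow = self.motion_net(source_img, driving_pose)    # (B, 2, H, W)

        # (b) Appearance branch: warp multi-scale source features with the flow.
        feats = self.appearance_encoder(source_img)          # list of feature maps
        warped_feats = warp_pyramid(feats, flow)

        # (c) Geometry branch: sample the neural texture with the driving-pose UV map,
        #     given as normalized coordinates shaped (B, H, W, 2).
        texture = self.neural_texture.expand(source_img.size(0), -1, -1, -1)
        geometry_feat = F.grid_sample(texture, driving_uv, align_corners=True)

        # (d) BlenderNet integrates the two branches into the refined output image.
        return self.blender_net(warped_feats, geometry_feat)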

Results

Qualitative comparisons with the state of the art: We show results from TED-Talks (the first four sets), TaichiHD (the fourth and fifth sets), and iPER (the last two sets). These results illustrate that our model can animate both half- and full-body human images with (1) better geometry and details (see the first and second examples), (2) large pose variations, including front-to-back views and self-occlusion (see the fifth and seventh examples), and (3) better preservation of the stability and consistency of the animated video sequences.

BibTeX

@article{ji2024one,
  title={One-shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing},
  author={Yuzhu Ji and Chuanxia Zheng and Tat-Jen Cham},
  year={2024},
  eprint={2412.06174},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.06174},
}