Abstract
Although remarkable progress has been achieved in video generation, synthesizing physically plausible human actions remains an unresolved challenge, especially when fine-grained semantics and complex temporal dynamics are involved. For instance, generating gymnastics routines such as "two turns on one leg with the free leg optionally below horizontal" poses substantial difficulties for current video generation methods, which often fail to produce satisfactory results. To address this, we propose FinePhys, a Fine-grained human action generation framework that incorporates Physics to obtain effective skeletal guidance. Specifically, FinePhys first performs online 2D pose estimation and then lifts the poses to 3D through in-context learning. Recognizing that such data-driven 3D pose estimates may lack stability and interpretability, we incorporate a physics-based module that re-estimates motion dynamics using Euler-Lagrange equations, computing joint accelerations bidirectionally across the temporal dimension. The physically predicted 3D poses are then fused with the data-driven poses to provide multi-scale, 2D heatmap-based guidance for the video generation process. Evaluated on three fine-grained action subsets from FineGym (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines. Comprehensive qualitative results further demonstrate FinePhys's ability to generate more natural and plausible fine-grained human actions.
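To make the bidirectional, acceleration-based re-estimation concrete, the sketch below rolls a 3D joint trajectory forward and backward in time using second-order finite differences. The function name, the acceleration smoothing, and the equal blending weight are illustrative assumptions; in FinePhys the accelerations are obtained from the learned physics module via Euler-Lagrange dynamics rather than raw differences.

```python
import numpy as np

def bidirectional_reestimate(poses, blend=0.5):
    """Illustrative bidirectional rollout of a 3D joint trajectory.

    poses: (T, J, 3) data-driven joint positions.
    Returns a (T, J, 3) re-estimated sequence. This is only a
    finite-difference sketch of the idea; FinePhys's physics module derives
    the accelerations from Euler-Lagrange equations instead.
    """
    T = poses.shape[0]

    # Second-order temporal differences approximate joint accelerations.
    accel = np.zeros_like(poses, dtype=float)
    accel[1:-1] = poses[2:] - 2.0 * poses[1:-1] + poses[:-2]

    # Crude temporal smoothing stands in for the learned physics module,
    # so the rollouts below differ from the raw input trajectory.
    kernel = np.array([0.25, 0.5, 0.25])
    accel = np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="same"), 0, accel)

    # Forward rollout: integrate accelerations from the first two frames.
    fwd = poses.astype(float).copy()
    for t in range(2, T):
        fwd[t] = 2.0 * fwd[t - 1] - fwd[t - 2] + accel[t - 1]

    # Backward rollout: integrate from the last two frames in reverse.
    bwd = poses.astype(float).copy()
    for t in range(T - 3, -1, -1):
        bwd[t] = 2.0 * bwd[t + 1] - bwd[t + 2] + accel[t + 1]

    # Blend both directions into a single physically regularized estimate.
    return blend * fwd + (1.0 - blend) * bwd
```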
Demo Video
Framework
Overview of FinePhys. FinePhys addresses the challenging task of generating fine-grained human action videos by explicitly incorporating physical equations through the pose modality. The pipeline begins by estimating 2D poses online, then lifts them to 3D with an in-context learning module, yielding the data-driven 3D skeleton sequence $S^{3D}_{dd}$. To incorporate the physical laws of motion, we introduce a Phys-Net module that re-estimates the 3D position of each human joint by accounting for second-order temporal variations (i.e., accelerations) in both the forward and reverse directions, yielding the physically predicted 3D poses $S^{3D}_{pp}$. Subsequently, $S^{3D}_{dd}$ and $S^{3D}_{pp}$ are fused, projected back into 2D space, encoded into multi-scale latent maps, and integrated into 3D-UNets to guide the denoising process, as sketched below.
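The sketch below illustrates this final guidance stage under simplified assumptions: an equal-weight fusion of $S^{3D}_{dd}$ and $S^{3D}_{pp}$, a pinhole camera projection, and Gaussian heatmaps rendered at two resolutions. The function name, fusion weight, camera model, image size, and heatmap scales are assumptions made for illustration; the exact fusion and the encoding into latent maps for the 3D-UNets may differ in the paper.

```python
import numpy as np

def fuse_and_render_heatmaps(s3d_dd, s3d_pp, intrinsics, image_size=(256, 256),
                             scales=((64, 64), (32, 32)), weight=0.5, sigma=2.0):
    """Fuse two 3D skeleton sequences, project to 2D, and render heatmaps.

    s3d_dd, s3d_pp: (T, J, 3) camera-space joint positions.
    intrinsics:     3x3 pinhole camera matrix (assumed known here).
    Returns one (T, J, h, w) Gaussian heatmap volume per scale.
    """
    # Equal-weight fusion of the data-driven and physically predicted poses.
    fused = weight * s3d_dd + (1.0 - weight) * s3d_pp              # (T, J, 3)

    # Pinhole projection: multiply by the intrinsics, then divide by depth.
    proj = np.einsum('ij,tkj->tki', intrinsics, fused)
    pose2d = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)   # pixel coords

    T, J = pose2d.shape[:2]
    img_w, img_h = image_size
    heatmaps = []
    for h, w in scales:
        grid_y, grid_x = np.mgrid[0:h, 0:w]
        hm = np.zeros((T, J, h, w), dtype=np.float32)
        for t in range(T):
            for j in range(J):
                # Map image-pixel coordinates onto this heatmap resolution.
                cx = pose2d[t, j, 0] / img_w * w
                cy = pose2d[t, j, 1] / img_h * h
                hm[t, j] = np.exp(-((grid_x - cx) ** 2 + (grid_y - cy) ** 2)
                                  / (2.0 * sigma ** 2))
        # Each scale would then be encoded into latent maps for the 3D-UNets.
        heatmaps.append(hm)
    return heatmaps
```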
The source code for this project page is borrowed from CameraCtrl.