Details about my thesis, "Expressive 3D Human Motion Generation: A Pipeline for Emotion-Conditioned Synthesis with a Novel Dataset", written during the Artificial Intelligence master's program at the Faculty of Mathematics and Computer Science, University of Bucharest, and defended in 2025.
The thesis presents a pipeline for generating expressive, comprehensive 3D behavioral human motions conditioned on emotion, targeting digital domains such as video games, virtual reality, and animation. It addresses the limitations of current emotion-based motion synthesis, which often relies on non-whole-body or in-the-wild datasets, or on expensive, labor-intensive traditional motion capture (mocap). By leveraging a novel whole-body behavioral video dataset recorded under controlled (not in-the-wild) conditions, together with state-of-the-art deep learning and computer vision techniques, the pipeline generates 3D motions that are faithful to human anatomy and to the specified emotions, all without the need for motion capture technology.
The proposed dataset has been uploaded at this Hugging Face link.
The pipeline consisted of the following components:
| Category | Technology | Purpose in Pipeline |
|---|---|---|
| Generative Model | MDM (transformer encoder-only backbone) | Trained from scratch for Action-to-Motion (A2M) synthesis to generate expressive emotional motions based on the features extracted by HybrIK-X. |
| Mesh Recovery | HybrIK-X (HRNet-W48 + RLE backbone) | Pretrained state-of-the-art inverse kinematics framework used for fast and accurate whole-body mesh recovery in order to perform feature extraction on the novel dataset. |
| Human Body Model | SMPL-X | Parametric 3D human body model used by HybrIK-X to represent the human body along with its shapes and poses. |
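As a rough illustration of how A2M conditioning works in an encoder-only diffusion backbone of this kind, the sketch below embeds the emotion label, adds it to the diffusion timestep embedding, and prepends the result as an extra token to the noised motion sequence. All dimensions and module names are hypothetical and for illustration only; this is not the authors' implementation:

```python
import torch
import torch.nn as nn

# Illustrative constants (assumptions, not the thesis' actual configuration).
NUM_EMOTIONS = 6      # anger, disgust, fear, happiness, sadness, surprise
LATENT_DIM = 512      # hypothetical transformer width
POSE_DIM = 24 * 6     # e.g., 24 joint rotations in a 6D representation (assumption)

action_emb = nn.Embedding(NUM_EMOTIONS, LATENT_DIM)   # emotion label -> vector
timestep_emb = nn.Embedding(1000, LATENT_DIM)         # diffusion step -> vector
pose_proj = nn.Linear(POSE_DIM, LATENT_DIM)           # per-frame pose -> token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=LATENT_DIM, nhead=8, batch_first=True),
    num_layers=8,
)
out_proj = nn.Linear(LATENT_DIM, POSE_DIM)            # tokens -> denoised poses

def denoise(noised_motion, t, emotion_id):
    """One denoising call: (batch, frames, POSE_DIM) -> same shape."""
    cond = action_emb(emotion_id) + timestep_emb(t)   # (batch, LATENT_DIM)
    tokens = torch.cat([cond.unsqueeze(1), pose_proj(noised_motion)], dim=1)
    hidden = encoder(tokens)
    return out_proj(hidden[:, 1:])                    # drop the condition token

x = torch.randn(2, 160, POSE_DIM)                     # 8 s at 20 FPS = 160 frames
pred = denoise(x, torch.tensor([500, 500]), torch.tensor([4, 4]))  # 4 = "sadness"
```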
Feature extraction and the training of the generative model were done on a remotely controlled NVIDIA H100 80GB HBM3 GPU owned by the university. Initial attempts to run these on the local setup, featuring an NVIDIA RTX 2060 Mobile with 6GB of VRAM, proved significantly slower: extracting features from a single video took up to 10 minutes, and a single training epoch (out of tens of thousands) took around 4 minutes.
The thesis makes three main contributions to the field of 3D human motion synthesis:
- Novel Dataset: Introduction of a new, curated dataset containing videos of people expressing the six fundamental emotions: anger, disgust, fear, happiness, sadness, and surprise. The dataset is characterized by expressive whole-body features and adherence to the emotion identification guidelines proposed by psychology professor Paul Ekman.
- Whole-Body Feature Extraction: Extraction of whole-body meshes and their corresponding features (joints and joint rotations) from the proposed dataset using a pretrained version of the HybrIK-X method.
- Action-to-Motion (A2M) Synthesis: Training the MDM model from scratch on the features extracted from the novel dataset for the task of A2M synthesis, and running its inference, where a simple emotion label (e.g., "anger") generates a fitting 3D motion.
Additional dataset details:
- Size: 1,048 samples, with roughly the same number of samples in each of the six emotion categories.
- Recording standard: The videos were captured with a single-view, frontal camera setup. They feature clutter-free backgrounds, with the person's full body visible throughout, and movements starting and ending in a neutral position to ensure clean data for the AI systems. Each video is in RGB format and lasts up to roughly 8 seconds.
- Cleaning: The main cleaning steps were the removal of videos that did not express exactly one dominant emotion from the six chosen categories, and of videos in which the start/end neutral pose was too short or missing. Before this step, the dataset contained 1,728 videos.
- Preprocessing: This was done so that subsequent pipeline stages would take less time. It consisted of using the FFmpeg multimedia tool to reduce the frame rate to 20 FPS and downscale 3840 x 2160 videos to 1920 x 1080 (a sketch of this step follows after this list).
- Extracted features: Features were extracted based on the SMPL-X body model. Within the scope of this thesis, only body features (24 joints and 24 joint rotations) were used, following the same kinematic chain as the SMPL body model, but all features (body, hand, and facial; 127 joints and 55 joint rotations) were preserved for potential future whole-body motion generation work (a second sketch after this list illustrates the feature shapes).
- Contribution: 30 students from the faculty (including myself), along with a coordinating teacher, helped realize the dataset. All students added samples, while only I and another colleague, Monica-Andreea Gîrbea, contributed ideas to the conception of the dataset and further cleaned the samples gathered by everyone. The coordinating teacher guided the dataset's development, proposing details about the recording standard, dataset size, and emotion categories.
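A minimal sketch of the preprocessing step, using standard FFmpeg options for frame rate and scaling (the exact invocation used in the thesis is not given here, and the directory paths are hypothetical):

```python
import subprocess
from pathlib import Path

SRC = Path("dataset/raw")            # hypothetical input directory
DST = Path("dataset/preprocessed")   # hypothetical output directory
DST.mkdir(parents=True, exist_ok=True)

for video in sorted(SRC.glob("*.mp4")):
    # -r 20 caps the output frame rate at 20 FPS; the scale filter shrinks
    # 3840 x 2160 clips to 1920 wide (with even height) and leaves 1080p clips untouched.
    subprocess.run([
        "ffmpeg", "-i", str(video),
        "-r", "20",
        "-vf", "scale='min(1920,iw)':-2",
        str(DST / video.name),
    ], check=True)
```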
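And, for concreteness, the per-video features can be pictured as arrays with the following shapes (a sketch assuming joint positions are stored as 3D coordinates and rotations as 3 x 3 matrices; the actual on-disk layout of the thesis data may differ):

```python
import numpy as np

FPS, MAX_SECONDS = 20, 8
T = FPS * MAX_SECONDS                   # up to ~160 frames per clip

# Full SMPL-X extraction, preserved for future whole-body work:
joints_full = np.zeros((T, 127, 3))     # body + hand + facial joint positions
rots_full = np.zeros((T, 55, 3, 3))     # 55 joint rotations (assumed 3x3 matrices)

# Body-only subset used for training, following the SMPL kinematic chain
# (the slice below assumes SMPL-compatible joint ordering, for illustration):
joints_body = joints_full[:, :24]       # 24 body joints
rots_body = rots_full[:, :24]           # 24 body joint rotations
```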
In the authors' code for training the MDM model, the evaluation metrics are computed using RNN models pretrained on a dataset different from the proposed one, which extract features from generated motions and classify them. As such, model selection proved difficult, because the evaluation metrics were not representative of the model's actual performance. Moreover, their code did not allow for the computation of a validation loss, which further added to the difficulty. Consequently, model selection was done solely on the basis of training trends: the convergence of the training loss and the behavior of parameter and gradient norms.
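The monitored quantities can be computed with a few lines of PyTorch; the following is a generic sketch (not the authors' code), where `model` is any `nn.Module` and the function is called after `loss.backward()`:

```python
import torch

def training_norms(model):
    """Global L2 norms of the parameters and of their gradients."""
    param_norm = torch.norm(
        torch.stack([p.detach().norm(2) for p in model.parameters()])
    )
    grad_norm = torch.norm(
        torch.stack([p.grad.detach().norm(2)
                     for p in model.parameters() if p.grad is not None])
    )
    return param_norm.item(), grad_norm.item()

# Logged every few steps: a plateauing parameter norm together with a stable,
# non-exploding gradient norm was read as a sign of healthy convergence.
```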
Training trends:
The checkpoint at training step 250,000 was selected as the best balance between stable, effective training and the mitigation of overfitting.
The motions generated by the trained MDM model show a high degree of expressiveness and anatomical correctness. In addition, for any input action the model generates a motion appropriate to that action, showing that it can distinguish between action categories. On the other hand, some motions present slight collision issues or ambiguity caused by the lack of hand and facial estimation. Moreover, the generated motions are very similar to dataset motions, hinting that overfitting is likely present. Attempts were made to address this, but because the authors' code could not compute relevant evaluation metrics or a validation loss, and due to time constraints, the results were left as they were.
Example motion expressing sadness (visual representation was coded by me; joints and joint rotations were generated by the model):
sadness.mp4
As previously mentioned, MDM could not compute metrics indicative of the model's actual performance when trained on the proposed dataset. The metric results were nonetheless kept, for the possibility of future exploratory analysis.
Despite the potential presence of overfitting, the benefits of MDM trained on the proposed dataset are still significant. For instance, it can be treated as an "indexed motion library" featuring fast lookup, which can be plugged into the desired application to generate random realistic motions, with a data footprint of just under 100 megabytes, as opposed to storing gigabytes of raw clips and having to sift through those. In addition, the .obj SMPL files generated by the model are supported by popular 3D graphics tools such as Blender and Maya.
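To illustrate why the .obj output is convenient: a Wavefront OBJ file is plain text that Blender and Maya import directly. A minimal sketch of writing one frame's mesh, assuming `vertices` and `faces` arrays are available from the body model (the shapes below match the standard SMPL mesh):

```python
import numpy as np

def save_obj(path, vertices, faces):
    """Write a mesh as Wavefront OBJ: 'v' lines, then 1-indexed 'f' lines."""
    with open(path, "w") as f:
        for x, y, z in vertices:                # (V, 3) float array
            f.write(f"v {x:.6f} {y:.6f} {z:.6f}\n")
        for a, b, c in faces + 1:               # OBJ face indices are 1-based
            f.write(f"f {a} {b} {c}\n")

# Hypothetical example: one frame of an SMPL mesh (6,890 vertices, 13,776 faces).
vertices = np.zeros((6890, 3))
faces = np.zeros((13776, 3), dtype=int)
save_obj("frame_0000.obj", vertices, faces)
```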
Future directions include:
- Implementing, or awaiting the emergence of, a method capable of whole-body motion generation.
- Combining HybrIK-X with an even more accurate mesh recovery method, such as SMPLify-X, for the fast recovery of even more expressive meshes.
- Expanding the dataset with even more diverse motions, or with new, difficult emotion categories, such as more ambiguous or subtle ones like anxiety or contempt, adding at least 200,000 more frames.
Limitations:
- The lack of available methods for the generation of whole-body motions, which restricted the scope of the thesis to body-only generation.
- The small size of the proposed dataset, which made the generative model more prone to overfitting.
- The dependence of MDM's evaluation on RNN models that had to be pretrained on the training dataset, which made it difficult to perform model evaluation.
The parts of this work which belong to me are licensed under a Creative Commons Attribution 4.0 International License.