Source code for: **Expressive Speech-driven Facial Animation with controllable emotions**
Supplementary Video: https://1drv.ms/v/s!AomgXFHJMxuGlDrVh5mwV1z_FFYv?e=bBcc4C
This repository is currently under construction.
Update:
- initial update (done, 6.18)
- rewrite framework (done, 6.30)
- test inference module (done, 6.30)
- test training module (done, 7.1)
- check fitting algorithms
- rewrite scheduler
- add deployment scripts
If you want to train our model, you need to download the CREMA-D, LRS2, and VOCASET datasets. Put the dataset folders under `datasets` with the following structure:
```
datasets
    CREMA-D
        AudioWAV
        VideoFlash
        ...
    LRS2
        mvlrs_v1
        train.txt
        val.txt
        ...
    VOCASET
        audio
        FaceTalk_XXXX_XXXX_TA
        ...
```
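As a quick sanity check before caching, the layout above can be verified with a short script; the helper below is illustrative and not part of this repository.

```python
# Minimal layout check for the dataset structure above.
# The helper name is illustrative, not part of this repository.
from pathlib import Path

EXPECTED = {
    "CREMA-D": ["AudioWAV", "VideoFlash"],
    "LRS2": ["mvlrs_v1", "train.txt", "val.txt"],
    "VOCASET": ["audio"],
}

def missing_dataset_entries(root="datasets"):
    root = Path(root)
    return [
        str(root / dataset / entry)
        for dataset, entries in EXPECTED.items()
        for entry in entries
        if not (root / dataset / entry).exists()
    ]

if __name__ == "__main__":
    missing = missing_dataset_entries()
    print("Missing entries:", missing or "none")
```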
To get the FLAME code dict for each sample, I use EMOCA to fit the 2D datasets (CREMA-D and LRS2) and a separate fitting algorithm to fit the 3D dataset (VOCASET). Uncomment the line under `make_dataset_cache` to fit the 2D datasets (the fitting algorithm for VOCASET is not available yet). Results can be viewed in the `datasets/cache` folder.
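For a quick look at what the fitting step produces, here is a minimal sketch that inspects one cached sample, assuming the cache stores one pickled dict per sample (the real format may differ, e.g. `.npz` or `.pt`):

```python
# Illustrative inspection of a cached FLAME code dict in datasets/cache.
# The per-sample file format (a pickled dict, *.pkl) is an assumption.
import pickle
from pathlib import Path

sample_file = next(Path("datasets/cache").glob("**/*.pkl"), None)
if sample_file is not None:
    with open(sample_file, "rb") as f:
        code_dict = pickle.load(f)
    # Expected FLAME codes (see the "Collect function" section below):
    # shapecode (100), expcode (50), posecode (6), lightcode, cam (3), texcode.
    for key, value in code_dict.items():
        print(key, getattr(value, "shape", type(value)))
```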
I use several third-party packages for training and inference; all of them live under the `third_party` folder. I provide a refactored version of EMOCA called EMOCABasic. EMOCABasic rewrites the model interface and removes code this model does not need, but the checkpoint files and model weights are unchanged.
For training, wav2vec2, DAN, and EMOCABasic are required. For inference (running the model in inference mode), wav2vec2 and EMOCABasic are needed. For the baseline tests, the additional packages Faceformer and VOCA are required.
File structure under the `third_party` folder:

```
third_party
    DAN
    EMOCABasic
    Faceformer
    VOCA
    wav2vec2
```
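A small, hypothetical helper that mirrors the dependency list above and reports which required third-party folders are missing for a given mode:

```python
# Hypothetical helper reflecting the dependency list above: check that the
# third-party packages needed for a given mode are present under third_party.
from pathlib import Path

REQUIRED = {
    "train": ["wav2vec2", "DAN", "EMOCABasic"],
    "inference": ["wav2vec2", "EMOCABasic"],
    "baseline": ["wav2vec2", "EMOCABasic", "Faceformer", "VOCA"],
}

def missing_packages(mode, root="third_party"):
    return [name for name in REQUIRED[mode] if not (Path(root) / name).is_dir()]

print(missing_packages("inference"))  # [] when everything is in place
```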
Training steps (note: the training code is not available yet):
- prepare the datasets and third-party packages following the instructions above
- run the fitting algorithm for VOCASET
- run the dataset caching scripts (uncomment the line under `make_dataset_cache` in `main.sh`)
- adjust `device` in `config/global.yml`
- adjust `train_minibatch` in `config/trainer.yml`; change `model_name` if you want to train another model (a config-editing sketch follows this list)
- back in the project folder, run `python -u main.py --mode train`
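The config adjustments above can also be scripted; a minimal sketch, assuming `device`, `train_minibatch`, and `model_name` are top-level keys (the actual YAML layout may differ):

```python
# Sketch of the config tweaks listed above, assuming the keys sit at the top
# level of each YAML file; the real files may nest them differently.
import yaml

def set_key(path, key, value):
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    cfg[key] = value
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f)

set_key("config/global.yml", "device", "cuda:0")          # training device
set_key("config/trainer.yml", "train_minibatch", 16)      # example batch size
set_key("config/trainer.yml", "model_name", "tf_emo_4")   # model to train
```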
Read this before running inference
I have added a fusion mode to the inference scripts. It is currently only available in aud-cls=XXX mode (HAP, DIS, FEA, etc.; see the inference configurations section). This method can disentangle lip movement and expression in the output. It is essentially a trick, but I found it works well. Since lip movement and expression can now be generated separately, the design of the original model is outdated. Only inference mode will remain available until I update the model. If you want to see the output of fusion mode, set the sample configuration to aud-cls=FEA (or another label); the rightmost result is the fusion output.
- download the model weights and third-party models from this link: https://1drv.ms/f/s!AomgXFHJMxuGk2jrH5UqrbYe5mTY?e=8VG12t
- put `date-Nov-/.../.pth` into `tf_emo_4/saved_model`; `emoca_basic.ckpt` is the same as the original EMOCA model weights (only the name is changed), check MD5: 06a8d6bf2d9373ac2280a1bc7cf1acb4 (a checksum sketch follows this list)
- make sure that all files in `config/global.yml` are under the correct paths
- place input files for inference at `inference/input`
- change the sample configs in `config/inference.yml/infer_dataset`; add custom input files and inference configs for each input
- back in the project folder, run `python -u inference.py --mode inference`
- get the inference output at `inference/output`
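A small sketch for the MD5 check mentioned above, assuming `emoca_basic.ckpt` sits in the current directory (adjust the path to wherever you placed it):

```python
# Verify the downloaded checkpoint against the MD5 listed above.
# The file path is an assumption; point it at wherever you saved emoca_basic.ckpt.
import hashlib

EXPECTED_MD5 = "06a8d6bf2d9373ac2280a1bc7cf1acb4"

def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert md5sum("emoca_basic.ckpt") == EXPECTED_MD5, "checksum mismatch, re-download the file"
```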
Collect function output:
- `wav`: audio data, tensor, shape (batch, wav_len)
- `seqs_len`: video frame length of each sequence, LongTensor, shape (batch,)
- `params`: concat(expression code, pose code), tensor, shape (batch, seq_len)
- `emo_logits`: DAN output, tensor, shape (batch, 7)
- `code_dict`: dict containing the FLAME codes
    - `shapecode`: shape code, tensor, (batch, 100)
    - `expcode`: expression code, tensor, (batch, 50)
    - `posecode`: pose (head rotation, jaw angle), tensor, (batch, 6)
    - `lightcode`: code for light rendering, uses `default_light_code`
    - `cam`: camera position, (batch, 3)
    - `texcode`: texture code for rendering, generated by the EMOCA encoder
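For reference, a dummy batch with these shapes can be built as follows; the trailing dimension of `params` is my assumption about how the expression and pose codes are stacked per frame:

```python
# Dummy batch with the shapes documented above (random values, illustrative only).
import torch

batch, wav_len, seq_len = 2, 16000, 30

collated = {
    "wav": torch.randn(batch, wav_len),
    "seqs_len": torch.full((batch,), seq_len, dtype=torch.long),
    # concat(expression code, pose code); the trailing 56 = 50 + 6 dimension
    # is an assumption about how the codes are stacked per frame
    "params": torch.randn(batch, seq_len, 56),
    "emo_logits": torch.randn(batch, 7),
    "code_dict": {
        "shapecode": torch.randn(batch, 100),
        "expcode": torch.randn(batch, 50),
        "posecode": torch.randn(batch, 6),
        "cam": torch.randn(batch, 3),
    },
}
```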
EMOCABasic decoding results:
- `verts`: mesh sequence decoded by EMOCA from `code_dict`, (batch, 5663)
- `output_images_coarse` and `predicted_images`: rendered images (grey or with texture)
- `geometry_coarse` and `emoca_imgs`: rendered grey face images without lighting
- `trans_verts`: intermediate data used in the decoding process
- `masks`: `False` for background areas in the rendered images
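As an example of how these fields fit together, a hedged sketch that keeps only foreground pixels using `masks` (the decoder call itself is omitted; `decode_out` stands for whatever dict the decoder returns):

```python
# Illustrative use of the fields above: zero out background pixels in the
# rendered frames. `decode_out` is a placeholder for the decoder's output dict.
import torch

def foreground_only(decode_out: dict) -> torch.Tensor:
    imgs = decode_out["predicted_images"]          # rendered frames
    masks = decode_out["masks"].to(imgs.dtype)     # False marks background
    return imgs * masks
```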
Model input configurations:
- `name`: sample names, list
- `smooth`: enable output smoothing, bool
- `emo_label`: emotion control vector, used to add sequence-level emotion intensity
- `intensity`: custom intensity for the emotion in `emo_label`
- `emo_logits_conf`: str, controls model behavior:
    - `use`: predict emotion from audio without any change
    - `no_use`: generate model output without emotion
    - `one_hot`: adjust emotion based on `emo_label` and `intensity`
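An illustrative sample configuration written as a Python dict; the actual entries live in `config/inference.yml`, and the one-hot form of `emo_label` is my assumption:

```python
# One sample's input configuration as a Python dict; values are illustrative,
# not taken from the repository.
sample_config = {
    "name": ["demo_sample"],
    "smooth": True,
    "emo_label": [0, 0, 1, 0, 0, 0, 0],  # emotion control vector (assumed one-hot)
    "intensity": 0.8,                    # intensity applied to emo_label
    "emo_logits_conf": "one_hot",        # "use" | "no_use" | "one_hot"
}
```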
Dataset fields:
- `imgs`: original cropped images from the 2D datasets
- `path`: fitting output path for a VOCASET sample
- `wav_path`: wav path for a VOCASET sample
- `flame_template`: template path for each sample in Baseline VOCASET
- `verts`: original vertex data for each sample in Baseline VOCASET
`inference.yml/sample_configs` provides different configs to control model behavior. Some tags can be used independently, such as `video`, `audio`, `emo-ist`, and `emo-cls`. Other tags should only be used in specific situations, such as `-tex` and `=HAP`.
- `video`: result order: [original video, EMOCA decoding output, `emo_logits_conf=use`, speech driven (emotion predicted from audio), no emotion]
- `audio`: result order: [speech driven, no emotion]
- `emo-cls`: generate videos with specific emotions: ['NEU', 'ANG', 'HAP', 'SAD', 'DIS', 'FEA']
- `aud-cls=XXX`: result order: [model output with emotion XXX enhancement, Faceformer output, no-emotion output]
- `emo-ist`: generate videos with varying emotions and intensities; see `emo_ist` in `sample_configs`
Test loss on VOCASET:

| Models | Max loss (mm) | Average loss (mm) |
|---|---|---|
| random init | 3.87 | 2.31 |
| Faceformer | 3.33 | 1.97 |
| VOCA | 3.41 | 1.94 |
| tf_emo_4 (mouth loss coeff = 0.5) | 3.24 | 1.92 |
| tf_emo_5 (jaw loss coeff = 0) | 3.22 | 1.92 |
| tf_emo_8 (few mask + LRS2) | 3.36 | 2.01 |
| (conf A) tf_emo_2 (mouth loss coeff = 0) | 3.39 | 2.01 |
| (conf B) tf_emo_6 (add params mask) | 3.29 | 1.97 |
| (conf C) tf_emo_3 (reduce noise) | 3.32 | 1.99 |
| (conf D) tf_emo_7 (add params mask and introduce the LRS2 dataset) | 3.34 | 1.99 |
| (conf E) use only VOCASET | 3.30 | 1.97 |
| (conf F) tf_emo_10 (disable transformer encoder) | 3.38 | 2.03 |
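For context on the metrics, a hedged sketch of how per-vertex errors in millimetres could be computed; the exact definitions used for this table may differ from the repository's evaluation code:

```python
# Hedged sketch of per-vertex error metrics in millimetres, matching the table
# columns in spirit; "max" and "average" may be defined differently in the
# actual evaluation scripts (e.g. max over frames vs. over sequences).
import torch

def vertex_errors_mm(pred: torch.Tensor, gt: torch.Tensor, scale_to_mm: float = 1000.0):
    """pred, gt: (frames, num_verts, 3) meshes, assumed to be in metres."""
    dist = torch.linalg.norm(pred - gt, dim=-1) * scale_to_mm   # (frames, num_verts)
    max_loss = dist.max(dim=-1).values.mean().item()            # mean per-frame max error
    avg_loss = dist.mean().item()                               # mean over all vertices/frames
    return max_loss, avg_loss
```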