ACE-Step fork
- Separate data preprocessing (music and text encoding) and training
- Enable gradient checkpointing
- Cast everything to bf16
Now I can run the training on a single RTX 3080 with < 10 GB VRAM at about 0.3 it/s, using music duration < 360 seconds and LoRA rank = 64.
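As a rough illustration of the last two points, here is a minimal sketch of how the transformer can be cast to bf16 and wrapped with gradient checkpointing. The attribute name `transformer` and the method `enable_gradient_checkpointing()` follow this codebase; the helper function itself is only illustrative.

```python
import torch

def prepare_for_low_vram_training(pipeline):
    """Hedged sketch: cast the diffusion transformer to bf16 and enable
    gradient checkpointing so activations are recomputed in the backward pass."""
    # bf16 roughly halves the memory taken by weights and activations
    pipeline.transformer = pipeline.transformer.to(torch.bfloat16)
    # Trade extra compute for a large reduction in activation memory
    pipeline.transformer.enable_gradient_checkpointing()
    return pipeline
```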
I've trained some LoRAs at https://huggingface.co/woctordho/ACE-Step-v1-LoRA-collection
- Collect some audios, for example, in the directory `C:\data\audio`.

- Generate prompts using Qwen2.5-Omni-7B:
  ```
  python generate_prompts_lyrics.py --data_dir C:\data\audio
  ```
  Each prompt is a list of tags separated by comma and space (`, `), without a trailing newline. The order of the tags is randomly shuffled during training. (TODO: Check how natural language prompts affect the performance.)

  (Experimental) The above script uses gptqmodel. Alternatively, you can use llama.cpp:
  Start llama-server (by default it listens on 127.0.0.1, port 8080):

  ```
  llama-server -m Qwen2.5-Omni-7B-Q8_0.gguf --mmproj mmproj-Qwen2.5-Omni-7B-Q8_0.gguf -c 32768 -fa -ngl 999 --cache-reuse 256
  ```
  Then run:

  ```
  python generate_prompts_lyrics_llamacpp.py --data_dir C:\data\audio
  ```

  After this step, you can shut down llama-server to save VRAM.
  Unfortunately, llama.cpp does not yet reproduce the original model accurately enough, so the tags may be inaccurate and lyrics transcription barely works at all.

  (Experimental) You can also generate lyrics:
  ```
  python generate_prompts_lyrics.py --data_dir C:\data\audio --lyrics
  ```
  It seems Qwen2.5-Omni-7B works well for Chinese lyrics, but not so well for English and other languages.

  Besides using an AI model to transcribe lyrics, you can also extract lyrics embedded in the audio file, or query online databases such as 163MusicLyrics, LyricsGenius, LyricWiki. You may try ace-data_tool.
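  For the embedded-lyrics route, here is a minimal sketch using mutagen, assuming the lyrics are stored in ID3 USLT frames (typical for MP3 files); other formats keep lyrics under different tag names, so treat this only as a starting point:

  ```python
  from pathlib import Path

  from mutagen.id3 import ID3  # pip install mutagen

  data_dir = Path(r"C:\data\audio")
  for mp3 in data_dir.glob("*.mp3"):
      try:
          tags = ID3(mp3)
      except Exception:
          continue  # no ID3 tag in this file
      # USLT frames hold unsynchronized lyrics; there may be several (one per language)
      uslt_frames = tags.getall("USLT")
      if uslt_frames:
          mp3.with_name(mp3.stem + "_lyrics.txt").write_text(
              uslt_frames[0].text, encoding="utf-8"
          )
  ```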
  For music without vocals, just use `[instrumental]` for the lyrics.

  At this point, the directory `C:\data\audio` should look like:

  ```
  audio1.wav
  audio1_lyrics.txt
  audio1_prompt.txt
  audio2.mp3
  audio2_lyrics.txt
  audio2_prompt.txt
  ...
  ```
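  Before moving on, it can help to verify that every audio file has both sidecar files and that the prompts parse as comma-separated tags. A small, purely illustrative check (the `_prompt.txt` / `_lyrics.txt` naming follows the layout above):

  ```python
  from pathlib import Path

  data_dir = Path(r"C:\data\audio")
  audio_exts = {".wav", ".mp3", ".flac", ".ogg"}  # assumption: adjust to your data

  for audio in sorted(p for p in data_dir.iterdir() if p.suffix.lower() in audio_exts):
      prompt_file = audio.with_name(audio.stem + "_prompt.txt")
      lyrics_file = audio.with_name(audio.stem + "_lyrics.txt")
      if not prompt_file.exists() or not lyrics_file.exists():
          print(f"missing sidecar files for {audio.name}")
          continue
      # The prompt is a single line of tags separated by ", "
      tags = prompt_file.read_text(encoding="utf-8").strip().split(", ")
      if not all(tags):
          print(f"empty tag in {prompt_file.name}")
  ```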
- Create a dataset that only contains the filenames, not the audio data:

  ```
  python convert2hf_dataset_new.py --data_dir C:\data\audio --output_name C:\data\audio_filenames
  ```
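  If you want to build or inspect such a dataset yourself, a rough equivalent with the Hugging Face `datasets` library might look like the following; the column name `filename` is an assumption, not necessarily what `convert2hf_dataset_new.py` uses:

  ```python
  from pathlib import Path

  from datasets import Dataset  # pip install datasets

  data_dir = Path(r"C:\data\audio")
  audio_exts = {".wav", ".mp3", ".flac", ".ogg"}
  filenames = sorted(str(p) for p in data_dir.iterdir() if p.suffix.lower() in audio_exts)

  # Only paths are stored, so this dataset stays tiny regardless of the audio size
  ds = Dataset.from_dict({"filename": filenames})
  ds.save_to_disk(r"C:\data\audio_filenames")
  print(ds)
  ```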
- Load the audios, do the preprocessing, save to a new dataset:

  ```
  python preprocess_dataset_new.py --input_name C:\data\audio_filenames --output_dir C:\data\audio_prep
  ```
  The preprocessed dataset takes ~0.2 MB for every second of input audio (so roughly 7 GB for 10 hours of music).

  TODO: If you have a lot of training data and want to reduce the disk space requirement, we can add a switch to move MERT and mHuBERT from preprocessing to training.
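  Assuming the output is a Hugging Face dataset saved to disk (which the `--output_dir` / `--dataset_path` pair suggests, but treat this as an assumption), you can inspect its columns and on-disk size like this:

  ```python
  from pathlib import Path

  from datasets import load_from_disk

  prep_dir = Path(r"C:\data\audio_prep")
  ds = load_from_disk(str(prep_dir))
  print(ds)  # column names and number of samples

  size_bytes = sum(f.stat().st_size for f in prep_dir.rglob("*") if f.is_file())
  print(f"on-disk size: {size_bytes / 1e9:.2f} GB")
  ```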
- Do the training:

  ```
  python trainer_new.py --dataset_path C:\data\audio_prep
  ```
  The LoRA will be saved to the directory `checkpoints`. Make sure to clear this directory before training, otherwise the LoRA may not be saved correctly.

  If you have a lot of VRAM, you can remove `self.transformer.enable_gradient_checkpointing()` for faster training.

  My script uses Wandb rather than TensorBoard. If you don't need it, you can remove the `WandbLogger` (see the sketch below).
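  A hedged sketch of that change, assuming `trainer_new.py` builds a standard PyTorch Lightning `Trainer` (adjust the imports to `pytorch_lightning` if that is what the script uses):

  ```python
  import lightning as L
  from lightning.pytorch.loggers import TensorBoardLogger

  # Replace the WandbLogger with a TensorBoard logger...
  logger = TensorBoardLogger(save_dir="logs", name="acestep_lora")
  trainer = L.Trainer(logger=logger)

  # ...or disable logging entirely:
  # trainer = L.Trainer(logger=False)
  ```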
- LoRA strength: At this point, when loading the LoRA in ComfyUI, you need to set the LoRA strength to `alpha / sqrt(rank)` (for rsLoRA) or `alpha / rank` (for non-rsLoRA). For example, if rank = 64, alpha = 1, and rsLoRA is enabled, then the LoRA strength should be `1 / sqrt(64) = 0.125`.

  To avoid setting this manually, you can run:

  ```
  python add_alpha_in_lora.py --input_name checkpoints/epoch=0-step=100_lora/pytorch_lora_weights.safetensors --output_name out.safetensors --lora_config_path config/lora_config_transformer_only.json
  ```

  Then load `out.safetensors` in ComfyUI and set the LoRA strength to 1.
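  If you are unsure which rank ended up in the saved file, you can read it back and compute the strength yourself. A small sketch using safetensors; the `lora_A` / `lora_down` key names are common conventions rather than a guarantee for every exporter, and `alpha = 1` is just the value from the example above:

  ```python
  import math

  from safetensors.torch import load_file

  state_dict = load_file("checkpoints/epoch=0-step=100_lora/pytorch_lora_weights.safetensors")

  # The rank is the smaller dimension of any down-projection ("A") matrix
  rank = next(
      min(tensor.shape)
      for key, tensor in state_dict.items()
      if "lora_A" in key or "lora_down" in key
  )
  alpha = 1  # assumption: the alpha from your LoRA config
  print("rank:", rank)
  print("rsLoRA strength:", alpha / math.sqrt(rank))
  print("non-rsLoRA strength:", alpha / rank)
  ```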
- If you don't have experience, you can first try to train with a single audio and make sure that it can be overfitted. This is a sanity check of the training pipeline.
- You can freeze the lyrics decoder and only train the transformer using `config/lora_config_transformer_only.json`. I think training the lyrics decoder is needed only when adding a new language.
- In the LoRA config, you can add `"projectors.0.0", "projectors.0.2", "projectors.0.4", "projectors.1.0", "projectors.1.2", "projectors.1.4"` to `target_modules`. This may help the model learn the music style.
- When using an Adam-like optimizer (including AdamW and Prodigy), you should not let `1 - beta2` be much smaller than `1 / max_steps` (see the sketch after this list).
- When using the Prodigy optimizer, make sure that `d` rises to a large value (such as 1e-4, which should be much larger than the initial 1e-6) after roughly `1 / (1 - beta2)` steps.
- After training, you can prune the LoRA using SVD, for example with `resize_lora.py` in Kohya's sd-scripts. If the dynamic pruning tells you that the LoRA rank can be much smaller without changing the output quality, then next time you can train the LoRA with a smaller rank.
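A worked example of those last two rules of thumb; the numbers are hypothetical and only illustrate the relationship between `beta2`, `max_steps`, and Prodigy's warm-up:

```python
max_steps = 2000  # hypothetical training length
beta2 = 0.999     # typical Adam/Prodigy default

# Adam's second-moment EMA averages over roughly 1 / (1 - beta2) steps.
ema_horizon = 1 / (1 - beta2)  # ~1000 steps
print("EMA horizon:", ema_horizon)

# Rule of thumb: 1 - beta2 should not be much smaller than 1 / max_steps.
print("1 - beta2 =", 1 - beta2, "vs 1 / max_steps =", 1 / max_steps)  # ~0.001 vs 0.0005 -> OK

# For Prodigy, expect its step-size estimate `d` to have grown well above the
# initial 1e-6 (e.g. to ~1e-4) after about `ema_horizon` steps; if it has not,
# the run is probably too short or the adaptation is stuck.
```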
- Support batch size > 1, maybe bucketing samples with similar lengths
- How to normalize the audio loudness before preprocessing? It seems the audios generated by ACE-Step usually have loudness around -16 to -12 LUFS, and they don't follow prompts like 'loud' and 'quiet' (see the sketch after this list)
- To generate the tags, maybe a specialized tagger can perform better than Qwen2.5-Omni-7B, such as OpenJMLA, GLAP, MuFun
- The statistics of the tags used to train the base model are shared on Discord
- When an audio is cropped because it's too long, also crop the lyrics
- I would not include BPM in the AI-generated tags, because traditional methods detect BPM much more accurately than AI. Also, to control the BPM of the generated audio, I guess the model would adhere to a control net more reliably than to the prompt, similar to the Canny control net for images.
- Use prodigy-plus-schedule-free
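Regarding the loudness question above, one possible approach (not part of this repo) is to normalize every file to a fixed target before preprocessing, for example with pyloudnorm and soundfile; the -14 LUFS target is an arbitrary choice:

```python
from pathlib import Path

import pyloudnorm as pyln   # pip install pyloudnorm
import soundfile as sf      # pip install soundfile

data_dir = Path(r"C:\data\audio")
target_lufs = -14.0  # arbitrary target; pick whatever matches your dataset

for wav in data_dir.glob("*.wav"):
    data, rate = sf.read(wav)
    meter = pyln.Meter(rate)                     # BS.1770 loudness meter
    loudness = meter.integrated_loudness(data)   # current loudness in LUFS
    normalized = pyln.normalize.loudness(data, loudness, target_lufs)
    sf.write(wav.with_name(wav.stem + "_norm.wav"), normalized, rate)
```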