AFTER is a diffusion-based generative model that creates new audio by blending two sources: one audio stream to set the style or timbre, and another input (either audio or MIDI) to shape the structure over time.
This repository is a real-time implementation of the research paper Combining audio control and style transfer using latent diffusion (read it here) by Nils Demerlé, P. Esling, G. Doras, and D. Genova. Some transfer examples can be found on the project webpage. This real-time version integrates with MaxMSP and Ableton Live through nn_tilde, an external that embeds PyTorch models into MaxMSP.
You can find pretrained models and Max patches for real-time inference in the last section of this page.
```bash
git clone https://github.com/acids-ircam/AFTER.git
cd AFTER/
pip install -e .
```

If you want to use the model in MaxMSP or PureData for real-time generation, please refer to the nn_tilde external documentation and follow its installation steps.
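Before launching a training run, it can be worth checking that PyTorch was installed with GPU support, since the training commands below accept a `--gpu` index. This is a generic environment check, not part of the AFTER CLI:

```python
import torch

# Generic environment check (not part of the AFTER CLI):
# confirm that PyTorch is installed and that a CUDA device is visible.
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```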
Training AFTER involves three separate steps: autoencoder training, diffusion model training, and model export.
If you already have a streamable audio codec, such as a pretrained RAVE model, you can skip directly to the next section. We also provide four audio codecs pretrained on different datasets here.
Before training the autoencoder, you need to preprocess your audio files into an LMDB database:
```bash
after prepare_dataset --input_path /audio/folder --output_path /dataset/path_audio --save_waveform True
```

Then, you can start the autoencoder training:
```bash
after train_autoencoder --name AE_model_name --db_path /dataset/path_audio --config baseAE --gpu 0
```

where `db_path` refers to the prepared dataset location. The TensorBoard logs and checkpoints are saved by default to `./autoencoder_runs/`.
After training, the model has to be exported to a TorchScript file using

```bash
after export_autoencoder --model_path autoencoder_runs/AE_model_name
```

This will save two .ts files in the run folder, one for streaming and one for offline inference (`export_stream.ts` and `export.ts`, respectively). By default, the checkpoint from the last training step is used for export.
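As a quick offline sanity check, the export can be loaded directly in Python with TorchScript. The snippet below is a minimal sketch that assumes the offline export exposes `encode` and `decode` methods (as is conventional for nn_tilde-style exports); the paths and test file are placeholders.

```python
import torch
import torchaudio

# Load the offline autoencoder export (path is a placeholder).
ae = torch.jit.load("autoencoder_runs/AE_model_name/export.ts").eval()

# Load a test file and reshape to (batch, channels, samples).
# Resample to the training sample rate (44100 by default) if needed.
wav, sr = torchaudio.load("test.wav")
x = wav.mean(0, keepdim=True).unsqueeze(0)

with torch.no_grad():
    z = ae.encode(x)   # latent sequence, assuming an encode() method is exposed
    y = ae.decode(z)   # reconstruction, assuming a decode() method is exposed

print("latents:", tuple(z.shape), "reconstruction:", tuple(y.shape))
```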
First, you need to prepare your dataset. Since our diffusion model works in the latent space of the autoencoder, we pre-compute the latent embeddings to speed up training:
```bash
after prepare_dataset --input_path /audio/folder --output_path /dataset/path_latent_codes --emb_model_path AE_model_run_path/export.ts
```

- The `num_signal` flag sets the duration of the training audio chunks in number of samples, which must be a power of 2 (default: 524288, ~11 seconds).
- The `sample_rate` flag sets the resampling rate (default: 44100).
- The `gpu` flag sets the device used for computing the embeddings; use -1 for CPU (default: 0).
To train a MIDI-to-audio AFTER model, you need to either use the flag `--basic_pitch_midi` to transcribe MIDI from the audio files, or define your own file parsing function in `./after/dataset/parsers.py` (a hypothetical example is sketched below).
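If your dataset already ships with aligned MIDI files, a custom parser can simply collect matching (audio, MIDI) pairs. The function below is a hypothetical illustration only; check the existing parsers in `./after/dataset/parsers.py` for the exact signature and return format the training pipeline expects.

```python
from pathlib import Path

def paired_midi_parser(audio_root: str):
    """Hypothetical parser: pair every .wav file with a same-named .mid file."""
    audio_files, midi_files = [], []
    for wav in sorted(Path(audio_root).rglob("*.wav")):
        mid = wav.with_suffix(".mid")
        if mid.exists():
            audio_files.append(str(wav))
            midi_files.append(str(mid))
    return audio_files, midi_files
```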
If you plan a more advanced use of the models, please refer to the help function for the full list of arguments.
Then, a training is started with

```bash
after train --name diff_model_name --db_path /dataset/path_latent_codes --emb_model_path AE_model_run_path/export.ts --config CONFIG_NAME
```

Different configurations are available in `diffusion/configs` and can be combined:
| Category | Config | Description |
|---|---|---|
| Model | base | Default audio-to-audio timbre and structure separation model. |
| | midi | Uses MIDI as input for the structure encoder. |
| Additional | tiny | Reduces the model's capacity for faster inference. Useful for testing and low-resource environments. |
| | cycle | Experimental: adds a cycle consistency phase during training, which can improve timbre and structure disentanglement. |
The TensorBoard logs and checkpoints are saved to `/diffusion/runs/model_name`, and you can experiment with your trained model using the notebooks `notebooks/audio_to_audio_demo.ipynb` and `notebooks/midi_to_audio_demo.ipynb`.
Once training is complete, you can export the model to an nn_tilde TorchScript file for inference in MaxMSP, PureData, or Ableton Live.
For an audio-to-audio model:

```bash
after export --model_path diff_model_name --emb_model_path AE_model_run_path/export_stream.ts
```

For a MIDI-to-audio model:

```bash
after export_midi --model_path diff_model_name --emb_model_path AE_model_run_path/export_stream.ts
```

Make sure to use the streaming version of the exported autoencoder (denoted by the `_stream.ts` suffix).
You can experiment with inference in MaxMSP using the patches in ./patchs and the pretrained models available here.
We also provide two Max4Live devices to use your models in Ableton Live. By default, the export script trains a small network that remaps the timbre latent space to a 2D map, which can be used for latent exploration in our Max4Live device (see below). If you use multiple datasets, each dataset corresponds to one color on the latent map. The 2D map provides coarse latent control, which you can refine by directly changing the latent dimensions. Make sure to download the .ts file along with the .png latent map created by the export script.
AFTER has been applied in several projects:
- The Call by Holly Herndon and Mat Dryhurst, an interactive sound installation with singing voice transfer, at Serpentine Gallery in London until February 2, 2025.
- A live performance by French electronic artist Canblaster for Forum Studio Session at IRCAM. The full concert is available on YouTube.
- Nature Manifesto, an immersive sound installation by Björk and Robin Meier, at Centre Pompidou in Paris from November 20 to December 9, 2024.
We look forward to seeing new projects and creative uses of AFTER.