X-ARES (pronounced /eks ˈeəz/) is an audio encoder benchmark that automatically evaluates user-provided encoders with task-specific output heads (a trained MLP and kNN). It is heavily inspired by the HEAR benchmark.

Users provide a single pretrained audio encoder that outputs frame-level embeddings. The embeddings are evaluated with a trained MLP layer on clip- and frame-level tasks. In addition, a non-parametric kNN classifier is used to evaluate the quality of the embeddings without any training. For specialized tasks, pre-trained decoders are incorporated as task-specific components.

X-ARES covers the following tasks:

- ASV2015
- CREMA-D
- Fluent Speech Commands
- LibriCount
- LibriSpeech-ASR
- LibriSpeech-Male-Female
- RAVDESS
- Speech Commands V2
- speechocean762
- VocalSound
- VoxCeleb1
- VoxLingua107
- Clotho
- DESED
- ESC-50
- FSD18-Kaggle
- FSD50k
- UrbanSound 8k
- Finger snap sound1
- Inside/outside car1
- Key scratching car1
- LiveEnv sounds1
- Subway broadcast1
- FMA
- GTZAN Genre
- MAESTRO
- NSynth

Install X-ARES with uv or pip:

    # Recommended: using uv
    uv sync --extra examples
    # Or via pip
    pip install xares[examples]

For development, clone and install in editable mode:

    git clone <this-repo>
    cd xares
    uv sync --extra examples  # or: pip install -e .[examples]

Run the benchmark with the baseline encoder (Dasheng) on all tasks:

    uv run xares --max-jobs 8 example/dasheng/dasheng_encoder.py src/tasks/*.py

Or run specific tasks:

    uv run xares example/dasheng/dasheng_encoder.py src/tasks/esc50_task.py src/tasks/gtzan_task.py

Datasets are downloaded automatically from Zenodo. If that fails, use `tools/download_manually.sh`.

`python -m xares.run` also works as an alternative to `uv run xares`.

X-ARES provides two evaluation methods to assess the quality of audio representations: MLP (Linear Fine-Tuning) and kNN (Unparameterized Evaluation).
MLP: Linear Fine-Tuning on Task-Specific Data. A linear layer is trained on the embeddings produced by the user's encoder, using predefined hyperparameters for each task. Because the encoder parameters stay fixed, this method measures how well the pre-trained representations can be adapted to new, task-specific contexts by training only the additional layer.
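
The linear-probe idea described above can be illustrated with a minimal, framework-free sketch. It is not the X-ARES training code: the toy 2-D "embeddings", the binary labels, the learning rate, the epoch count, and the helper names `train_linear_probe` and `predict` are all assumptions chosen for illustration. The only point is that the embeddings are treated as fixed inputs while a single linear layer is fitted on top of them.

```python
import math

# Toy frozen "embeddings" (2-D) with binary labels. Illustrative values only;
# real encoders produce much higher-dimensional embeddings.
EMBEDDINGS = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
LABELS = [0, 0, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(embeddings, labels, lr=0.5, epochs=1000):
    """Fit one linear layer (weights + bias) with full-batch gradient descent
    on the logistic loss. The embeddings themselves are never modified."""
    dim = len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(embeddings)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += error * x[i] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for a single embedding."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


weights, bias = train_linear_probe(EMBEDDINGS, LABELS)
accuracy = sum(predict(weights, bias, x) == y
               for x, y in zip(EMBEDDINGS, LABELS)) / len(LABELS)
print(f"train accuracy: {accuracy:.2f}")
```

In practice the probe would be trained on one split and scored on a held-out split; the sketch reports training accuracy only to keep the example short.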
kNN: Unparameterized Evaluation. Pre-trained model embeddings are used directly for k-nearest neighbor (kNN) classification, without any training. This method evaluates the inherent quality of the audio representations without fine-tuning. While it may not yield the highest performance in real-world applications, it is a rigorous test of the representational power of the embeddings: with no parameterized layers involved, the score reflects how well the encoder itself captures the essential features of the audio data.
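
As a rough illustration of the kNN evaluation described above, the sketch below classifies a query embedding by majority vote among its nearest training embeddings. The distance metric (Euclidean), the value `k=3`, the toy data, and the function name `knn_predict` are assumptions made for this example; the settings X-ARES actually uses are not specified in this section.

```python
import math
from collections import Counter


def knn_predict(train_embeddings, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training embeddings
    (Euclidean distance). No parameters are trained."""
    distances = sorted(
        (math.dist(embedding, query), label)
        for embedding, label in zip(train_embeddings, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings" with made-up labels, for illustration only.
train_embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                    [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
train_labels = ["speech", "speech", "speech", "music", "music", "music"]

print(knn_predict(train_embeddings, train_labels, [0.1, 0.1]))
print(knn_predict(train_embeddings, train_labels, [0.9, 1.0]))
```

Because nothing is fitted, the result depends only on the geometry of the embedding space, which is why this kind of evaluation is a direct probe of the encoder itself.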
Below are the evaluation results for several baseline models using the MLP and kNN methods. The weighted average is computed using the test set size of each dataset as the weight.

MLP results:

| Task | dasheng | wav2vec2 | whisper | data2vec |
|---|---|---|---|---|
| ASV2015 | 0.964 | 0.924 | 0.966 | 0.937 |
| Clotho | 0.029 | 0.014 | 0.038 | 0.008 |
| CREMA-D | 0.767 | 0.541 | 0.572 | 0.523 |
| DESED | 0.537 | 0.313 | 0.127 | 0.136 |
| ESC-50 | 0.857 | 0.510 | 0.528 | 0.229 |
| Fluent Speech Commands | 0.946 | 0.468 | 0.776 | 0.978 |
| Free Music Archive Small | 0.643 | 0.469 | 0.581 | 0.334 |
| FSD50k | 0.409 | 0.166 | 0.262 | 0.085 |
| FSD18-Kaggle | 0.534 | 0.241 | 0.241 | 0.153 |
| GTZAN Genre | 0.851 | 0.630 | 0.622 | 0.448 |
| LibriCount | 0.681 | 0.583 | 0.549 | 0.492 |
| LibriSpeech-100h | 0.608 (0.724 v0.1.3) | 0.405 (0.545 v0.1.3) | 0.721 (0.815 v0.1.3) | 0.893 (0.866 v0.1.3) |
| LibriSpeech-MF | 0.986 | 0.948 | 0.973 | 0.752 |
| MAESTRO | 0.524 | 0.180 | 0.011 | 0.116 |
| NSynth-Instruments | 0.688 | 0.443 | 0.532 | 0.336 |
| RAVDESS | 0.749 | 0.442 | 0.459 | 0.467 |
| Speech Commands V1 | 0.969 | 0.714 | 0.933 | 0.927 |
| UrbanSound 8k | 0.833 | 0.659 | 0.687 | 0.426 |
| Vocal Imitation | 0.253 | 0.147 | 0.180 | 0.128 |
| VocalSound | 0.910 | 0.768 | 0.860 | 0.803 |
| VoxCeleb1 | 0.780 | 0.340 | 0.388 | 0.105 |
| VoxLingua33 | 0.814 | 0.553 | 0.873 | 0.620 |
| Key scratching car1 | 0.999 | 0.983 | 0.985 | 0.909 |
| Finger snap sound1 | 0.870 | 0.872 | 0.861 | 0.808 |
| Inside/outside car1 | 0.972 | 0.928 | 0.866 | 0.869 |
| Live Env 1 | 0.986 | 0.955 | 0.887 | 0.759 |
| Subway broadcast1 | 0.972 | 0.930 | 0.942 | 0.869 |
| Weighted Average (public tasks) | 0.699 | 0.490 | 0.632 | 0.598 |
| Weighted Average (all tasks) | 0.801 | 0.664 | 0.740 | 0.694 |

kNN results:

| Task | dasheng | wav2vec2 | whisper | data2vec |
|---|---|---|---|---|
| ASV2015 | 0.869 | 0.858 | 0.843 | 0.942 |
| CREMA-D | 0.380 | 0.221 | 0.372 | 0.351 |
| ESC-50 | 0.618 | 0.081 | 0.191 | 0.040 |
| Fluent Speech Commands | 0.260 | 0.017 | 0.032 | 0.630 |
| Free Music Archive Small | 0.592 | 0.251 | 0.406 | 0.106 |
| GTZAN Genre | 0.758 | 0.303 | 0.350 | 0.108 |
| LibriCount | 0.311 | 0.235 | 0.246 | 0.176 |
| LibriSpeech-MF | 0.791 | 0.606 | 0.617 | 0.724 |
| NSynth-Instruments | 0.499 | 0.251 | 0.205 | 0.179 |
| RAVDESS | 0.408 | 0.169 | 0.296 | 0.313 |
| Speech Commands V1 | 0.903 | 0.208 | 0.096 | 0.852 |
| UrbanSound 8k | 0.662 | 0.339 | 0.215 | 0.156 |
| Vocal Imitation | 0.107 | 0.010 | 0.016 | 0.018 |
| VocalSound | 0.382 | 0.269 | 0.405 | 0.308 |
| VoxCeleb1 | 0.262 | 0.003 | 0.010 | 0.033 |
| VoxLingua33 | 0.376 | 0.034 | 0.360 | 0.058 |
| Key scratching car1 | 0.955 | 0.923 | 0.691 | 0.550 |
| Finger snap sound1 | 0.848 | 0.787 | 0.401 | 0.461 |
| Inside/outside car1 | 0.798 | 0.575 | 0.730 | 0.588 |
| Subway broadcast1 | 0.949 | 0.533 | 0.884 | 0.530 |
| Weighted Average (public tasks) | 0.504 | 0.262 | 0.299 | 0.388 |
| Weighted Average (all tasks) | 0.683 | 0.469 | 0.475 | 0.455 |
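
The weighted averages in the tables weight each task's score by its test set size. A minimal sketch of that computation follows; the task names, scores, and test set sizes are hypothetical placeholders rather than benchmark values, and `weighted_average` is a helper written for this example, not part of the X-ARES API.

```python
def weighted_average(scores, test_sizes):
    """Average per-task scores, weighting each task by its test set size."""
    total = sum(test_sizes[task] for task in scores)
    return sum(scores[task] * test_sizes[task] for task in scores) / total


# Hypothetical scores and test set sizes, not taken from the benchmark.
scores = {"task_a": 0.5, "task_b": 1.0}
test_sizes = {"task_a": 100, "task_b": 300}

print(round(weighted_average(scores, test_sizes), 3))  # 0.875
```

A task with a larger test set therefore pulls the average toward its own score, which is why the result here sits closer to `task_b` than a plain mean of 0.75 would.
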

Examples of audio encoder wrappers can be found in the `example/` directory, where the baseline encoders are implemented.

We provide a check function to verify that the encoder is correctly implemented:

    >>> from xares.audio_encoder_checker import check_audio_encoder
    >>> encoder = YourEncoder()
    >>> check_audio_encoder(encoder)
    True

Then run the benchmark with your own encoder:

    uv run xares --max-jobs 8 your_encoder.py src/tasks/*.py

Be sure that your encoder supports variable-length inference on up to 10 minutes of audio. We recommend simply chunking the input audio inside your encoder to mitigate out-of-memory issues, for example:

    import torch


    class MyCustomEncoder(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.sampling_rate = 16000
            self.output_dim = 512
            self.hop_size_in_ms = 10
            self.model = my_model_implementation()  # placeholder for your own model
            # Only needed if the model itself does not implement chunking:
            # process at most 10 seconds of audio per forward pass.
            self.custom_max_audio_length = int(self.sampling_rate * 10)

        def forward(self, audio: torch.Tensor):
            # Accept a single waveform by adding a batch dimension.
            if audio.ndim == 1:
                audio = audio.unsqueeze(0)
            self.model.eval()
            with torch.inference_mode():
                if audio.shape[-1] > self.custom_max_audio_length:
                    embeds = []
                    for chunk in audio.split(self.custom_max_audio_length, dim=-1):
                        # Pad a short trailing chunk to at least one second.
                        if chunk.shape[-1] < self.sampling_rate:
                            chunk = torch.nn.functional.pad(
                                chunk, (0, self.sampling_rate - chunk.shape[-1]))
                        embed = self.model(chunk)
                        embeds.append(embed)
                    # Concatenate the per-chunk embeddings along dim 1.
                    encoded_audio = torch.cat(embeds, dim=1)
                else:
                    encoded_audio = self.model(audio)
            return encoded_audio

Adding a new task is easy: create a `TaskConfig` tailored to your chosen dataset. Refer to the existing task implementations for guidance.

@inproceedings{zhang2025xares,
title={X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance},
author={Zhang, Junbo and Dinkel, Heinrich and Niu, Yadong and Liu, Chenyu and Cheng, Si and Zhao, Anbei and Luan, Jian},
booktitle={Interspeech 2025},
year={2025}
}