The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language. NAACL 2024.
See kws_example.ipynb and forced_alignment_example.ipynb for comprehensive examples.
git clone https://github.com/lingjzhu/clap-ipa
cd clap-ipa
pip install .
For CLAP-IPA
from clap.encoders import *
import torch
import torch.nn.functional as F
from transformers import DebertaV2Tokenizer, AutoProcessor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
speech_encoder = SpeechEncoder.from_pretrained('anyspeech/clap-ipa-tiny-speech')
phone_encoder = PhoneEncoder.from_pretrained('anyspeech/clap-ipa-tiny-phone')
phone_encoder.eval().to(device)
speech_encoder.eval().to(device)
tokenizer = DebertaV2Tokenizer.from_pretrained('charsiu/IPATokenizer')
processor = AutoProcessor.from_pretrained('openai/whisper-tiny')
audio_input = processor(some_audio)
ipa_input = tokenizer(some_ipa_string)
with torch.no_grad():
    speech_embed = speech_encoder(audio_input)
    phone_embed = phone_encoder(ipa_input)
similarity = F.cosine_similarity(speech_embed, phone_embed, dim=-1)
For IPA-Aligner, the example usage is in forced_alignment_example.ipynb.
The full forced-alignment evaluation code is in evaluate/eval_boundary.py.
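The similarity score from the CLAP-IPA example above can drive a simple keyword-spotting decision. The following is a minimal sketch in plain Python, not the code from kws_example.ipynb: the helper names `cosine_similarity` and `rank_keywords`, the toy 3-dimensional vectors, and the `threshold` value are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_keywords(speech_embed, keyword_embeds, threshold=0.5):
    """Return (ipa, score) pairs at or above `threshold`, best match first.

    `keyword_embeds` maps an IPA string to its phone embedding.
    `threshold` is a hypothetical cut-off; tune it on held-out data.
    """
    scored = [
        (ipa, cosine_similarity(speech_embed, embed))
        for ipa, embed in keyword_embeds.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(ipa, score) for ipa, score in scored if score >= threshold]


# Toy vectors standing in for real speech and phone encoder outputs.
speech = [1.0, 0.0, 0.0]
keywords = {"kæt": [0.9, 0.1, 0.0], "dɔɡ": [0.0, 1.0, 0.0]}
hits = rank_keywords(speech, keywords)
```

With real models, the vectors would come from `speech_encoder` and `phone_encoder`, and the threshold would need to be chosen per language and model size.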
For training, download the data from the Hugging Face Hub. Sample train/val filelists are available in data/.
python train.py -c config/clap_ipa/base.yaml
Evaluation code is available in evaluate. Each evaluation script has almost the same organization, so you can simply pass the .ckpt checkpoint produced by training to evaluate its performance. Please check the evaluation code for usage.
python evaluate_fieldwork.py --data ucla --checkpoint "last.ckpt"
Weights are released under MIT License.
| Model | Phone encoder | Speech encoder |
|---|---|---|
| CLAP-IPA-tiny | anyspeech/clap-ipa-tiny-phone | anyspeech/clap-ipa-tiny-speech |
| CLAP-IPA-base | anyspeech/clap-ipa-base-phone | anyspeech/clap-ipa-base-speech |
| CLAP-IPA-small | anyspeech/clap-ipa-small-phone | anyspeech/clap-ipa-small-speech |
| IPA-Aligner-tiny | anyspeech/ipa-align-tiny-phone | anyspeech/ipa-align-tiny-speech |
| IPA-Aligner-base | anyspeech/ipa-align-base-phone | anyspeech/ipa-align-base-speech |
| IPA-Aligner-small | anyspeech/ipa-align-small-phone | anyspeech/ipa-align-base-speech |
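To switch between model sizes without editing several strings, the checkpoint pairs can be kept in a small lookup. This is only a convenience sketch: `CHECKPOINTS` and `checkpoint_ids` are hypothetical names, not part of the `clap` package, and only the CLAP-IPA rows of the table are copied here. The IPA-Aligner rows could be added the same way.

```python
# Checkpoint IDs copied from the CLAP-IPA rows of the table above.
CHECKPOINTS = {
    "CLAP-IPA-tiny": ("anyspeech/clap-ipa-tiny-phone", "anyspeech/clap-ipa-tiny-speech"),
    "CLAP-IPA-base": ("anyspeech/clap-ipa-base-phone", "anyspeech/clap-ipa-base-speech"),
    "CLAP-IPA-small": ("anyspeech/clap-ipa-small-phone", "anyspeech/clap-ipa-small-speech"),
}


def checkpoint_ids(model_name):
    """Return the phone and speech checkpoint IDs for a model name."""
    try:
        phone_id, speech_id = CHECKPOINTS[model_name]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None
    return {"phone": phone_id, "speech": speech_id}


ids = checkpoint_ids("CLAP-IPA-tiny")
```

The returned strings can then be passed to `PhoneEncoder.from_pretrained` and `SpeechEncoder.from_pretrained` as in the CLAP-IPA usage example.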
All datasets are distributed as WebDataset (wds) files on the Hugging Face Hub.
- FLEURS-IPA: https://huggingface.co/datasets/anyspeech/fleurs_ipa
- MSWC-IPA: https://huggingface.co/datasets/anyspeech/mswc_ipa
- DORECO-IPA: https://huggingface.co/datasets/anyspeech/doreco_ipa
After this study, we found that these datasets still contain inconsistent Unicode encodings of IPA symbols.
A cleaner version will be released when we finish another round of data cleaning.
The clean data (~1.8TB) and the models trained on clean data are available at: https://github.com/lingjzhu/zipa
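To illustrate what inconsistent Unicode encoding of IPA symbols can look like, the sketch below shows two strings that render identically but compare unequal until they are normalized. It is not the authors' cleaning procedure: `normalize_ipa` and the `LOOKALIKES` table are hypothetical, and the two look-alike mappings (ASCII `g` to IPA `ɡ`, ASCII `:` to the length mark `ː`) are examples only.

```python
import unicodedata

# Illustrative look-alike substitutions; Unicode normalization alone does
# not unify these, so they need an explicit mapping. Not an exhaustive list.
LOOKALIKES = str.maketrans({"g": "\u0261", ":": "\u02d0"})


def normalize_ipa(text, form="NFD"):
    """Apply Unicode normalization, then map look-alike characters."""
    return unicodedata.normalize(form, text).translate(LOOKALIKES)


precomposed = "\u00e3"   # "ã" as a single code point
decomposed = "a\u0303"   # "a" followed by a combining tilde

same_before = precomposed == decomposed
same_after = normalize_ipa(precomposed) == normalize_ipa(decomposed)
cleaned = normalize_ipa("g\u00e3:")
```

Here `same_before` is `False` and `same_after` is `True`. Whichever normalization form is chosen, applying the same function to both the transcripts and any query strings keeps the tokenizer input consistent.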
from huggingface_hub import snapshot_download
# Download the full dataset repository into a local folder
snapshot_download(
    repo_id="anyspeech/fleurs_ipa", repo_type="dataset", local_dir="your_own_folder",
    local_dir_use_symlinks=False, resume_download=False, max_workers=4,
)
import webdataset as wds  # Note the typical import shorthand
dataset = (
    wds.WebDataset("data-archives/shard-00{00..24}.tar")  # 25 shards (brace ranges use two dots)
    .decode()  # Automagically decode files
    .shuffle(1000)  # Shuffle on-the-fly in a buffer of 1000 samples
    .batched(10)  # Create batches of 10 samples
)
@inproceedings{zhu-etal-2024-taste,
title = "The taste of {IPA}: Towards open-vocabulary keyword spotting and forced alignment in any language",
author = "Zhu, Jian and
Yang, Changbing and
Samir, Farhan and
Islam, Jahurul",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.43/",
doi = "10.18653/v1/2024.naacl-long.43",
pages = "750--772"
}