
Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision

Zhijun Jia* (Nanjing University, Nanjing, China) jiazhijun@smail.nju.edu.cn
Huaying Xue (Microsoft Research Asia, Beijing, China) huxue@microsoft.com
Xiulian Peng (Microsoft Research Asia, Beijing, China) xipe@microsoft.com
Yan Lu (Microsoft Research Asia, Beijing, China) yanlu@microsoft.com

* The work was done at Microsoft during the internship of Zhijun Jia.

arXiv:2408.10096v2 [cs.SD] 22 Aug 2024
Abstract
The scarcity of parallel data is the key challenge of the accent conversion (AC) problem, in which both the pronunciation units and the prosody pattern need to be converted. We propose a two-stage generative framework, "convert-and-speak", in which the conversion is operated only at the semantic token level and the speech is synthesized conditioned on the converted semantic tokens with a speech generative model trained in the target accent domain. This decoupling design enables the "speaking" module to use massive amounts of target-accent speech and relieves the parallel-data requirement of the "conversion" module. Converting through the bridge of semantic tokens also removes the need for text transcriptions and unlocks language pre-training technology to further reduce the amount of parallel accent speech data required. To reduce the complexity and latency of "speaking", a single-stage AR generative model is designed to achieve good quality at lower computation cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data that is not constrained to the same speaker. Extensive experimentation with diverse accent types suggests that the framework is highly adaptable and readily scalable to other accents with low-resource data. Audio samples are available at https://www.microsoft.com/en-us/research/project/convert-and-speak-zero-shot-accent-conversion-with-minimumsupervision/.

CCS Concepts
• Computing methodologies → Speech recognition.

Keywords
accent conversion, generative model, speech synthesis

ACM Reference Format:
Zhijun Jia, Huaying Xue, Xiulian Peng, and Yan Lu. 2024. Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), October 28–November 1, 2024, Melbourne, VIC, Australia. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3664647.3681539

MM '24, October 28–November 1, 2024, Melbourne, VIC, Australia. © 2024 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0686-8/24/10. https://doi.org/10.1145/3664647.3681539

1 Introduction
Accent creates a barrier to understanding when speakers with different accents have a conversation. Accent conversion aims to break this barrier by making a source-accent speaker sound like a target-accent speaker, changing the pronunciation pattern and prosody while preserving the linguistic content and the speaker's own identity. The problem is challenging because accent affects speech in many aspects such as intonation, rhythm and pronunciation patterns [8]. Take Indian-English as an example: speakers may pronounce 'v' as 'w' or vice versa, 'th' as 't' or 'd', and 'p' as 'b'. Besides such differences in pronunciation units, the prosody, e.g. intonation and stress, also changes considerably with the accent. Another key challenge is the lack of parallel data: strictly parallel data, with one speaker speaking the same sentence in two different accents, barely exists in the public research domain.

Some early studies [8] try to explicitly model the pronunciation patterns by building accent-specific dictionaries that include all possible pronunciations of every word for a given accent type. These methods adapt poorly because their assumptions, namely that phonetic knowledge about every accent is available and can be described thoroughly in a dictionary, and that all speakers can be categorized into a few accent clusters, hardly hold in real scenarios.

One class of conventional AC methods [5, 13, 29, 31] simplifies the accent conversion problem to converting the voice of a target-accent speaker into that of the source-accent speaker, assuming an utterance with the same content in the target accent is available. This kind of method only needs to extract the speaker identity from the source speech, without disentangling content from prosody, which makes accent conversion easier to achieve. However, such methods are hard to use in real applications, since a target-accent reference is rarely available at conversion time.

Therefore, reference-free AC methods are more practical and appealing.

Some previous approaches [17, 30] try to learn the acoustic mapping between source-accent speech and target-accent speech directly from parallel data in which the same speaker speaks the same content in two different accents. However, such data is extremely rare, so the main idea of this kind of method is to use voice conversion (VC) technology [14, 24] to synthesize the dataset by converting the speaker identity of the target-accent speech to that of the source speaker. Such end-to-end mapping-based methods require large amounts of strictly parallel data to achieve good conversion quality and generalization ability. However, such massive high-quality data can hardly be obtained, and distortions are introduced by the VC stage even when the VC models have been fine-tuned on the target AC dataset.

To relieve the dependence on parallel data, another kind of approach [9, 15, 33] leverages disentanglement technology to separate accent from content, speaker identity and prosody, and resynthesizes the target waveform through a synthesizer, e.g. a text-to-speech (TTS) model [21]. The synthesizer is trained on target-accent speech to generate speech with target-accent prosody. To remove accent from content and speaker identity, auxiliary models or tasks are carefully designed, e.g. an accent-agnostic automatic speech recognition (ASR) model or a phoneme classification task. Text transcriptions are heavily used in these solutions to supervise the accent-agnostic semantic representation. Such two-stage mapping-based methods still require large amounts of (text, accent speech) pairs combined with dedicated auxiliary tasks to obtain accent-agnostic semantic features and generate diverse speech in the target accent.

In this work, we propose a two-stage generative framework with a conversion stage and a speaking stage to achieve accent conversion. The conversion stage operates at the semantic level by generating target-accent semantic tokens from source-accent ones. The speaking stage uses a generative synthesis model conditioned on the converted semantic tokens to generate speech with target-accent prosody. Splitting the AC task into these two sub-tasks, and bridging them with semantic tokens extracted from raw speech, enables the "speaking" module to be independent of parallel data and to use massive amounts of target-accent speech without text transcriptions to generate speech with good quality and diversity in the target-accent domain. Meanwhile, it allows the conversion part to learn only the pronunciation-pattern/phoneme differences from a small amount of weakly parallel data that is not constrained to the same speaker.

Both stages are seq2seq tasks based on decoder-only Transformer architectures [22]. Inspired by ideas from the machine translation community for reducing the need for supervision, we leverage BART/T5-style pre-training [12] to significantly reduce the amount of parallel supervision required to train the conversion part. This pre-training with a pretext task on target-accent data is designed to learn the pattern of generating semantic units, e.g. the joint probability of phonetic units in the target-accent space. Furthermore, to reduce the complexity and latency of the speech generation process, we design a single-stage autoregressive generative model that generates all vector quantizers (VQs) in one step based on TF-Codec [10].

Our contributions: (i) We propose a state-of-the-art generative framework for accent conversion that converts the prosody pattern as well as the pronunciation units, evaluated with objective and subjective metrics on a public Indian-English to general American-English accent conversion test set. (ii) We use pre-training technology on the conversion part to reduce the amount of parallel supervision to only 15 minutes of weakly parallel data. (iii) The framework can be easily extended to other low-resource accents, as our experiments on Chinese-English and Korean-English accents show. (iv) We propose a single-stage speech generative model based on TF-Codec with better speech quality and speaker similarity at lower computation cost and latency than the multi-stage generation process of other generative models based on EnCodec (proposed: 50 AR steps per second of audio vs. EnCodec-based: 75 AR steps + 7 NAR steps per second of audio).

The paper is organized as follows. Section 2 introduces the background of accent conversion and speech generative models. Section 3 introduces the proposed framework in detail. Section 4 validates the performance of the proposed framework, mainly on Indian-English to general American-English accent conversion, with extensive experiments to substantiate the efficacy of our model design. Section 5 concludes the paper.

2 Background

2.1 Accent conversion
For the accent conversion task there is no public parallel corpus containing pairs of audios with the same content spoken by the same speakers in different accents, so mainly two kinds of methods have been proposed in the literature. One is to synthesize a dataset containing pairs of audios in the same voice but in two different accents using a separate voice conversion model, and then learn the acoustic mapping between them. [30] builds a golden-speaker utterance by converting the general American-English speaker's voice to the source-accent speaker's with a pretrained synthesizer of the source speaker, and then uses this golden-speaker utterance as the target to learn a mel-spectrogram mapping with a seq2seq VC system. [17] uses a pretrained VC model to build the parallel data and trains the AC model based on Tacotron [21], conditioned on semantic representations extracted from wav2vec 2.0 [1]. This end-to-end mapping-based approach needs large amounts of data to achieve good zero-shot ability, and the auxiliary VC model usually needs to be fine-tuned on the AC dataset to alleviate errors introduced by the voice conversion step. These methods also constrain the output to have the same length as the input, which limits conversion quality, since the prosody of the speech is strongly affected by the accent.

Another approach regards accent conversion as a decomposition-and-resynthesis task in which the accent is separated from content, speaker identity and prosody, and the target waveform is resynthesized in a TTS manner. [15] disentangles the different features in multiple stages with several off-the-shelf models. Specifically, an accent-robust ASR model is trained on source-accent speech with text labels to separate the source accent from the content, and a multi-speaker TTS model with a global speaker encoder is trained on a large corpus of target-accent speech to map the accent-agnostic linguistic features to acoustic features with the voice of the source speaker and target-accent prosody. Similarly, [33] learns the semantic embeddings directly from accent speech with supervision from text embeddings extracted from text labels.

Such two-stage mapping-based non-parallel AC approaches relieve the burden on parallel data but still need large amounts of (text, accent speech) data and dedicated auxiliary tasks. In terms of performance, these methods are not good enough in conversion quality and diversity, and they suffer from poor zero-shot generalization given the limited high-quality data available; the same accent speaker is used in their training and testing. Another work [9] treats the decomposition and resynthesis in an end-to-end manner. It designs a Pseudo Siamese Disentanglement Network (PSDN) with two streams, in which one stream learns the acoustic features of the target-accent speech and the other, auxiliary stream builds an information gap with the target stream to disentangle content from accent, complemented with an adversarial accent classifier with a gradient reversal layer (GRL). This framework can be used in the zero-shot scenario, but its performance is unclear, since the demo page is out of date and the metrics in the paper cannot be compared with other public works.

Compared to prior AC approaches, the proposed generative framework requires neither large amounts of parallel data spoken by the same speaker nor supervision from text labels or auxiliary tasks to achieve good conversion quality.

2.2 Speech generative models
Recently, speech generative models have shown large potential in generating contextually consistent, natural and diverse audio/speech based on a neural speech codec, with in-context learning from a reference prompt. AudioLM [2], designed for zero-shot audio generation, uses SoundStream [27] codes as the intermediate representation of acoustic features for speech synthesis. It also shows a strong ability of in-context learning with a short prompt to maintain acoustic information such as speaker identity, prosody style and acoustic environment in the continuations. VALL-E [23], verified on the TTS task and based on EnCodec [4] tokens, has also shown better zero-shot ability, speech naturalness and diversity than non-generative TTS models.

For the model structure, these models use the decoder-only autoregressive Transformer [22] to build the conditional correlation between the acoustic features tokenized by a neural speech codec and the semantic tokens/phoneme sequences. When generating codes, they usually need multiple stages, since the neural codecs they use are built on residual vector quantization (RVQ), which consists of a hierarchy of VQs. In AudioLM, the first several quantizer layers are predicted in a first stage to obtain the coarse information of the speech, and the remaining layers are predicted based on the coarse layers to obtain the fine details. VALL-E simplifies the generating process by replacing the second stage with a non-autoregressive (NAR) one: the first quantizer is generated with the AR model and the others with the NAR model (all frames are predicted simultaneously when predicting each codebook), conditioned on the previous quantizers.

Different from these existing generative models, we propose a one-stage AR generative model that generates all VQs in one step, achieving lower complexity and latency as well as better quality.

3 Proposed framework

3.1 Overview
In the proposed framework, as shown in Figure 1, the pronunciation patterns are converted at the discrete semantic token level and the prosody is re-synthesized in the target accent with a speech generative model. Specifically, we use a pre-trained self-supervised speech representation model, e.g. HuBERT [6], to extract discrete semantic tokens. A neural-codec-based speech generative model is used to generate the acoustic codes of the codec conditioned on the converted semantic tokens, with a style prompt to maintain the source speaker's voice. Both the conversion and generative models are based on the autoregressive decoder-only Transformer structure. More details are discussed in Section 3.2 and Section 3.3.

3.2 Semantic token conversion
The conversion module is designed as a seq2seq task in discrete semantic token space, in which the source-accent semantic tokens are converted to the target-accent semantic tokens iteratively in an autoregressive manner. To handle the shortage of parallel data, inspired by BART and T5 [12, 20], we use large amounts of target-accent data to pre-train the conversion module with a pretext task suited to our scenario. We then fine-tune the conversion module with a small amount of weakly parallel data.

Pre-training. The pretext task is designed to build the probability space of discrete semantic tokens in the target-accent domain, so that target-accent semantic tokens can be generated from the context of previous tokens in the target-accent domain in a closed-loop manner. In this pretext task, the model is trained in a self-supervised manner to produce the original token sequence Y = {y_0, ..., y_t}, t < T, conditioned on the corrupted token sequence Ȳ = {ȳ_0, ..., ȳ_t}, t < T, formulated as

    p(Y | Ȳ; θ_AR) = ∏_{t=0}^{T} p(y_t | y_{<t}, Ȳ; θ_AR)    (1)

We experimented with corruptions such as token masking, token deletion and token in-filling, and found that the token in-filling scheme works best. Specifically, following the text in-filling scheme in BART [12], a number of spans are sampled according to Bernoulli(p) with p = 0.5, and the span lengths are drawn from a Poisson distribution (λ = 5). We train the pretext task with large amounts of target-accent data, which is available in public corpora.

Fine-tuning. Since some phonemes in the source accent need to be converted to target-accent ones, a mapping between these phonemes needs to be learned. Specifically, we fine-tune the pre-trained conversion model conditioned on the source-accent semantic tokens with a small amount of weakly parallel accent data. Correspondingly, the training can be formulated as

    p(Y | X; θ_AR) = ∏_{t=0}^{T} p(y_t | y_{<t}, X; θ_AR)    (2)

in which X = {x_0, ..., x_t}, t < T is the source-accent semantic token sequence and Y = {y_0, ..., y_t}, t < T is the target-accent semantic token sequence.
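To make the pretext task concrete, the following is a minimal sketch of the BART-style token in-filling corruption described above, operating on a plain list of discrete semantic-token ids. The span-sampling loop is our own reading of the Bernoulli(0.5)/Poisson(λ = 5) recipe, not the authors' implementation, and zero-length spans from the original BART scheme are omitted for simplicity.

```python
import numpy as np

MASK_ID = -1  # placeholder id for the mask token; the real vocabulary id is an assumption


def infill_corrupt(tokens, span_prob=0.5, poisson_lam=5, rng=None):
    """BART-style token in-filling over a list of discrete semantic-token ids.

    Walking left to right, each position starts a masked span with probability
    `span_prob`; the span length is drawn from Poisson(poisson_lam) and the whole
    span is collapsed into a single MASK_ID, so a model trained with Eq. (1) must
    also recover how many tokens were removed.
    """
    rng = rng or np.random.default_rng()
    corrupted, i = [], 0
    while i < len(tokens):
        if rng.random() < span_prob:
            span_len = max(1, int(rng.poisson(poisson_lam)))
            corrupted.append(MASK_ID)   # the whole span becomes one mask token
            i += span_len               # skip the masked-out tokens
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted


# Pre-training pair for Eq. (1): predict `tokens` from `infill_corrupt(tokens)`.
```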
MM ’24, October 28–November 1, 2024, Melbourne, VIC, Australia. Zhijun Jia, Huaying Xue, Xiulian Peng, and Yan Lu

Figure 1: Proposed framework. The source-accent semantic tokens are converted to target-accent semantic tokens in the first stage, and the speech is generated with target-accent prosody conditioned on the converted semantic tokens in the second stage. The style prompt is extracted from the first 3 seconds of the source speech. A TF-Codec token is the group of concatenated embeddings of all quantizers.
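As a reading aid for Figure 1, the sketch below traces the end-to-end inference flow of the two stages. The five model arguments stand in for the components described in the paper (HuBERT + k-means tokenizer, conversion Transformer, speaking Transformer, TF-Codec), and the method names are our assumptions, not a released API.

```python
def convert_and_speak(source_wav, semantic_tokenizer, conversion_model,
                      speaking_model, tf_codec, prompt_seconds=3, sample_rate=16000):
    """Illustrative end-to-end inference flow of the two-stage framework in Figure 1."""
    # Stage 1 ("convert"): source speech -> discrete semantic tokens -> target-accent tokens.
    src_tokens = semantic_tokenizer(source_wav)                  # 50 Hz HuBERT cluster ids
    tgt_tokens = conversion_model.generate(src_tokens, top_k=2)  # AR decoding (k from Sec. 4.1)

    # Stage 2 ("speak"): a style prompt from the first seconds of the source speech keeps
    # the source speaker's voice; acoustic tokens are generated in a single AR stage.
    prompt_codes = tf_codec.encode(source_wav[: prompt_seconds * sample_rate])
    acoustic_codes = speaking_model.generate(tgt_tokens, prompt_codes, top_k=10)

    return tf_codec.decode(acoustic_codes)                       # 16 kHz waveform
```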

3.3 Target accent speech generation
The target-accent speech generation is achieved by training a separate generative model on a large target-accent speech corpus. We design a new speech generative model based on TF-Codec [10]. This model generates the acoustic tokens of TF-Codec iteratively through single-stage causal speech generation, conditioned on the converted/target-accent semantic tokens.

3.3.1 Speech tokenizer with TF-Codec. We use the pre-trained causal speech neural codec TF-Codec to extract the acoustic token of each frame. Unlike [10], we remove the predictive loop and use the non-predictive model at 6 kbps for efficient acoustic modeling with high-quality output. Specifically, TF-Codec takes the 16 kHz magnitude-compressed time-frequency spectrum, with a window length of 20 ms and a hop length of 5 ms, as input. A stack of 2D causal convolutional layers, followed by a temporal convolutional module (TCM) and a gated recurrent unit (GRU) block, captures the short-term and long-term temporal dependencies between input frames in a causal manner. For quantization, it combines 4 frames together, producing a frame rate of 50 Hz. Instead of RVQ, it employs group quantization, where the latent embedding is split into K groups and each group is quantized by a vector quantizer with a codebook of 1024 codewords. All K acoustic codes are concatenated and decoded to get the reconstructed waveform.

3.3.2 Single-stage causal speech generation. As the group quantization in TF-Codec encodes each group independently, we leverage single-stage causal speech generation to generate the acoustic codes of all K quantizers simultaneously for each frame. As shown in Figure 1, the TF-Codec token, which is the concatenation of the embeddings corresponding to all K quantizers, is generated in a one-stage autoregressive manner conditioned on the target-accent/converted semantic tokens and the style acoustic tokens. Each group embedding in the TF-Codec token has dimension D_token/K, in which D_token is the embedding dimension of the transformer. K classification heads are employed to predict the K acoustic codes of the current frame separately. The training target can be formulated as

    p(C | Y, C̃; θ_AR) = ∏_{t=0}^{T} p(c_t | c_{<t}, Y, C̃; θ_AR)    (3)

in which Y = {y_0, ..., y_t}, t < T is the semantic token sequence from the target-accent speech, C is the TF-Codec token sequence of the target-accent speech, and C̃ is the TF-Codec token sequence of the style acoustic prompt. We do not distinguish C̃ from C in training; the concatenation of C̃ and C is treated as one whole sequence. During inference, the first 3 seconds of the source speech is used as C̃.
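A minimal sketch of the K-head prediction described in Section 3.3.2 is given below. It assumes a PyTorch-style decoder whose hidden size D_token is split into K group slices, each feeding its own classification head; whether each head reads only its own slice or the full hidden state is our simplification, and the class name is ours, not the released implementation.

```python
import torch
import torch.nn as nn


class GroupCodeHeads(nn.Module):
    """Predict the K TF-Codec group codes of a frame from one decoder state.

    The decoder hidden state (dimension d_token) is treated as the concatenation of
    K group embeddings of size d_token // K, and each group gets its own head over
    its 1024-entry codebook, so all K codes of a frame come out of a single AR step.
    """

    def __init__(self, d_token=1024, num_groups=16, codebook_size=1024):
        super().__init__()
        assert d_token % num_groups == 0
        self.group_dim = d_token // num_groups
        self.heads = nn.ModuleList(
            [nn.Linear(self.group_dim, codebook_size) for _ in range(num_groups)]
        )

    def forward(self, hidden):                           # hidden: (batch, seq_len, d_token)
        groups = hidden.split(self.group_dim, dim=-1)    # K tensors of (B, T, d_token/K)
        logits = [head(g) for head, g in zip(self.heads, groups)]
        return torch.stack(logits, dim=2)                # (batch, seq_len, K, codebook_size)


# Training sketch for Eq. (3): sum the cross-entropy over the K heads of each frame, e.g.
#   logits = GroupCodeHeads()(decoder_hidden)                     # (B, T, K, 1024)
#   loss = nn.functional.cross_entropy(logits.flatten(0, 2),
#                                      target_codes.flatten())    # target_codes: (B, T, K)
```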

4 Experiments
To evaluate the performance of the proposed framework, we take Indian-English as the source accent and general American-English as the target accent, which is a common scenario in the research literature.

4.1 Experimental Setup
Dataset. To train the speech generative model and pre-train the conversion model, the LibriTTS dataset [28] is used as our training data. It contains approximately 585 hours of general American-English speech sourced from audiobooks of the public LibriVox project. To fine-tune the conversion model, the L1-L2 ARCTIC dataset [11, 32] is used, a dataset in which accented speakers speak the same content. To build the parallel data, we select a general American-English speaker named "bdl" as the target-accent speaker and "ASI" as the Indian-English speaker. Among all their utterances, 1000 utterances (about 50 minutes of speech) are used for training, 50 utterances for validation, and the remaining 100 utterances for testing. To better verify the zero-shot ability, we also add speaker p248 from the VCTK dataset [25] and another 4 Indian-English speakers from L1-L2 ARCTIC that are not used in training into the test set. Note that the 20 utterances from speaker p248 in VCTK are used to compare with the existing machine-learning based AC method [15]. Besides these 20 cases, we add another 20 utterances of each testing speaker in L1-L2 ARCTIC for objective evaluation (120 cases in total), and 8 random utterances per speaker for subjective evaluation (60 cases in total).

Table 1: Evaluation on the VCTK test set (20 cases from speaker p248, as in Liu et al.'s setup [15]). SPK of the accent source is computed on different utterances of the source speaker.

Framework | NISQA-TTS(↑) | MOS-Naturalness(↑) | SPK(↑) | MOS-Accent(↑)
Accent source | 4.60 | 4.52 ± 0.06 | 0.594 | 0.5%
Referenced ground truth | 4.41 | 4.34 ± 0.06 | - | 70.0%
Liu et al. [15] | 3.84 | 3.95 ± 0.07 | 0.168 | 67.6%
Generative-only model (EnCodec) | 3.65 | 3.84 ± 0.08 | 0.429 | 35.1%
Generative-only model (TF-Codec) | 4.10 | 4.00 ± 0.05 | 0.502 | 35.0%
Proposed (conversion + generative) | 4.24 | 4.08 ± 0.06 | 0.408 | 69.3%

Table 2: Evaluation on the L1-L2 ARCTIC test set. LCSR of the ground truth speech is calculated between ground-truth utterances of different speakers.

Framework | NISQA-TTS(↑) | MOS-Naturalness(↑) | SPK(↑) | LCSR(↑) | MOS-Accent(↑)
Accent source | 3.65 | 4.16 ± 0.07 | 0.641 | 0.545 | 0.5%
Referenced ground truth | 3.54 | 4.14 ± 0.08 | - | 0.744 | 79.3%
Generative-only model (EnCodec) | 3.32 | 3.79 ± 0.07 | 0.511 | 0.545 | 35.0%
Generative-only model (TF-Codec) | 3.59 | 3.91 ± 0.06 | 0.543 | 0.545 | 35.2%
Proposed (conversion + generative) | 3.84 | 3.93 ± 0.06 | 0.438 | 0.622 | 74.3%
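For illustration, the snippet below shows one way the weakly parallel fine-tuning pairs described in the Dataset paragraph could be assembled from L1-L2 ARCTIC: the two speakers share ARCTIC sentence IDs, so same-content utterances can be paired and mapped to (source, target) semantic-token sequences. The directory layout and the `tokenize` helper are assumptions for this sketch, not part of any released code.

```python
from pathlib import Path


def build_weakly_parallel_pairs(arctic_root, src_spk="ASI", tgt_spk="bdl", tokenize=None):
    """Pair same-sentence utterances of an Indian-English speaker (src) and a general
    American-English speaker (tgt) by ARCTIC sentence ID, then map each waveform to
    discrete semantic tokens.

    `tokenize` is a callable such as a HuBERT + k-means tokenizer (500 clusters, 50 Hz);
    the layout <arctic_root>/<speaker>/wav/<sentence_id>.wav is assumed.
    """
    src_dir = Path(arctic_root) / src_spk / "wav"
    tgt_dir = Path(arctic_root) / tgt_spk / "wav"
    pairs = []
    for src_path in sorted(src_dir.glob("*.wav")):
        tgt_path = tgt_dir / src_path.name       # same sentence ID, different speaker
        if not tgt_path.exists():
            continue
        pairs.append((tokenize(src_path), tokenize(tgt_path)))
    return pairs  # list of (source-accent tokens, target-accent tokens) for Eq. (2)
```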

Model and configuration. For the semantic tokenizer, we employ the HuBERT-Base model (https://huggingface.co/facebook/hubert-base-ls960) and a k-means algorithm with 500 clusters to extract semantic tokens. It is trained on LibriSpeech [18], which mostly consists of general American-English, and generates a discrete semantic token sequence at a 50 Hz frame rate for 16 kHz audio. Previous studies [7, 19] show that HuBERT is a good representation of speech content and removes most of the speaker identity, so we can use weakly parallel data in the fine-tuning stage of the conversion module. For the acoustic tokenizer, the number of quantizers in TF-Codec (K) is set to 16. The transformer used in the conversion model and the generative model has the same structure: 12 layers with 16 attention heads, a feed-forward layer of dimension 4096, and a dropout rate of 0.1. The embedding dimension in the transformer (D_token) is 1024. The generative model and the pre-training stage of the conversion model are trained on 8 NVIDIA Tesla V100 32GB GPUs with a batch size of 4k tokens per GPU, using the ScaledAdam [26] optimizer. The learning rate is set to 0.01, with a warmup over the first 5k steps and exponential decay. The speech generative model is trained for 500k steps and the conversion model for 100k steps. The fine-tuning stage of the conversion model runs on one NVIDIA Tesla A100 80GB GPU with a batch size of 20k tokens and the same optimizer; the learning rate is set to 2 × 10^-5, with a warmup over the first 160 steps, and fine-tuning runs for 1k steps. During inference, we employ top-k sampling to generate each token, with k = 2 for the conversion model and k = 10 for the speech generative model. For each case, we run inference 5 times and select the output with the best LCSR metric.

Baseline models. To show the superiority of the proposed framework, we select 3 baselines: the existing machine-learning based AC method [15], which to our knowledge is the best publicly available model, and two generative-only models without the conversion module, namely the commonly used EnCodec-based multi-stage generative model and the proposed single-stage TF-Codec based generative model.

Evaluation methods on accent similarity. To evaluate the performance of accent conversion, both objective and subjective metrics are used. Intuitively, we use the Longest Common Subsequence (LCS) to evaluate the similarity between the converted semantic token sequence and the referenced target semantic token sequence. To eliminate the disturbance of per-word duration, repeated consecutive tokens in each sequence are removed. To remove the effect of utterance length on the final average statistics, the Longest Common Subsequence Ratio (LCSR = LCS / utterance_length) is used, with the shorter utterance length of the testing pair in the denominator. We also use the latest state-of-the-art English accent classification model, CommonAccent [34] (https://huggingface.co/Jzuluaga/accent-id-commonaccent_ecapa), to identify the accent of the synthesized speech. In addition, we conduct subjective A/B testing in which participants are asked to choose the sample that sounds closer to general American-English in an A/B pair; each pair contains cases chosen from any two of the competitors, and the accent source and ground truth are also included to ensure the validity of the testing. To remove the potential factor of speaker identity in the subjective testing, in Table 2 both the accent source and the referenced ground truth are chosen from multiple speakers' utterances, e.g. 5 Indian-English speakers and 4 general American-English speakers from the L1-L2 ARCTIC test set; note that the referenced ground truth here is the same sentence spoken by different speakers. The participants are trained to distinguish the accent difference by listening to several pairs of <Indian-English, general American-English> samples before the formal test. 20 participants who are proficient in general American-English are invited to conduct these evaluations. MOS-Accent, the percentage of being selected as general American-English accent, is used as the metric of this subjective test.
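A minimal sketch of the LCSR computation, as we read it from the description above (run-length de-duplication, classic LCS dynamic programming, normalization by the shorter de-duplicated length), is shown below; it is our interpretation, not the authors' evaluation script.

```python
def dedup(tokens):
    """Collapse consecutive repeats so token durations do not affect the metric."""
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out


def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic program for the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcsr(converted_tokens, reference_tokens):
    """LCSR = LCS / utterance_length, using the shorter de-duplicated sequence length."""
    a, b = dedup(converted_tokens), dedup(reference_tokens)
    denom = min(len(a), len(b))
    return lcs_length(a, b) / denom if denom else 0.0
```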

Figure 2: Accent classification results for the VCTK test set, evaluated by CommonAccent.

Figure 3: Accent classification results for the L1-L2 ARCTIC test set, evaluated by CommonAccent.

Evaluation methods on speaker similarity and speech quality. To evaluate speaker identity maintenance, the speaker similarity metric (SPK) is calculated as the cosine similarity of two speaker vectors extracted from the source-accent speech and the converted speech, respectively. WavLM-TDNN [3] (https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification), a state-of-the-art speaker verification model, is used to extract the speaker vector from a speech sample. To evaluate naturalness, we use NISQA-TTS [16], which is commonly used for synthesized speech. We also conduct MOS testing, in which raters give a score from 1 (lowest quality) to 5 (highest quality) according to the overall subjective quality; MOS-Naturalness with a 95% confidence interval is used as the metric.

4.2 Results
Accent similarity. As shown in Table 1 and Table 2, the MOS-Accent metric of the proposed framework ranks highest on both datasets and is very close to the ground truth. Compared with the generative-only models, i.e. Generative-only model (EnCodec) and Generative-only model (TF-Codec), the proposed framework surpasses them by a large margin in accent conversion performance, indicating the effectiveness of the conversion module. This is also verified by the LCSR metric on the L1-L2 ARCTIC test set in Table 2, where the proposed framework closely approaches the ground-truth LCSR after conversion. The analysis of HuBERT with accented input in Section 4.5.1 also shows that HuBERT tokens are affected by the source Indian-English accent, especially for accented speech with phonetic changes, indicating the necessity of the semantic conversion module. Figure 2 and Figure 3 show the accent classification results from CommonAccent. Compared with the other methods, the proposed framework converts most of the Indian-English accent input to the target general American-English accent. In Figure 3, for the remaining 11% of cases identified as Indian-English accent, besides classification error, most cases are short and the conversion quality of some is not good enough, which leaves room for further improvement of the robustness of the generative framework. For Liu et al.'s method [15], it is interesting to find a big gap between the classification metric (as bad as the accent source), shown in Figure 2, and MOS-Accent (relatively good), shown in Table 1. We believe the poor classification result comes from its weak conversion of the pronunciation units: most pronunciation units are not converted well, e.g. the pair ('b', 'p'). For prosody conversion, the quality is relatively good compared with the source accent, which contributes to the high subjective metric. Figure 4 shows an example of the pitch contour and phoneme improvements of the converted speech; the audio samples can be found on our demo page. Overall, the proposed framework achieves much better accent conversion performance, as verified by both the objective classification metric and the subjective metric.

Speech quality and speaker similarity. According to the NISQA-TTS and MOS-Naturalness metrics in Table 1 and Table 2, the proposed framework ranks at the top level. Compared with Liu's method, the proposed framework achieves much better speech quality and speaker similarity; artifacts of Liu's model can be found in some cases, as shown on the demo page. Moreover, we find that better speech quality and speaker similarity can be achieved

with the TF-Codec based generative model, according to the comparison between Generative-only model (TF-Codec) and Generative-only model (EnCodec); this can also be verified by our demo cases. Compared with Generative-only model (TF-Codec), the SPK value of the proposed framework drops a bit, but the subjective judgement on the demo cases is quite similar. We attribute this to error from the speaker vector extractor WavLM-TDNN: since WavLM-TDNN is not trained on accented speech, the extracted vector contains not only the speaker identity but also accent information. With better accent conversion, the speaker vectors of the converted speech and the accent source therefore tend to differ more, resulting in a lower cosine similarity. It is more reasonable to use this metric to compare Generative-only model (EnCodec) with Generative-only model (TF-Codec), and the proposed method with Liu's method, since each of these pairs shares the same presence or absence of accent leakage in the converted speech.

4.3 Efficiency of single-stage causal speech generation
Here we compare the complexity of the proposed single-stage causal speech generation scheme based on TF-Codec with the multi-stage speech generation scheme based on EnCodec. In EnCodec-based generative models, two stages are usually taken: an autoregressive (AR) stage to generate the first quantizer and a NAR stage to generate the remaining quantizers for all time steps based on the previous quantizers. The EnCodec used in the experiment is composed of 8 quantizers with a frame rate of 75 Hz and a sample rate of 24 kHz. The complexity is shown in Table 3 in terms of model parameters and decoding steps. According to Table 3, the TF-Codec based generative model saves more than 50% in model size and takes a purely causal decoding scheme with fewer steps. The overall RTF is 2.1 on an NVIDIA RTX A6000 GPU with a 3 s prompt.

Table 3: Complexity comparison of the speech generative module.

Framework | Model parameters (M) | Decoding steps (/s)
Generative-only model (EnCodec) | 262.3 | 75 AR + 7 NAR
Generative-only model (TF-Codec) | 100.8 | 50 AR

4.4 Training with minimum supervision
In this section, we further reduce the parallel data used in the fine-tuning stage of the conversion model, from 50 minutes (as used in Table 1 and Table 2) to 30 minutes and 15 minutes, respectively. We use MOS-Accent as the evaluation metric, test on the VCTK test set, and also add the baseline models into the A/B testing for comparison. As shown in Table 4, the performance drop from decreasing the data amount is negligible. With the minimum supervision of 15 minutes, the performance is still relatively good, which shows high potential for extension to other accents with low-resource data, such as the Chinese-English and Korean-English to general American-English cases in Section 4.7.

Table 4: Accent conversion quality with parallel data of 50 mins, 30 mins and 15 mins.

Parallel data amount | MOS-Accent(↑)
50 mins | 59.3%
30 mins | 58.1%
15 mins | 56.8%

4.5 Supportive analysis
4.5.1 HuBERT tokens from accent speech. In this section, we evaluate how accent affects the HuBERT tokens. Specifically, we build a parallel data set from the L1-L2 ARCTIC dataset in which the Indian-English speaker and the general American-English speaker speak the same content. Both are fed into the HuBERT model used in this paper to get the semantic token sequences, and the LCSR metric is used to evaluate the content similarity between the two HuBERT token sequences. 1000 pairs are used in this experiment. To further study the effect of phoneme changes, the accent cases are divided into those with and those without phoneme changes. Since the speakers of each pair are different, the effect of the speaker identity on the LCSR metric is also calculated as a reference. As Table 5 shows, with the source accent introduced, the HuBERT tokens change a lot, degrading from 0.747 to 0.569 in terms of LCSR. For cases with specific phoneme changes, even more tokens are changed relative to the target-accent references (LCSR: 0.541).

4.5.2 Accent effect on the style prompt in the speaking module. We use the accent source as the style prompt in the speech generative model to extract the speaker identity of the source speaker. This section evaluates whether the accent feature is propagated through the in-context learning of the speech generative model; we conduct an empirical study to substantiate this usage. Specifically, we design an A/B test to compare the accent similarity of synthesized speech generated with two kinds of prompts in different accents, conditioned on the same content. For testing, we take Indian-English and general American-English as the two prompt types and build 100 pairs of samples. Each pair contains an utterance in general American-English accent used to extract HuBERT semantic tokens, plus one utterance from a general American-English speaker and one from an Indian-English speaker serving as the style prompts; the prompts are cut to 3 seconds. Examples can be found on our demo page. For the subjective test, 20 participants who are college students majoring in American English are asked to distinguish the two synthesized speech samples and choose the one that sounds closer to general American-English; the percentage of being selected as general American-English accent is used as the evaluation metric. We also use CommonAccent to identify the two synthesized speeches. As Table 6 shows, for both the general American-English and the Indian-English prompt types, the percentage selected as general American-English by users is about 50%, and almost all the synthesized speeches are identified as general American-English. Furthermore, we find that accent prompts do affect the prosody modeling, but only to a limited extent: in our experiments on the effect of accent prompt length, as the Indian-English prompt becomes longer, increasing from 3 s to 7 s, the percentage predicted as general American-English accent drops from 84% to 73%. So we can use 3 seconds of the accent source as a prompt to capture the source speaker's identity without bringing the source accent back into the converted speech.
prompt in Indian-English accent becomes longer, increased from 3s

Table 5: HuBERT tokens from accent speech. Lower LCSR means less similarity between source-accent speech and target-accent speech. Source accent: Indian-English. Target accent: general American-English.

Influencing factors | LCSR(↑)
Speaker identity | 0.747
Accent without phoneme changes (w. speaker identity change) | 0.569
Accent with phoneme changes (w. speaker identity change) | 0.541

Table 6: Comparison of the synthesized speech with two accent prompt types: general American-English and Indian-English. The CommonAccent metric shows the percentage predicted as general American-English. The A/B testing metric shows the percentage selected as general American-English in the A/B pair.

Prompt type | CommonAccent | A/B testing
General American-English accent | 98% | 50%
Indian-English accent | 97% | 50%

Figure 4: An example of pitch contour and phoneme improvement. (Content: "It's also very valuable.")

4.6 Ablation Study
Decoupling design. We compare with a solution in which the parallel data is used to fine-tune the generative-only model directly. In this setting, the model is guided to learn the phoneme and prosody conversion simultaneously and blindly through an AR Transformer model. The experiments show that the LCSR of the generative-only model without decoupling on the L1-L2 ARCTIC test set drops sharply to 0.02 (0.622 for the proposed), indicating that most of the content has been destroyed and that the model fails to learn such a mapping with so little parallel data.

The effect of pre-training for the semantic conversion module. To verify the validity of the language pre-training technology used for the semantic conversion module, we compare it with a solution where the conversion model is trained from scratch with the weakly parallel data. All parallel data (about 50 mins) are used for training. Without pre-training, the results degrade by a large margin to 0.103 in LCSR (0.622 for the proposed). This is reasonable, since the pre-training stage lays a good foundation that lets the fine-tuning stage focus only on learning the few semantic units that differ between the two accents.

4.7 Extensions to more source accent types
More source accents, such as Chinese-English and Korean-English, are supported in our experiments with 15 minutes of parallel data. The conversion accuracy in the CommonAccent metric is 95% for the Chinese-English accent and 100% for the Korean-English accent. Audio samples can be found on our demo page.

5 Conclusions
In this work, we propose a two-stage generative framework for the accent conversion task, in which the conversion is operated at the semantic token level and the speech is synthesized in the target accent with a TF-Codec based generative model. Experimental results show that the proposed framework achieves state-of-the-art performance in terms of accent similarity, speech quality and speaker maintenance with limited parallel data. With language pre-training technology, only 15 minutes of parallel data, not constrained to the same speaker, reaches a good conversion quality, which shows large potential for easy extension to other accents with low-resource data. The proposed single-stage AR generative model achieves better speech quality at lower complexity and can be used for other speech generative tasks. In the future, we will further improve the robustness of the generative framework for the AC task.

References
[1] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449–12460.
[2] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2023. AudioLM: A Language Modeling Approach to Audio Generation. IEEE/ACM Trans. Audio Speech Lang. Process. 31 (2023), 2523–2533.
[3] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 16, 6 (2022), 1505–1518.
[4] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High Fidelity Neural Audio Compression. CoRR abs/2210.13438 (2022).
[5] Shaojin Ding, Guanlong Zhao, and Ricardo Gutierrez-Osuna. 2022. Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. Computer Speech & Language 72 (2022), 101302.
[6] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 29 (2021), 3451–3460.
[7] Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, and Zhou Zhao. 2022. TranSpeech: Speech-to-speech translation with bilateral perturbation. arXiv preprint arXiv:2205.12523 (2022).
[8] Mark Huckvale. 2006. The new accent technologies: recognition, measurement and manipulation of accented speech. Beijing: Language and Culture Press.
[9] Dongya Jia, Qiao Tian, Jiaxin Li, Yuanzhe Chen, Kainan Peng, Mingbo Ma, Yuping Wang, and Yuxuan Wang. 2022. Non-parallel Accent Conversion using Pseudo Siamese Disentanglement Network. arXiv preprint arXiv:2212.05751 (2022).
[10] Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, and Yan Lu. 2023. Latent-Domain Predictive Neural Speech Coding. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023).
[11] John Kominek and Alan W. Black. 2004. The CMU Arctic speech databases. In SSW. ISCA, 223–224.
[12] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[13] Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, and Zejun Ma. 2020. Improving accent conversion with reference encoder and end-to-end text-to-speech. arXiv preprint arXiv:2005.09271 (2020).
[14] Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, and Lin-shan Lee. 2021. FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention. In ICASSP 2021. IEEE, 5939–5943.
[15] Songxiang Liu, Disong Wang, Yuewen Cao, Lifa Sun, Xixin Wu, Shiyin Kang, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, et al. 2020. End-to-end accent conversion without using native utterances. In ICASSP 2020. IEEE, 6289–6293.
[16] Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller. 2021. NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. arXiv preprint arXiv:2104.09494 (2021).
[17] Tuan Nam Nguyen, Ngoc-Quan Pham, and Alexander Waibel. 2022. Accent Conversion using Pre-trained Model and Synthesized Data from Voice Conversion. In Proc. Interspeech 2022. 2583–2587.
[18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In ICASSP 2015. IEEE, 5206–5210.
[19] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355 (2021).
[20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[21] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP 2018. IEEE, 4779–4783.
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[23] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 (2023).
[24] Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng. 2021. VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion. arXiv preprint arXiv:2106.10132 (2021).
[25] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. 2019. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92). https://doi.org/10.7488/ds/2645
[26] Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, and Daniel Povey. 2023. Zipformer: A faster and better encoder for automatic speech recognition. CoRR abs/2310.11230 (2023).
[27] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2022. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process. 30 (2022), 495–507.
[28] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In INTERSPEECH. ISCA, 1526–1530.
[29] Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna. 2019. Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams. In Interspeech. 2843–2847.
[30] Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna. 2021. Converting foreign accent speech without a reference. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 2367–2381.
[31] Guanlong Zhao, Sinem Sonsaat, John Levis, Evgeny Chukharev-Hudilainen, and Ricardo Gutierrez-Osuna. 2018. Accent conversion using phonetic posteriorgrams. In ICASSP 2018. IEEE, 5314–5318.
[32] Guanlong Zhao, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Evgeny Chukharev-Hudilainen, John Levis, and Ricardo Gutierrez-Osuna. 2018. L2-ARCTIC: A Non-native English Speech Corpus. In INTERSPEECH. ISCA, 2783–2787.
[33] Yi Zhou, Zhizheng Wu, Mingyang Zhang, Xiaohai Tian, and Haizhou Li. 2023. TTS-Guided Training for Accent Conversion Without Parallel Data. IEEE Signal Processing Letters (2023).
[34] Juan Zuluaga-Gomez, Sara Ahmed, Danielius Visockas, and Cem Subakan. 2023. CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice. Interspeech 2023 (2023). https://arxiv.org/abs/2305.18283
