Convert and Speak: Zero-Shot Accent Conversion With Minimum Supervision

Zhijun Jia* (Nanjing University, Nanjing, China) jiazhijun@smail.nju.edu.cn
Huaying Xue (Microsoft Research Asia, Beijing, China) huxue@microsoft.com
Xiulian Peng (Microsoft Research Asia)
Yan Lu (Microsoft Research Asia)
One kind of method learns the mapping to target-accent speech directly with parallel data in which the same speaker speaks the same content in two different accents. However, such data is extremely rare, so the main idea of this kind of method is to use voice conversion (VC) technology [14, 24] to synthesize the dataset by converting the speaker identity of the target-accent speech to that of the source speaker. Such end-to-end mapping-based methods require large amounts of strictly parallel data to achieve good conversion quality and generalization ability. However, such massive high-quality data can hardly be obtained, and distortions are introduced by the VC stage even when the VC models have been fine-tuned on the target AC dataset.

To relieve the dependence on parallel data, another kind of approach [9, 15, 33] leverages disentanglement technology to remove accent from content, speaker identity and prosody, and resynthesizes the target waveform through a synthesizer, e.g. a text-to-speech (TTS) model [21]. The synthesizer is trained on target-accent speech to generate speech with prosody in the target accent. To remove accent from content and speaker identity, some auxiliary models or tasks are carefully designed, e.g. an accent-agnostic automatic speech recognition (ASR) model or a phoneme classification task. Text transcriptions are heavily used in these solutions to supervise the accent-agnostic semantic representation. Such two-stage mapping-based methods still require large amounts of (text, accent speech) pairs combined with dedicated auxiliary tasks to achieve accent-agnostic semantic features and generate diverse speech in the target accent.

In this work, we propose a two-stage generative framework with a conversion stage and a speaking stage to achieve accent conversion. The conversion stage operates on the semantic level by generating semantic tokens in the target accent from the source accent. The speaking stage uses a generative synthesis model conditioned on the converted semantic tokens to generate speech with target-accent prosody. Splitting the AC task into these two sub-tasks and bridging them with semantic tokens extracted from raw speech enables the "speaking" module to be independent of parallel data and to use massive amounts of target-accent speech without text transcriptions, generating speech with good quality and diversity in the target accent domain. Meanwhile, it makes it easier for the conversion part to learn only the pronunciation pattern/phoneme differences, with a small amount of weakly parallel data that is not constrained to the same speaker.

Both stages are seq2seq tasks based on decoder-only Transformer architectures [22]. Inspired by ideas from the machine translation community for reducing the need for supervision, we leverage BART/T5-style pre-training [12] to significantly reduce the amount of parallel supervision required to train the conversion part. Such pre-training with a pretext task on target-accent data is designed to learn the pattern of generating semantic units, e.g. the joint probability of phonetic units in the target accent space. Furthermore, to reduce the complexity and latency of the speech generation process, we design a single-stage autoregressive generative model which generates all vector quantizers (VQs) in one step based on TF-Codec [10].

Our contributions: (i) We propose a state-of-the-art generative framework for accent conversion which is capable of converting the prosody pattern as well as the pronunciation units, as evaluated with objective and subjective metrics on a public Indian-English to general American-English accent conversion test set. (ii) We use the pre-training technology on the conversion part to largely reduce the amount of parallel supervision to only 15 minutes of weakly parallel data. (iii) This framework can be easily extended to other low-resource accents, as shown by our experiments on Chinese-English and Korean-English accents. (iv) We propose a single-stage speech generative model based on TF-Codec with better speech quality and speaker similarity at lower computation cost and latency compared with the multi-stage generation process in other generative models based on EnCodec (proposed: 50 AR steps per second of audio vs. EnCodec-based: 75 AR steps + 7 NAR steps per second of audio).

The paper is organized as follows. Section 2 introduces the background of accent conversion and speech generative models. Section 3 introduces the proposed framework in detail. Section 4 validates the performance of the proposed framework, mainly on Indian-English to general American-English accent conversion, and extensive experiments are undertaken to substantiate the efficacy of our model design. Section 5 concludes the paper.

2 Background

2.1 Accent conversion

For the accent conversion task, there is no public parallel corpus that contains pairs of audios with the same content spoken by the same speakers in different accents, so mainly two kinds of methods have been proposed in the literature. One is to synthesize a dataset containing pairs of audios in the same voice but in two different accents with a separate voice conversion model, and to learn the acoustic mapping between them. [30] builds a golden-speaker utterance by converting the general American-English speaker's voice to the source-accent speaker's with a pretrained synthesizer of the source speaker, and then uses this golden-speaker utterance as the target to learn a mel-spectrogram mapping with a seq2seq VC system. [17] uses a pretrained VC model to build the parallel data and trains the AC model, based on Tacotron [21], conditioned on the semantic representation extracted from wav2vec 2.0 [1]. This end-to-end mapping-based approach needs large amounts of data to achieve a good zero-shot ability, and the auxiliary VC model usually needs to be fine-tuned on the AC dataset to alleviate the error caused by the voice conversion step. These methods also constrain the output to the same length as the input, which limits the conversion quality since the prosody of the speech is largely affected by the accent.

Another approach is to regard accent conversion as a decomposition and resynthesis task in which the accent is separated from content, speaker identity and prosody, and the target waveform is resynthesized in a TTS manner. [15] disentangles the different features in multiple stages with several off-the-shelf models. Specifically, an accent-robust ASR model is trained using source-accent speech with text labels to separate the source accent from the content, and a multi-speaker TTS model with a global speaker encoder is trained with a large corpus of target-accent speech to map the accent-agnostic linguistic features to acoustic features with the voice of the source speaker and target-accent prosody. Similarly, [33] learns the semantic embeddings directly from the accented speech with the supervision of text embeddings extracted from text labels. Such a two-stage mapping-based non-parallel AC approach relieves the burden on the parallel
data but needs to leverage large amounts of (text, accent speech) data and dedicated auxiliary tasks. With regard to performance, these methods are not good enough in conversion quality and diversity. They also suffer from poor zero-shot generalization ability given the limited high-quality data available, and the same accent speaker is used in their training and testing. Another work [9] treats the decomposition and resynthesis in an end-to-end manner. It designs a Pseudo Siamese Disentanglement Network (PSDN) with two streams, in which one stream learns the acoustic features of the target-accent speech and the other, auxiliary stream builds an information gap with the target stream to disentangle the content from the accent, complemented by an adversarial accent classifier with a gradient reversal layer (GRL). This framework can be used in the zero-shot scenario, but its performance is unclear since the demo page is out of date and the metrics in the paper cannot be compared with other public works.

Compared to prior AC approaches, the proposed generative framework requires neither large amounts of parallel data spoken by the same speaker nor supervision from text labels or auxiliary tasks to achieve a good conversion quality.

2.2 Speech generative models

Recently, speech generative models have shown large potential in generating contextually consistent, natural and diverse audio/speech based on a speech neural codec with the in-context learning of a reference prompt. AudioLM [2], designed for zero-shot audio generation, uses SoundStream [27] codes as the intermediate representation of acoustic features for speech synthesis. It also shows a strong ability of in-context learning with a short prompt to maintain acoustic information such as speaker identity, prosody style and acoustic environment in the continuations. VALL-E [23], verified on the TTS task and based on EnCodec [4] tokens, has also shown better zero-shot ability, speech naturalness and diversity than non-generative TTS models.

For the model structure, these models take the decoder-only autoregressive Transformer structure [22] to build the conditional correlation between the acoustic features tokenized by a neural speech codec and the semantic tokens/phoneme sequences. When generating codes, they usually need multiple stages, since the neural codecs they use are built on residual vector quantization (RVQ), which consists of a hierarchy of VQs. In AudioLM, the first several quantizer layers are predicted in the first stage to get the coarse information of the speech, and the remaining layers are predicted based on the coarse layers to get the fine details of the speech. VALL-E simplifies the generation process by replacing the second stage with a non-autoregressive (NAR) one: the first quantizer is generated with the AR model and the others are generated with the NAR model (all frames are predicted simultaneously when predicting each codebook) based on the previous quantizers.

Different from these existing generative models, we propose a one-stage AR generative model which generates all VQs in one step to achieve lower complexity and latency as well as better quality.

In the proposed framework, the source speech is first converted at the semantic token level, and the speech with the source speaker's voice and target-accent prosody is re-synthesized in the target accent with a speech generative model. Specifically, we use a pre-trained self-supervised speech representation model, e.g. HuBERT [6], to extract discrete semantic tokens. A neural-codec-based speech generative model is then used to generate the acoustic codes of the codec conditioned on the converted semantic tokens, with a style prompt to maintain the source speaker's voice. Both the conversion and generative models are based on an autoregressive decoder-only Transformer structure. More details are discussed in Section 3.2 and Section 3.3.

3.2 Semantic token conversion

The conversion module is designed as a seq2seq task in the discrete semantic token space, in which the source-accent semantic tokens are converted to target-accent semantic tokens iteratively in an autoregressive manner. To handle the shortage of parallel data, inspired by BART and T5 [12, 20], we use large amounts of target-accent data to pre-train the conversion module with a pretext task, and then fine-tune the conversion module with a small amount of weakly parallel data.

Pre-training. In our scenario, the pretext task is designed to build the probability space of discrete semantic tokens in the target accent domain, so that target-accent semantic tokens can be generated according to the context of previous tokens in the target accent domain in a closed-loop manner. In this pretext task, the model is trained in a self-supervised manner to reproduce the original token sequence Y = {y_0, ..., y_t}, t < T conditioned on the corrupted token sequence Ȳ = {ȳ_0, ..., ȳ_t}, t < T, formulated as

    p(Y \mid \bar{Y}; \theta_{AR}) = \prod_{t=0}^{T} p(y_t \mid y_{<t}, \bar{Y}; \theta_{AR})    (1)

We have experimented with corruptions such as token masking, token deletion and token in-filling, and we find that the token in-filling scheme works best. Specifically, following the text in-filling scheme in BART [12], a number of spans are sampled according to Bernoulli(p) with p = 0.5, and the span lengths are drawn from a Poisson distribution (λ = 5). We train the pretext task with large amounts of target-accent data, which is available in public corpora.
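As a concrete illustration of the corruption scheme just described, the minimal sketch below masks spans of semantic tokens. The Bernoulli(p = 0.5) span selection and Poisson(λ = 5) span lengths come from the text above; the exact bookkeeping (a per-position Bernoulli trial, zero-length draws treated as length 1, one mask token per span) is our assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def infill_corrupt(tokens, mask_id, p=0.5, lam=5.0):
    """Corrupt a semantic-token sequence by BART-style token in-filling."""
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < p:              # Bernoulli(p): start a masked span here
            span = int(rng.poisson(lam))  # span length drawn from Poisson(lambda=5)
            out.append(mask_id)           # the whole span is replaced by one mask
            i += max(span, 1)             # treat a zero-length draw as length 1
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy usage with 500-way HuBERT cluster ids; the mask id lies outside the codebook.
original = list(rng.integers(0, 500, size=20))
corrupted = infill_corrupt(original, mask_id=500)
# The conversion model is pre-trained to reconstruct `original` from `corrupted` (Eq. 1).
```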
Fine-tuning. Since some phonemes in the source accent need to be converted to the target-accent ones, a mapping between these phonemes needs to be learned. Specifically, we fine-tune the pre-trained conversion model, conditioned on the semantic tokens in the source accent, with a small amount of weakly parallel accent data. Correspondingly, the training can be formulated as

    p(Y \mid X; \theta_{AR}) = \prod_{t=0}^{T} p(y_t \mid y_{<t}, X; \theta_{AR})    (2)

in which X = {x_0, ..., x_t}, t < T is the source-accent semantic token sequence and Y = {y_0, ..., y_t}, t < T is the target-accent semantic token sequence.
Figure 1: Proposed framework. The source-accent semantic tokens are converted to target-accent semantic tokens in the first stage, and the speech is generated with target-accent prosody conditioned on the converted semantic tokens in the second stage. The style prompt is extracted from the first 3 seconds of the source speech. A TF-Codec token is the concatenation of the embeddings of each quantizer.
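To make the two-stage data flow in Figure 1 concrete, the sketch below outlines the inference path. It is a hedged reading aid, not the authors' implementation: the objects and method names (tokenize, convert_lm, speak_lm, tf_codec and their generate/encode/decode calls) are hypothetical stand-ins for the components described in Sections 3.2 and 3.3.

```python
def convert_and_speak(source_wav, tokenize, convert_lm, speak_lm, tf_codec,
                      sample_rate=16_000, prompt_seconds=3):
    # 1) Source speech -> discrete semantic tokens (HuBERT + k-means).
    src_tokens = tokenize(source_wav)

    # 2) Conversion stage: rewrite the tokens into the target-accent token
    #    space autoregressively (Eq. 2); top-k sampling with k = 2 (Sec. 4.1).
    tgt_tokens = convert_lm.generate(src_tokens, top_k=2)

    # 3) Speaking stage: the first 3 s of the source speech serve as the style
    #    prompt; the TF-Codec codes of all K quantizers of each frame are
    #    generated in a single AR pass (Eq. 3); top-k sampling with k = 10.
    style_prompt = tf_codec.encode(source_wav[: prompt_seconds * sample_rate])
    acoustic_codes = speak_lm.generate(tgt_tokens, prompt=style_prompt, top_k=10)

    # 4) The TF-Codec decoder reconstructs the waveform: target-accent prosody
    #    and pronunciation with the source speaker's voice.
    return tf_codec.decode(acoustic_codes)
```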
The speech generative model generates the acoustic tokens of TF-Codec iteratively through single-stage causal speech generation, conditioned on the converted/target-accent semantic tokens.

3.3.1 Speech tokenizer with TF-Codec. We use the pre-trained causal speech neural codec TF-Codec to extract the acoustic token of each frame. Unlike [10], we remove the predictive loop and use the non-predictive model at 6 kbps for efficient acoustic modeling with high-quality output. Specifically, TF-Codec takes the 16 kHz magnitude-compressed time-frequency spectrum with a window length of 20 ms and a hop length of 5 ms as input. A stack of 2D causal convolutional layers, followed by a temporal convolutional module (TCM) and a gated recurrent unit (GRU) block, is then used to capture the short-term and long-term temporal dependencies between the input frames in a causal manner. For the quantization, it combines 4 frames together, producing a frame rate of 50 Hz. Instead of RVQ, it employs group quantization, where the latent embedding is split into K groups and each group is quantized by a vector quantizer with a codebook of 1024 codewords. All K acoustic codes are concatenated and decoded to get the reconstructed waveform.

3.3.2 Single-stage causal speech generation. As the group quantization in TF-Codec encodes each group independently, we leverage single-stage causal speech generation to generate the acoustic codes of all K quantizers simultaneously for each frame. As shown in Figure 1, the TF-Codec token, i.e. the concatenated embeddings corresponding to all K quantizers, is generated in a one-stage autoregressive manner conditioned on the target-accent/converted semantic tokens and the style acoustic tokens. Each group embedding in the TF-Codec token has dimension D_token/K, in which D_token is the dimension of the embedding in the Transformer. K classification heads are employed to predict the K acoustic codes of the current frame separately. The training target can be formulated as

    p(C \mid Y, \tilde{C}; \theta_{AR}) = \prod_{t=0}^{T} p(c_t \mid c_{<t}, Y, \tilde{C}; \theta_{AR})    (3)

in which Y = {y_0, ..., y_t}, t < T is the semantic token sequence from the target-accent speech, C is the TF-Codec token sequence of the target-accent speech, and C̃ is the TF-Codec token sequence of the style acoustic prompt. We do not distinguish C̃ from C in training; their concatenation is treated as one whole sequence. During inference, the first 3 seconds of the source speech are used as C̃.
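A minimal PyTorch sketch of this single-stage prediction is given below. It keeps only the K input embedding tables, a causal backbone and the K classification heads; the conditioning on semantic tokens and the style prompt is omitted, and 2 Transformer layers stand in for the 12 layers used in Section 4.1, so this is an illustration of the idea rather than the actual model.

```python
import torch
import torch.nn as nn

class SingleStageCodecLM(nn.Module):
    """Predict the codes of all K TF-Codec quantizer groups in one AR step."""

    def __init__(self, num_groups=16, codebook_size=1024, d_token=1024,
                 num_layers=2):
        super().__init__()
        assert d_token % num_groups == 0
        d_group = d_token // num_groups
        # One embedding table per quantizer group; a frame embedding is the
        # concatenation of the K group embeddings (dimension d_token).
        self.code_embed = nn.ModuleList(
            [nn.Embedding(codebook_size, d_group) for _ in range(num_groups)])
        layer = nn.TransformerEncoderLayer(d_model=d_token, nhead=16,
                                           dim_feedforward=4096, dropout=0.1,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # K classification heads, one distribution over 1024 codewords each.
        self.heads = nn.ModuleList(
            [nn.Linear(d_token, codebook_size) for _ in range(num_groups)])

    def forward(self, codes):
        # codes: (batch, frames, K) integer TF-Codec codes.
        frame_emb = torch.cat(
            [emb(codes[..., k]) for k, emb in enumerate(self.code_embed)],
            dim=-1)
        causal = torch.triu(torch.full((codes.size(1), codes.size(1)),
                                       float("-inf")), diagonal=1)
        h = self.backbone(frame_emb, mask=causal)
        return torch.stack([head(h) for head in self.heads], dim=2)

model = SingleStageCodecLM()
codes = torch.randint(0, 1024, (2, 50, 16))   # 2 utterances, 1 s of audio at 50 Hz
logits = model(codes)                         # (2, 50, 16, 1024)
# Teacher forcing: each frame predicts all 16 codes of the next frame at once.
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1024),
                                   codes[:, 1:].reshape(-1))
```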
4 Experiments

To evaluate the performance of the proposed framework, we take Indian-English as the source accent and general American-English as the target accent, which is a common scenario in the research literature.

4.1 Experimental Setup

Dataset. To train the speech generative model and pre-train the conversion model, the LibriTTS dataset [28] is used as our training data. The dataset contains approximately 585 hours of general American-English speech, sourced from audiobooks available in the public LibriVox project. To fine-tune the conversion model, the L1-L2 ARCTIC dataset [11, 32] is used, in which accented speakers speak the same content. To build the parallel data, we select a general American-English speaker named "bdl" as the target-accent speaker and "ASI" as the Indian-English speaker. Among all their utterances, 1000 utterances (about 50 minutes of speech) are used in training, 50 utterances are used in validation and the remaining 100 utterances are used for testing. To better verify the zero-shot ability, we also add speaker p248 from the VCTK dataset [25] and another 4 Indian-English speakers from the L1-L2 ARCTIC dataset, not used in training, into testing. To be noted, the 20 utterances from speaker p248 in VCTK are used to compare with the existing machine-learning based AC method [15]. Besides these 20 cases, we add another 20 utterances of each testing speaker in L1-L2 ARCTIC for objective evaluation, i.e. 120 cases in total, and a random 8 utterances per speaker for subjective evaluation, i.e. 60 cases in total.
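As an illustration of how the weakly parallel fine-tuning pairs could be assembled, the sketch below pairs utterances by file name, assuming the usual arctic_XXXX prompt naming shared by CMU ARCTIC ("bdl") and L2-ARCTIC ("ASI"); the directory paths and the extract_semantic_tokens helper are hypothetical stand-ins for the HuBERT tokenizer described later in this section.

```python
from pathlib import Path

def build_finetune_pairs(l2_arctic_dir, cmu_bdl_dir, extract_semantic_tokens):
    """Pair ASI (Indian-English) and bdl (general American-English) utterances."""
    pairs = []
    for src_wav in sorted(Path(l2_arctic_dir).glob("arctic_*.wav")):
        tgt_wav = Path(cmu_bdl_dir) / src_wav.name       # same ARCTIC prompt id
        if tgt_wav.exists():
            pairs.append((extract_semantic_tokens(src_wav),    # source tokens X
                          extract_semantic_tokens(tgt_wav)))   # target tokens Y
    return pairs

# e.g. build_finetune_pairs("L2-ARCTIC/ASI/wav", "cmu_us_bdl_arctic/wav", tokenize)
```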
Table 1: Evaluation on the VCTK test set (20 cases from speaker p248, as in Liu et al.'s setup). SPK of the accent source is computed on different utterances of the source speaker.

Table 2: Evaluation on the L1-L2 ARCTIC test set. LCSR of the ground-truth speech is calculated between ground-truth utterances of different speakers.
Model and configuration. For the semantic tokenizer, we employ the HuBERT-Base model (https://huggingface.co/facebook/hubert-base-ls960) and the k-means algorithm with 500 clusters to extract semantic tokens. It is trained on LibriSpeech [18], which mostly consists of general American-English, and generates a discrete semantic token sequence at a 50 Hz frame rate for 16 kHz audio. Previous studies [7, 19] show that HuBERT is a good representation of speech content and removes most of the speaker identity, which is why we can use weakly parallel data in the fine-tuning stage of the conversion module. For the acoustic tokenizer, the number of quantizers in TF-Codec (K) is set to 16. The Transformer used in the conversion model and the generative model has the same structure: 12 layers with 16 attention heads, a feed-forward layer with dimension 4096, and a dropout rate of 0.1. The embedding dimension in the Transformer (D_token) is 1024. The generative model and the pre-training stage of the conversion model are trained on 8 NVIDIA Tesla V100 32GB GPUs with a batch size of 4k tokens per GPU. The ScaledAdam [26] optimizer is used; the learning rate is set to 0.01, with a warmup over the first 5k steps followed by exponential decay. The speech generative model is trained for 500k steps and the conversion model for 100k steps. The fine-tuning stage of the conversion model is run on one NVIDIA Tesla A100 80GB GPU with a batch size of 20k tokens, using the same optimizer; the learning rate is set to 2 × 10^-5, with a warmup over the first 160 steps, and fine-tuning runs for 1k steps. During inference, we employ top-k sampling to generate each token, with k = 2 for the conversion model and k = 10 for the speech generative model. For each case, we infer 5 times and select the output with the best LCSR metric.
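A minimal sketch of this semantic tokenizer is shown below. The choice of hidden layer (layer 9) and the use of MiniBatchKMeans are our assumptions; the paper specifies only HuBERT-Base and a 500-cluster k-means codebook, and in practice the codebook is fitted on far more data than the couple of files used here.

```python
import numpy as np
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def hubert_features(wav_path, layer=9):
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav[:1], sr, 16_000)   # mono, 16 kHz
    with torch.no_grad():
        hidden = hubert(wav, output_hidden_states=True).hidden_states
    return hidden[layer].squeeze(0).numpy()       # (frames, 768) at 50 Hz

# Fit the 500-way codebook on features pooled from target-accent speech.
train_feats = np.concatenate(
    [hubert_features(p) for p in ["libritts_a.wav", "libritts_b.wav"]])
kmeans = MiniBatchKMeans(n_clusters=500).fit(train_feats)

# The semantic tokens of an utterance are its per-frame nearest cluster ids.
tokens = kmeans.predict(hubert_features("source_accent_utterance.wav"))
```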
Baseline models. To show the superiority of the proposed framework, we select 3 models as our baselines: the existing machine-learning based AC method [15], which is the best model publicly available to our knowledge, and two generative-only models without the conversion module, namely the commonly-used EnCodec-based multi-stage generative model and the proposed single-stage TF-Codec based generative model.

Evaluation methods on accent similarity. To evaluate the performance of accent conversion, both objective and subjective metrics are used. Intuitively, we use the Longest Common Subsequence (LCS) to evaluate the similarity between the converted semantic token sequence and the referenced target semantic token sequence. To eliminate the disturbance of the duration of each word, the duplicated tokens in the sequence are removed. To remove the effect of the utterance length in the final averaged statistics, the Longest Common Subsequence Ratio (LCSR = LCS/utterance_length) is used, where the smaller utterance length of the testing pair is taken. We also use the latest state-of-the-art English accent classification model, CommonAccent [34] (https://huggingface.co/Jzuluaga/accent-id-commonaccent_ecapa), to identify the accent of the synthesized speech. Besides, we conduct a subjective A/B test in which participants are asked to choose the sample that sounds closer to the general American-English accent in an A/B pair. Each A/B pair contains cases chosen from any two of the competitors. The accent source and ground truth are also included to ensure the validity
of the testing. To remove the potential factor of speaker identity in the subjective testing, in Table 2, both the accent source and the referenced ground truth are chosen from multiple speakers' utterances, i.e. 5 Indian-English speakers and 4 general American-English speakers from the L1-L2 ARCTIC test set. To be noted, the referenced ground truth here is the same sentence spoken by different speakers. In particular, the participants are trained to distinguish the accent difference by listening to several pairs of <Indian-English, general American-English> samples before the formal testing. 20 participants who are proficient in general American-English are invited to conduct these evaluations. MOS-Accent, the percentage of being selected as the general American-English accent, is used as the metric of this subjective testing.
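The LCSR computation described above can be sketched as follows; we assume consecutive duplicates are removed before both the LCS and the length normalization, which the text leaves implicit.

```python
def lcsr(tokens_a, tokens_b):
    """LCSR between two semantic-token sequences, as defined in Sec. 4.1."""
    def dedup(seq):
        # Drop consecutive duplicates to discount per-token duration.
        return [t for i, t in enumerate(seq) if i == 0 or t != seq[i - 1]]

    a, b = dedup(tokens_a), dedup(tokens_b)
    # Classic O(len(a) * len(b)) dynamic program for the LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j],
                                                               dp[i][j - 1])
    # Normalize by the shorter (de-duplicated) sequence length.
    return dp[len(a)][len(b)] / min(len(a), len(b))
```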
Evaluation methods on speaker similarity and speech quality. To evaluate the speaker identity maintenance, the speaker similarity metric (SPK) is calculated as the cosine similarity of the two speaker vectors extracted from the source-accent speech and the converted speech, respectively. WavLM-TDNN [3] (https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification), a state-of-the-art speaker verification model, is used to get the speaker vector from a speech sample. To evaluate the naturalness, we use NISQA-TTS [16], which is commonly used for synthesized speech. We also conduct MOS testing, in which the raters are asked to give a score ranging from 1 (lowest quality) to 5 (highest quality) according to the overall subjective quality. MOS-Naturalness with a 95% confidence interval is used as the metric.

4.2 Results

Accent similarity. As shown in Table 1 and Table 2, the MOS-Accent metric of the proposed framework on both datasets ranks the highest and is very close to the ground truth. Compared with the generative-only models, i.e. Generative-only model (EnCodec) and Generative-only model (TF-Codec), the proposed framework highly surpasses them in accent conversion performance, indicating the effectiveness of the conversion module. This is also verified by the LCSR metric on the L1-L2 ARCTIC test set in Table 2, where the proposed framework closely approaches the ground-truth LCSR after conversion. The analysis of HuBERT with accented input in Section 4.5.1 also shows that the HuBERT tokens are affected by the source Indian-English accent, especially for accented speech with phonetic changes, indicating the necessity of the semantic conversion module. Figure 2 and Figure 3 show the accent classification results from CommonAccent. Compared with other methods, the proposed framework converts most of the Indian-English accent input to the target general American-English accent. In Figure 3, for the remaining 11% of cases which are identified as Indian-English accent, besides classification error, most of the cases are short and the conversion quality of some cases is not good enough, which leaves room for further improvement in the robustness of the generative framework. For Liu et al. [15]'s method, it is interesting to find a big gap between the classification metric (as bad as the accent source), as shown in Figure 2, and MOS-Accent (relatively good), as shown in Table 1. We think the bad result in terms of the classification metric comes from its poor conversion ability on the pronunciation units: most of the pronunciation units are not converted well, e.g. the pair ('b', 'p'). For prosody conversion, the quality is relatively good compared with the source accent, which contributes to the high subjective metric. Figure 4 shows an example of the pitch contour and phoneme improvements of the converted speech. The audio samples can be found on our demo page. Overall, the proposed framework achieves much better accent conversion performance, as verified by both the objective classification metric and the subjective metric.

Figure 2: Accent classification results for the VCTK test set, evaluated by CommonAccent.

Figure 3: Accent classification results for the L1-L2 ARCTIC test set, evaluated by CommonAccent.

Speech quality and speaker similarity. According to the NISQA-TTS and MOS-Naturalness metrics in Table 1 and Table 2, the proposed framework ranks at the top level. Compared with Liu's method, the proposed framework achieves much better speech quality and speaker similarity; artifacts of Liu's model can be found in some cases, as shown on the demo page. What's more, we find that better speech quality and speaker similarity can be achieved
with the TF-Codec based generative model, according to the comparison between Generative-only model (TF-Codec) and Generative-only model (EnCodec). This can also be verified by our demo cases. Compared with Generative-only model (TF-Codec), the SPK value of the proposed framework drops a bit, but the subjective judgement on the demo cases is quite similar. We think this is caused by errors from the speaker vector extractor WavLM-TDNN: since WavLM-TDNN is not trained on accented speech, the extracted vector contains not only the speaker identity but also accent information. So with better accent conversion, the speaker vectors of the converted speech and the accent source speech tend to become more different, resulting in a lower cosine similarity. It is therefore more reasonable to use this metric to compare Generative-only model (EnCodec) with Generative-only model (TF-Codec), and the proposed framework with Liu's method, since each pair is consistently with or without accent leakage in the converted speech.

4.3 Efficiency of single-stage causal speech generation

Here we compare the complexity of the proposed single-stage causal speech generation scheme based on TF-Codec with the multi-stage speech generation scheme based on EnCodec. In the EnCodec-based generative models, two stages are usually taken: an autoregressive (AR) stage to generate the first quantizer and a NAR stage to generate the rest of the quantizers of all time steps based on the previous quantizers. The EnCodec used in the experiment is composed of 8 quantizers with a frame rate of 75 Hz and a sample rate of 24 kHz. The complexity is shown in Table 3 in terms of model parameters and decoding steps. According to Table 3, the TF-Codec based generative model saves more than 50% in model size and takes a purely causal decoding scheme with fewer steps. The overall RTF is 2.1 on an NVIDIA RTX A6000 GPU with a 3 s prompt.

Table 3: Complexity comparison of speech generative module.

    Framework                           Model parameters (M)    Decoding steps (/s)
    Generative-only model (EnCodec)     262.3                   75 AR + 7 NAR
    Generative-only model (TF-Codec)    100.8                   50 AR

4.4 Training with minimum supervision

In this section, we further reduce the parallel data used in the fine-tuning stage of the conversion model, from 50 minutes (proposed in Table 1 and Table 2) to 30 minutes and 15 minutes, respectively. We use MOS-Accent as the evaluation metric and test on the VCTK test set. We also add the baseline models into the A/B testing for comparison. As shown in Table 4, the performance drop from decreasing the data amount is negligible. With the minimum supervision of 15 minutes, the performance is still relatively good, which shows high potential for extension to other accents with low-resource data, such as the Chinese-English and Korean-English to general American-English accent cases in Section 4.7.

Table 4: Accent conversion quality with parallel data of 50 mins, 30 mins, 15 mins.

    Parallel data amount    MOS-Accent (↑)
    50 mins                 59.3%
    30 mins                 58.1%
    15 mins                 56.8%

4.5 Supportive analysis

4.5.1 HuBERT tokens from accent speech. In this section, we evaluate how accent affects the HuBERT tokens. Specifically, we build a parallel data set from the L1-L2 ARCTIC dataset in which the Indian-English speaker and the general American-English speaker speak the same content. Both are fed into the HuBERT model used in this paper to get the semantic token sequences. The LCSR metric is used to evaluate the content similarity between these two HuBERT token sequences. 1000 pairs are used in this experiment. To further study the phoneme change effect, the accent cases are divided into those with phoneme changes and those without. Since the speakers of each pair are different, the effect of the speaker identity on the LCSR metric is also calculated as a reference. As Table 5 shows, with the source accent introduced, the HuBERT tokens change a lot, degrading from 0.747 to 0.569 in terms of LCSR. For cases with specific phoneme changes, even more tokens are changed from the target accent references (LCSR: 0.541).

4.5.2 Accent effect on the style prompt in speaking module. We use the accent source as the style prompt in the speech generative model to extract the speaker identity of the source speaker. This section evaluates whether the accent feature is extended through the in-context learning of the speech generative model. We conduct an empirical study to substantiate such usage. Specifically, we design an A/B test to compare the accent similarity of the synthesized speech generated with two kinds of prompts in different accents conditioned on the same content. For testing, we take Indian-English and general American-English as the two prompt types for comparison. We build 100 pairs of samples to test. Each pair contains an utterance in general American-English accent which is used to extract the HuBERT semantic tokens, and an utterance from a general American-English speaker and one from an Indian-English speaker working as the style prompts. The prompts are cut to 3 seconds. Examples can be found on our demo page. For subjective testing, 20 participants who are college students majoring in American English are asked to distinguish the two synthesized speeches and choose the one which sounds closer to general American-English. The percentage of being selected as general American-English accent is used as the evaluation metric. We also use CommonAccent to identify the two synthesized speeches. As Table 6 shows, no matter whether the prompt is general American-English or Indian-English, the percentage selected as general American-English by users is about 50% and almost all the synthesized speeches are identified as general American-English. Furthermore, we find accent prompts do have an effect on the prosody modeling, but it is quite limited. According to our experiments on the effect of accent prompt length, as the prompt in the Indian-English accent becomes longer, increasing from 3 s
to 7 s, the percentage of predicted general American-English accent drops from 84% to 73%. So we can use 3 seconds of the accent source as a prompt to capture the source speaker's identity without bringing the source accent back into the converted speech.

Table 5: HuBERT tokens from accent speech. Lower LCSR means less similarity between source-accent speech and target-accent speech. Source accent: Indian-English. Target accent: general American-English.

Table 6: Comparison of the synthesized speech with two accent prompt types: general American-English and Indian-English. The CommonAccent metric shows the percentage of being predicted as general American-English. The A/B testing metric shows the percentage of being selected as general American-English in the A/B pair.

Figure 4: An example of pitch contour and phoneme improvement. (Content: "It's also very valuable.")

4.6 Ablation Study

Decoupling design. We compare with the solution in which the parallel data is used to fine-tune the generative-only model directly. In such a way, the model is guided to learn the phoneme and prosody conversion simultaneously and blindly through an AR Transformer model. The experiments show that the LCSR of the generative-only model without decoupling on the L1-L2 ARCTIC test set drops sharply to 0.02 (0.622 for the proposed), indicating that most of the content has been destroyed and the model fails to learn such a mapping with so little parallel data.

The effect of pre-training for the semantic conversion module. To verify the validity of the language pre-training technology used for the semantic conversion module, we compare it with the solution where the conversion model is trained from scratch with the weakly parallel data. All parallel data (about 50 mins) are used for training. Without pre-training, the results degrade by a large margin to 0.103 in LCSR (0.622 for the proposed). This is reasonable since the pre-training stage lays a good foundation, letting the fine-tuning stage focus on learning only the few semantic units that differ between the two accents.

5 Conclusions

In this work, we propose a two-stage generative framework for the accent conversion task, in which the conversion is operated on the semantic token level and the speech is synthesized in the target accent with a TF-Codec based generative model. Experimental results show that the proposed framework achieves state-of-the-art performance in terms of accent similarity, speech quality and speaker maintenance with limited parallel data. With the language pre-training technology, only 15 minutes of parallel data, not constrained to the same speaker, reaches a good conversion quality, which shows large potential for easy extension to other accents with low-resource data. The proposed single-stage AR generative model achieves better speech quality at lower complexity and can be used for other speech generative tasks. In the future, we will further improve the robustness of the generative framework for the AC task.
References

[1] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449–12460.
[2] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2023. AudioLM: A Language Modeling Approach to Audio Generation. IEEE/ACM Trans. Audio Speech Lang. Process. 31 (2023), 2523–2533.
[3] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 16, 6 (2022), 1505–1518.
[4] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High Fidelity Neural Audio Compression. CoRR abs/2210.13438 (2022).
[5] Shaojin Ding, Guanlong Zhao, and Ricardo Gutierrez-Osuna. 2022. Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. Computer Speech & Language 72 (2022), 101302.
[6] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 29 (2021), 3451–3460.
[7] Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, and Zhou Zhao. 2022. TranSpeech: Speech-to-speech translation with bilateral perturbation. arXiv preprint arXiv:2205.12523 (2022).
[8] Mark Huckvale. 2006. The new accent technologies: recognition, measurement and manipulation of accented speech. Beijing: Language and Culture Press.
[9] Dongya Jia, Qiao Tian, Jiaxin Li, Yuanzhe Chen, Kainan Peng, Mingbo Ma, Yuping Wang, and Yuxuan Wang. 2022. Non-parallel Accent Conversion using Pseudo Siamese Disentanglement Network. arXiv preprint arXiv:2212.05751 (2022).
[10] Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, and Yan Lu. 2023. Latent-Domain Predictive Neural Speech Coding. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023).
[11] John Kominek and Alan W. Black. 2004. The CMU Arctic speech databases. In SSW. ISCA, 223–224.
[12] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[13] Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, and Zejun Ma. 2020. Improving accent conversion with reference encoder and end-to-end text-to-speech. arXiv preprint arXiv:2005.09271 (2020).
[14] Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, and Lin-shan Lee. 2021. FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention. In ICASSP 2021. IEEE, 5939–5943.
[15] Songxiang Liu, Disong Wang, Yuewen Cao, Lifa Sun, Xixin Wu, Shiyin Kang, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, et al. 2020. End-to-end accent conversion without using native utterances. In ICASSP 2020. IEEE, 6289–6293.
[16] Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller. 2021. NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. arXiv preprint arXiv:2104.09494 (2021).
[17] Tuan Nam Nguyen, Ngoc-Quan Pham, and Alexander Waibel. 2022. Accent Conversion using Pre-trained Model and Synthesized Data from Voice Conversion. In Proc. Interspeech 2022. 2583–2587.
[18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In ICASSP 2015. IEEE, 5206–5210.
[19] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355 (2021).
[20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[21] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP 2018. IEEE, 4779–4783.
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[23] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 (2023).
[24] Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng. 2021. VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion. arXiv preprint arXiv:2106.10132 (2021).
[25] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. 2019. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92). https://doi.org/10.7488/ds/2645
[26] Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, and Daniel Povey. 2023. Zipformer: A faster and better encoder for automatic speech recognition. CoRR abs/2310.11230 (2023).
[27] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2022. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process. 30 (2022), 495–507.
[28] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In INTERSPEECH. ISCA, 1526–1530.
[29] Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna. 2019. Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams. In Interspeech 2019. 2843–2847.
[30] Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna. 2021. Converting foreign accent speech without a reference. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 2367–2381.
[31] Guanlong Zhao, Sinem Sonsaat, John Levis, Evgeny Chukharev-Hudilainen, and Ricardo Gutierrez-Osuna. 2018. Accent conversion using phonetic posteriorgrams. In ICASSP 2018. IEEE, 5314–5318.
[32] Guanlong Zhao, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Evgeny Chukharev-Hudilainen, John Levis, and Ricardo Gutierrez-Osuna. 2018. L2-ARCTIC: A Non-native English Speech Corpus. In INTERSPEECH. ISCA, 2783–2787.
[33] Yi Zhou, Zhizheng Wu, Mingyang Zhang, Xiaohai Tian, and Haizhou Li. 2023. TTS-Guided Training for Accent Conversion Without Parallel Data. IEEE Signal Processing Letters (2023).
[34] Juan Zuluaga-Gomez, Sara Ahmed, Danielius Visockas, and Cem Subakan. 2023. CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice. In Interspeech 2023. https://arxiv.org/abs/2305.18283