
END-TO-END MULTILINGUAL AUTOMATIC SPEECH RECOGNITION FOR LESS-RESOURCED LANGUAGES: THE CASE OF FOUR ETHIOPIAN LANGUAGES

Solomon Teferra Abate*, Martha Yifiru Tachbelie*, Tanja Schultz

CSL, University of Bremen, Germany
SIS, Addis Ababa University, Ethiopia
{solomon.teferra, martha.yifiru}@aau.edu.et, tanja.schultz@uni-bremen.de

*The authors would like to thank the Alexander von Humboldt Foundation for a research fellowship.

ABSTRACT

The End-to-End (E2E) approach to Automatic Speech Recognition (ASR), which maps a sequence of input features into a sequence of graphemes or words, is an active research area. It is interesting for less-resourced languages since it avoids the use of a pronunciation dictionary, one of the major components of traditional ASR systems. However, like other deep neural network (DNN) approaches, E2E is data greedy, which makes its application to less-resourced languages questionable. Using data from other languages in a multilingual (ML) setup is, however, being applied to solve the problem of data scarcity. We have, therefore, conducted ML E2E ASR experiments for four less-resourced Ethiopian languages using different language and acoustic modeling units. The results of our experiments show that relative Word Error Rate (WER) reductions (over the monolingual E2E systems) of up to 29.83% can be achieved by just using data of two related languages in E2E ASR system training. Moreover, we have noticed that the use of data from less related languages also leads to E2E ASR performance improvements over the use of monolingual data.

Index Terms— Ethiopian Languages, End-to-End ASR, Deep Neural Networks, Modeling Units

1. INTRODUCTION

Artificial Neural Networks (ANNs) have been used in Automatic Speech Recognition (ASR) for about five decades, but it is only since 2009 that their use has resulted in dramatic improvements in ASR performance. Numerous studies showed that hybrid Hidden Markov Model-Deep Neural Network (HMM-DNN) systems outperform the previously dominant Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) systems on the same data [1]. Similarly, we have achieved ASR performance improvements for four Ethiopian languages (Amharic, Oromo, Tigrigna and Wolaytta) by using hybrid HMM-DNN models [2] instead of HMM-GMM based ASR systems.

Recently, the End-to-End (E2E) framework, which directly maps a sequence of input acoustic features into a sequence of graphemes or words, has become an active research area, since it avoids the explicit use of the linguistic knowledge (in the form of a pronunciation dictionary and a language model) required in traditional ASR systems. Since the performance of E2E systems could not surpass the hybrid HMM-DNN approach, researchers [3, 4] have tried to improve it by integrating a language model. Even so, E2E ASR systems remain advantageous as they do not require a pronunciation dictionary, which demands linguistic expertise and is expensive to prepare. Thus, the E2E framework is attractive for less-resourced languages because a pronunciation dictionary, one of the important resources/components of both HMM-DNN and HMM-GMM systems, is not required.

However, like other DNN approaches, E2E is training-data greedy. This hampers its application to less-resourced languages that do not have the speech and text data required for the development of speech processing applications. The problem of training data scarcity has, however, been addressed by using data from other languages in a ML approach.

ML ASR (MLASR) systems are useful in a number of ways, including the development of language-agnostic speech technologies [5]. They are particularly interesting for less-resourced languages, where training data for the development of ASR systems are sparse or not available at all [6]. Consequently, various studies in MLASR [7, 8, 9, 10, 11, 12, 13, 2, 14, 15] have been conducted for several language groups. The research trend shows that the use of DNNs results in better methods for the development of MLASR systems [16, 17, 18]. ML systems in the E2E framework have also been, and are being, investigated [19, 20, 21, 22]. The challenge in using the E2E approach for the development of MLASR is finding common modeling units for the involved languages, since characters and words are not shared across many languages. Researchers have, therefore, investigated the use of alternative units such as bytes [19].

In this paper, we present ML E2E ASR experiments conducted for four less-resourced Ethiopian languages, namely Amharic, Oromo, Tigrigna and Wolaytta. To our knowledge, such a ML E2E investigation has not been conducted for these languages before, so the performance of ML E2E ASR systems for these languages is not known.
In our experiments, we investigated the use of different modeling units (characters and phones) in ML E2E acoustic modeling and the use of speech data from closely related languages as well as from GlobalPhone [23] languages, considering the four Ethiopian languages as targets. Moreover, we have experimented with combining text data of related languages for ML language model (LM) training.

The next section provides a description of the four Ethiopian languages, which are the target languages in our experiments. Section 3 describes the data used in our experiments. Section 4 presents the results of the baseline monolingual E2E systems using characters and phones as modeling units for the four target languages. In Section 5, the results of the ML E2E systems are presented. Finally, Section 6 provides conclusions and a way forward.

2. TARGET LANGUAGES

In our experiments, we have considered four Ethiopian languages (Amharic, Oromo, Tigrigna and Wolaytta) as targets. Amharic and Tigrigna are from the Semitic language family, while Oromo and Wolaytta are from the Cushitic and Omotic language families, respectively.

Amharic and Tigrigna are written in the Ethiopic script known as fidäl. It is syllabic: each symbol represents a consonant and a vowel, and each of the core consonants has seven shapes or orders according to the vowel combined with it. The writing system of Oromo and Wolaytta uses the Latin script. In these languages, almost all consonants can occur in geminated and non-geminated forms, and current writers differentiate geminated from non-geminated consonants as well as long from short vowels. The numbers of graphemes covered in the corpora (without counting the geminated forms) are 233 (AMH2005) and 225 (AMH2020) for the Amharic corpora, and 247, 26 and 27 for Tigrigna, Oromo and Wolaytta, respectively.

Amharic and Tigrigna have 28 and 31 consonants, respectively, and both have the same 7 vowels; the Amharic phones are a subset of the Tigrigna phones. Similarly, Oromo and Wolaytta use the same 5 vowels, which come in long and short variants, while they have 28 and 26 consonants, respectively. Although they belong to different language families, Oromo and Wolaytta also share a number of consonants.

3. THE CORPORA

We used five corpora (two for Amharic) of the four target languages. In addition, GlobalPhone, a ML database of speech and text for 22 languages [23], has been used as source data. The following is a brief description of the corpora used in our experiments.

3.1. Ethiopian languages corpora

Read speech corpora of the four Ethiopian languages (Amharic, Tigrigna, Oromo and Wolaytta) are used in our experiments. For Amharic, we have used two corpora: AMH2005 and AMH2020. AMH2005 [24] is a read speech corpus that contains 20 hours of training speech (11k utterances), with development (dev) and test sets read by 20 other speakers (10 each). The domain of this corpus is broadcast news, and the recording was done in a noise-free environment. Moreover, the maximum length of the utterances is limited to 20 words.

AMH2020, together with the corpora of the other three Ethiopian languages, has been collected in Ethiopia [25]. The AMH2020, Tigrigna and Oromo speech corpora each consist of speech from 98 speakers, while the Wolaytta corpus is read by 85 speakers. For each of these corpora, development and evaluation sets (speech of 4 speakers per set, ranging from 1 hour to 1.7 hours) have been held out from the total recordings. The training speech sizes are 24, 22.1, 22.8 and 29.7 hours for AMH2020, Tigrigna, Oromo and Wolaytta, respectively. The domain of these corpora includes broadcast news, the Bible, other religious books, etc. The recordings were made using smartphones in different environments and, as a result, are not as clean as the AMH2005 corpus. In addition, no limit was placed on the maximum utterance length when sentences were selected. For details on the Ethiopian languages corpora, we direct readers to [25].

We have used the dev and test sets of these speech corpora for tuning and testing, respectively. LM weights are tuned using the transcriptions of the dev set of each speech corpus, which consist of 760 sentences for AMH2005, 507 for AMH2020, and 511, 505 and 553 sentences for the Tigrigna, Oromo and Wolaytta corpora, respectively.

3.2. GlobalPhone

GlobalPhone (GP) is a ML corpus consisting of speech data, corresponding transcriptions and pronunciation dictionaries covering the vocabulary of the transcripts. Currently, the GP corpus covers 22 languages: Arabic (modern standard), Bulgarian, Chinese (Mandarin and Shanghai), Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese. A detailed description of GP can be found in [23].

4. BASELINE MONOLINGUAL E2E ASR SYSTEMS

As baselines in the E2E framework, we have developed monolingual E2E ASR systems for each of the target languages (Amharic, Oromo, Tigrigna and Wolaytta) using the training corpus of each language and the ESPRESSO E2E ASR toolkit. ESPRESSO is an open-source, modular, extensible E2E neural ASR toolkit built on the PyTorch deep learning libraries and the FAIRSEQ toolkit [26].

The literature shows that the performance of a pure E2E ASR model, i.e. one without an external LM component, is far from satisfactory [3, 4]. Consequently, ESPRESSO provides the possibility of external LM integration through shallow fusion. It supports the integration of three types of neural LMs: character-based, look-ahead word-based, and multi-level (a combination of character-based and word-based) LMs [26]. We used character-based, phone-based (when phones are used in acoustic modeling) and word-based LMs.
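In shallow fusion, the external LM enters only at decoding time: each beam-search hypothesis is scored by the E2E model's log-probability plus a weighted LM log-probability. The following minimal PyTorch sketch illustrates the scoring rule; it is an illustration of the technique under our own naming, not ESPRESSO's actual code.

```python
import torch

def shallow_fusion_scores(asr_log_probs: torch.Tensor,
                          lm_log_probs: torch.Tensor,
                          lm_weight: float = 0.5) -> torch.Tensor:
    """Combine per-step scores over the same label set (chars or phones).

    asr_log_probs, lm_log_probs: (beam, vocab) log-probabilities.
    lm_weight: the fusion weight tuned on the dev sets (0.5 was found
    best for character/phone LMs in this paper, 0.2 for word LMs).
    """
    return asr_log_probs + lm_weight * lm_log_probs
```

During beam search, the decoder keeps the highest-scoring label extensions under these fused scores at every step, so a hypothesis that both the acoustic evidence and the LM find plausible wins.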
For language modeling, we have used text corpora of different sizes obtained from the web, except for Wolaytta. Since we found no text resources on the web for Wolaytta, only its training transcription has been used to train LMs. For Amharic, Tigrigna, Oromo and Wolaytta we have used text data consisting of about 4M, 4M, 1.2M and 226K word tokens, respectively. These text data are used for language modeling in the ML experiments too.

For each of the languages, we have developed Long Short-Term Memory (LSTM) based LMs following the wsj recipe provided in ESPRESSO, with the default hyper-parameters. For the word-based LMs, the number of decoder layers is 3, and the vocabulary size is 32.5K for Amharic and Tigrigna and 21K and 25K for Oromo and Wolaytta, respectively. We tried to use larger vocabularies but were not successful due to memory problems¹. The hidden and embedding dimensions are 1200 each. For the character- as well as phone-based LMs, the number of decoder layers is 2, and the embedding and hidden dimensions are 48 and 650, respectively.
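To make these dimensions concrete, here is a minimal PyTorch sketch of a character/phone-level LSTM LM with the shape just described (2 layers, embedding 48, hidden 650). It mirrors the reported architecture only at the level of layer sizes; the training loop and other details of the ESPRESSO recipe are omitted, and the class name is our own.

```python
import torch.nn as nn

class LstmLM(nn.Module):
    """Next-token LM: embedding -> stacked LSTM -> softmax projection."""

    def __init__(self, vocab_size: int, embed_dim: int = 48,
                 hidden_dim: int = 650, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, time) integer IDs of characters or phones.
        out, state = self.lstm(self.embed(tokens), state)
        return self.proj(out), state  # logits over the next token
```

For the word-based LMs described above, the corresponding shape would be 3 layers with embedding and hidden dimensions of 1200.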
For the phone-based LMs as well as for E2E acoustic modeling, we converted the training transcriptions and the LM text to phone-based representations as a pre-processing step. In this task, we used the pronunciation dictionary of the respective corpus (both target and source corpora) to convert the grapheme-based speech transcriptions (of the training, dev and test sets) and the LM training texts of the target languages to their phone representations.
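This grapheme-to-phone conversion is essentially a dictionary lookup over the transcriptions. A minimal sketch follows; the lexicon file format (a word followed by its phones, one entry per line) and the strict out-of-vocabulary handling are assumptions made for illustration, not details taken from the paper.

```python
def load_lexicon(path: str) -> dict:
    """Read a pronunciation dictionary: one 'word phone phone ...' per line."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *phones = line.split()
            lexicon[word] = phones
    return lexicon

def to_phone_transcription(text: str, lexicon: dict) -> str:
    """Replace every word of a transcription by its phone sequence."""
    phones = []
    for word in text.split():
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(lexicon[word])
    return " ".join(phones)
```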
The wsj recipe for the encoder-decoder model, with the default hyper-parameters, has been used to train the acoustic models. We used 3 encoder and 3 decoder layers; the embedding dimension is 48 and the hidden dimension is 320. During decoding, we experimented with different beam sizes and found that the best beam size is 50. We also experimented with different LM fusion weights and chose the weight that leads to the lowest WER for the majority of the corpora: 0.2 for the word-based LMs and 0.5 for the character- and phone-based LMs. Table 1 presents the Character Error Rate (CER), Phone Error Rate (PER) and WER of the monolingual E2E ASR systems for each of the languages. For comparison purposes, we also present the hybrid HMM-DNN WERs reported in [2], which used triphone-based acoustic models and word-based LMs. Although we present the CER and PER of all the E2E ASR systems, all comparisons and analyses are made on the basis of WER, as it is the standard ASR evaluation metric.
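For reference, WER is the word-level Levenshtein (edit) distance between the reference and the hypothesis, normalized by the reference length; CER and PER apply the same distance over characters and phones. A small self-contained sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one DP row)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent; tokenize into chars/phones for CER/PER."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```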
Table 1. CER, PER and WER of monolingual E2E ASR (HMM-DNN WERs from [2] for comparison)

Language/Corpus  HMM-DNN   Word LM        Char LM        Phone LM
                 WER       CER    WER     CER    WER     PER    WER
AMH2005          23.05     9.29   26.28   7.63   19.81   5.16   19.05
AMH2020          -         -      -       14.65  36.43   8.84   29.70
TIR              26.94     10.32  27.27   9.48   25.55   6.18   22.18
ORM              32.28     10.11  32.13   9.53   30.36   10.83  30.36
WAL              23.23     9.23   25.81   8.64   23.35   9.24   24.11

As can be seen from Table 1, the E2E systems fused with word-based LMs did not perform better than the HMM-DNN based systems for any of the languages. However, when character-based LMs are fused, the E2E ASR systems outperform the HMM-DNN based systems, except for Wolaytta. This is due to the morphological complexity of the languages, which increases the out-of-vocabulary (OOV) rate of the word-based LMs, a problem that character-based LMs avoid. For Wolaytta, since the training text is very small (we used only the training transcription, consisting of 226K tokens), the character-based LM did not bring improvements over the HMM-DNN system or the E2E system with a word-based LM. Using phones in both acoustic and language modeling led to performance improvements over the character-based systems for Amharic and Tigrigna. This is attributed to the fact that the Amharic and Tigrigna characters are CV syllables, which affects the number of training examples available per unit. Since the word-based LMs perform poorly compared to the character- and phone-based ones, we used character- and phone-based models in all other experiments.

¹We used machines with a total RAM of 251 GB.

5. MULTILINGUAL E2E ASR SYSTEMS

The first ML E2E ASR experiment we conducted uses data from two related languages. Since Amharic and Tigrigna are related and use the same writing system, we developed a ML E2E ASR system by combining the data of these two languages. Similarly, since Oromo and Wolaytta are related and use a similar writing system, we developed a ML E2E ASR system for these two languages by combining their data. These systems are referred to as ML2, where ML stands for multilingual and 2 for the number of languages whose data are used in E2E ASR system training. We did preliminary experiments with character-based ML2 E2E systems that combine LM training data only, speech data only, and both LM and speech data, decoding the Tigrigna and Wolaytta test speech. ASR performance improvements were obtained only when we combined speech data. Therefore, the remaining ML E2E experiments were conducted by combining only the speech data of the involved languages. Table 2 presents the results of both the character- and phone-based ML2 systems. In these systems, for Amharic and Tigrigna, three speech corpora (two Amharic + one Tigrigna) are used to train the ML character- and phone-based E2E acoustic models, and the test speech of each corpus is decoded. Likewise, the Oromo and Wolaytta speech data are used in training, and the test speech of each language is decoded. The LMs fused during decoding are those developed using the LM training text of each language.
Table 2. CER, PER and WER of ML2 E2E ASR

Language/Corpus  Character-based   Phone-based
                 CER     WER       PER     WER
AMH2005          4.23    13.90     3.28    13.63
AMH2020          10.81   29.58     7.98    27.81
TIR              8.21    23.00     5.30    20.91
ORM              9.41    29.12     10.13   29.01
WAL              8.3     22.66     8.7     21.04

As can be seen from Table 2, the ML2 E2E systems led to lower WER than the monolingual E2E systems. Generally, relative WER improvements ranging from 2.96% to 29.83% have been obtained with the character-based ML2 E2E systems. The highest relative WER improvement is for the AMH2005 corpus, while the lowest is for Wolaytta, which has a relatively bigger speech corpus than the others. The relative WER improvements for phone-based ML2 E2E ASR range from 4.45% for Oromo to 28.45% for AMH2005.
Since using speech data of two languages in E2E ASR training brought improvements in ASR performance, we then experimented with using the speech data of all four Ethiopian languages in E2E ASR system training. We call these systems ML4, following the naming convention described above. Since the writing systems used by the languages are different, we converted the training transcriptions and LM training texts of the languages to phone-based representations so as to use data from all the languages. To enable the development of ML speech processing, the phone names are made consistent across languages using the ML phone representation we developed in [15]. Table 3 presents the PER and WER of the ML4 E2E ASR systems. As can be seen from the table, WER reductions over the ML2 systems have been obtained for three (AMH2020, Oromo and Wolaytta) of the five corpora as a result of using all the corpora in E2E ASR system training. Though we did not get WER improvements for Tigrigna and AMH2005, the degradation in their performance (from the ML2 systems) is statistically insignificant. Compared to the monolingual E2E systems, however, the ML4 systems brought WER reductions for all the corpora, with relative improvements ranging from 4.69% for Tigrigna to 26.19% for AMH2005.
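The phone-name harmonization described above can be pictured as a lookup from language-specific phone symbols to a shared ML inventory. The sketch below only illustrates the idea; the phone symbols are invented placeholders, and the real mapping is the one developed in [15].

```python
# Hypothetical entries: (language, local phone symbol) -> shared ML phone.
ML_PHONE_MAP = {
    ("amharic", "k'"): "M_kx",   # placeholder ejective mapped to shared symbol
    ("tigrigna", "k'"): "M_kx",  # same sound in a related language
    ("oromo", "k'"): "M_kx",
}

def harmonize(lang: str, phones: list) -> list:
    """Rename language-specific phones; keep symbols that are already shared."""
    return [ML_PHONE_MAP.get((lang, p), p) for p in phones]
```

With a consistent inventory, utterances from all languages can be pooled into one training set for a single phone-level E2E model.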
As the use of speech data from the four Ethiopian languages brought WER improvements, we experimented with using more data from other languages. For this purpose, we used GlobalPhone, which provides speech and text data for 22 languages as described in Section 3, together with the speech data of the four Ethiopian languages. As in the ML4 E2E experiments, we converted the transcriptions of the GlobalPhone languages to phone-based representations, taking advantage of the ML phone representations we developed in [15] and the pronunciation dictionaries available for all the languages. However, the conversion to phone transcriptions could not be completed for Arabic, Thai and Japanese due to the mixed encoding used in their transcriptions; thus, the speech data of these languages could not be used in ML E2E training. We have, therefore, used data of the remaining 19 GlobalPhone languages together with the data of the 4 Ethiopian languages. The ML E2E system trained on data from these languages is referred to as ML23, since data from 23 languages are used in training. The target languages, however, are always the four Ethiopian languages. Table 3 shows the PER and WER of ML23 for the target languages.

Table 3. PER and WER of ML4 and ML23 E2E ASR

Language/Corpus  ML4              ML23
                 PER     WER      PER     WER
AMH2005          3.26    14.06    3.12    13.44
AMH2020          7.26    26.16    7.67    29.34
TIR              5.12    21.14    5.49    22.43
ORM              8.7     27.37    9.17    28.86
WAL              6.5     18.51    6.63    19.91

As shown in Table 3, the ML23 system did not bring WER reductions over the ML4 E2E systems for the majority of the target corpora; the exception is the AMH2005 corpus, for which a 4.41% relative WER reduction has been obtained. This might be due to the similarity in domain between the source corpora and this target corpus: the domain of AMH2005 and GlobalPhone is news, whereas the domain of the other Ethiopian corpora is mixed (news, the Bible, other religious books, etc.). Compared to the phone-based ML2 systems, ML23 resulted in WER reductions for AMH2005, Oromo and Wolaytta. Moreover, the ML23 systems outperformed all the monolingual systems except for Tigrigna, for which an insignificant performance degradation (a 0.25 absolute WER increase) is observed. The results of our experiments generally confirm that using data of related languages in ML ASR training leads to greater performance improvements, while using data from any language in a ML setup is always more advantageous than using monolingual data only.

6. CONCLUSIONS

In this paper, we presented ML E2E ASR experiments for four Ethiopian target languages for which no E2E ASR experiments had been conducted previously. In our experiments, we used different language (word, character and phone) and acoustic (character and phone) modeling units. Character and phone units are generally better than small-vocabulary words for language modeling. For acoustic modeling, phone-based models are better than character-based ones for most of the languages. Although combining only LM training text, or both training speech and LM training text, did not lead to performance improvements, combining training speech alone from related and even less related languages resulted in performance improvements over monolingual E2E systems. We propose extending this research to more target languages and conducting experiments with the E2E approach at different levels of data scarcity.
7. REFERENCES

[1] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al., "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, 2012.

[2] Solomon Teferra Abate, Martha Yifiru Tachbelie, and Tanja Schultz, "Deep neural networks based automatic speech recognition for four Ethiopian languages," in ICASSP, 2020.

[3] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, "RWTH ASR systems for LibriSpeech: Hybrid vs attention," in Proc. Interspeech 2019, 2019.

[4] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, "An analysis of incorporating an external language model into a sequence-to-sequence model," in ICASSP, 2018.

[5] A. Datta, B. Ramabhadran, J. Emond, A. Kannan, and B. Roark, "Language-agnostic multilingual modeling," in ICASSP, 2020, pp. 8239–8243.

[6] Tanja Schultz and Alex Waibel, "Language-independent and language-adaptive acoustic modeling for speech recognition," Speech Communication, vol. 35, no. 1-2, pp. 31–51, Aug. 2001.

[7] Fuliang Weng, Harry Bratt, Leonardo Neumeyer, and Andreas Stolcke, "A study of multilingual speech recognition," in EUROSPEECH, 1997.

[8] T. Schultz and A. Waibel, "Multilingual and crosslingual speech recognition," in Proc. DARPA Workshop on Broadcast News Transcription and Understanding, 1998, pp. 259–262.

[9] Tanja Schultz, "GlobalPhone: A multilingual speech and text database developed at Karlsruhe University," in INTERSPEECH, 2002.

[10] S. Kanthak and H. Ney, "Multilingual acoustic modeling using graphemes," in EUROSPEECH, 2003.

[11] Ngoc Thang Vu, David Imseng, Daniel Povey, Petr Motlíček, Tanja Schultz, and Hervé Bourlard, "Multilingual deep neural network based acoustic modeling for rapid language adaptation," in ICASSP, 2014.

[12] Markus Müller and Alex H. Waibel, "Using language adaptive deep neural networks for improved multilingual speech recognition," 2015.

[13] Ekapol Chuangsuwanich, Multilingual Techniques for Low Resource Automatic Speech Recognition, Ph.D. thesis, 2016.

[14] Solomon Teferra Abate, Martha Yifiru Tachbelie, and Tanja Schultz, "Multilingual acoustic and language modeling for Ethio-Semitic languages," in INTERSPEECH, 2020.

[15] Martha Yifiru Tachbelie, Solomon Teferra Abate, and Tanja Schultz, "Development of multilingual ASR using GlobalPhone for less-resourced languages: The case of Ethiopian languages," in INTERSPEECH, 2020.

[16] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in ICASSP, 2013, pp. 8619–8623.

[17] Xinjian Li, Siddharth Dalmia, Alan Black, and Florian Metze, "Multilingual speech recognition with corpus relatedness sampling," 2019.

[18] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, 2010.

[19] Bo Li, Yu Zhang, T. Sainath, Yonghui Wu, and William Chan, "Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes," in ICASSP, 2019, pp. 5621–5625.

[20] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, "Multilingual speech recognition with a single end-to-end model," in ICASSP, 2018, pp. 4904–4908.

[21] J. Cho, M. Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, M. Karafiát, S. Watanabe, and T. Hori, "Multilingual sequence-to-sequence speech recognition: Architecture, transfer learning, and language modeling," in IEEE SLT, 2018.

[22] S. Zhou, S. Xu, and Bo Xu, "Multilingual end-to-end speech recognition with a single transformer on low-resource languages," arXiv:1806.05059, 2018.

[23] Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, "GlobalPhone: A multilingual text and speech database in 20 languages," in ICASSP, 2013.

[24] Solomon Teferra Abate, Wolfgang Menzel, and Bahiru Tafila, "An Amharic speech corpus for large vocabulary continuous speech recognition," in INTERSPEECH, 2005.

[25] Solomon Teferra Abate, Martha Yifiru Tachbelie, Michael Melese, Hafte Abera, Tewodros Abebe, Wondwossen Mulugeta, Yaregal Assabie, Million Meshesha, Solomon Atinafu, and Binyam Ephrem, "Large vocabulary read speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo and Wolaytta," in LREC, 2020.

[26] Y. Wang, T. Chen, H. Xu, S. Ding, H. Lv, Y. Shao, N. Peng, L. Xie, S. Watanabe, and S. Khudanpur, "Espresso: A fast end-to-end neural speech recognition toolkit," in IEEE ASRU, 2019.
