Conformer

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han,
Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

Google Inc.
{anmolgulati, jamesqin, chungchengc, nikip, ngyuzh, jiahuiyu, weihan, shibow, zhangzd, yonghui, rpang}@google.com
Abstract

networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global de-
1. Introduction

ment on the testother dataset with an external language model. We present three models based on model parameter limit constraints of 10M, 30M, and 118M. Our 10M model shows an improvement when compared to similar sized contemporary work [10] with 2.7%/6.3% on test/testother datasets. Our medium 30M-parameter model already outperforms the transformer transducer published in [7], which uses 139M model parameters. With the big 118M parameter model, we are able to achieve 2.1%/4.3% without using language models and 1.9%/3.9% with an external language model.

We further carefully study the effects of the number of attention heads, convolution kernel sizes, activation functions, placement of feed-forward layers, and different strategies of adding convolution modules to a Transformer-based network, and shed light on how each contributes to the accuracy improvements.
2. Conformer Encoder

Our audio encoder first processes the input with a convolution subsampling layer and then with a number of conformer blocks, as illustrated in Figure 1. The distinctive feature of our model is the use of Conformer blocks in place of Transformer blocks as in [7, 19].

Figure 1: Conformer encoder model architecture.

A conformer block is composed of four modules stacked together, i.e., a feed-forward module, a self-attention module, a convolution module, and a second feed-forward module in the end. Sections 2.1, 2.2, and 2.3 introduce the self-attention, convolution, and feed-forward modules, respectively. Finally, Section 2.4 describes how these sub-blocks are combined.
2.1. Multi-Headed Self-Attention Module

We employ multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL [20], the relative sinusoidal positional encoding scheme. The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to the variance of the utterance length. We use pre-norm residual units [21, 22] with dropout, which helps training and regularizing deeper models. Figure 3 illustrates the multi-headed self-attention block.

Figure 3: Multi-Headed self-attention module. We use multi-headed self-attention with relative positional embedding in a pre-norm residual unit.
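For concreteness, the pre-norm residual unit of Figure 3 can be sketched in PyTorch as below. This is not the authors' implementation: for brevity it uses PyTorch's stock nn.MultiheadAttention with absolute positions, whereas the paper's module uses Transformer-XL style relative sinusoidal positional encoding; the class and argument names (MHSAModule, d_model, num_heads) are illustrative, and the residual addition is left to the enclosing Conformer block of Eq. (1).

```python
import torch
import torch.nn as nn

class MHSAModule(nn.Module):
    """Pre-norm multi-headed self-attention unit (sketch).

    Layernorm -> multi-head attention -> dropout; the residual connection
    is added by the enclosing Conformer block (Eq. 1).  Uses absolute
    positions for brevity; the paper uses Transformer-XL style relative
    sinusoidal positional encoding instead.
    """
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        return self.dropout(y)
```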
2.2. Convolution Module

Inspired by [17], the convolution module starts with a gating mechanism [23]: a pointwise convolution and a gated linear unit (GLU). This is followed by a single 1-D depthwise convolution layer. Batchnorm is deployed just after the convolution to aid training deep models. Figure 2 illustrates the convolution block.

Figure 2: Convolution module. The convolution module contains a pointwise convolution with an expansion factor of 2 projecting the number of channels, with a GLU activation layer, followed by a 1-D depthwise convolution. The 1-D depthwise conv is followed by a Batchnorm and then a swish activation layer.
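A corresponding PyTorch sketch of the convolution module is given below. The leading layernorm, the final pointwise projection back to the model dimension, and the dropout are assumptions not spelled out in the text above; the kernel size of 32 is taken from Table 1, and the residual connection is again left to the enclosing block.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Conformer convolution module (sketch).

    Pointwise conv (2x channel expansion) -> GLU -> 1-D depthwise conv ->
    Batchnorm -> Swish, as described in Sec. 2.2 / Figure 2.  The leading
    layernorm, final pointwise projection, and dropout are assumptions;
    the residual is added by the enclosing block.
    """
    def __init__(self, d_model: int, kernel_size: int = 32, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)                  # gate along the channel dim
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding="same", groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()                    # Swish activation
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); Conv1d expects (batch, channels, time)
        y = self.norm(x).transpose(1, 2)
        y = self.glu(self.pointwise1(y))          # back to d_model channels
        y = self.swish(self.bn(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)
```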
2.3. Feed Forward Module

The Transformer architecture as proposed in [6] deploys a feed-forward module after the MHSA layer, composed of two linear transformations with a nonlinear activation in between. A residual connection is added over the feed-forward layers, followed by layer normalization. This structure is also adopted by Transformer ASR models [7, 24].

We follow pre-norm residual units [21, 22] and apply layer normalization within the residual unit and on the input before the first linear layer. We also apply Swish activation [25] and dropout, which helps regularizing the network. Figure 4 illustrates the Feed Forward (FFN) module.

Figure 4: Feed forward module. The first linear layer uses an expansion factor of 4 and the second linear layer projects it back to the model dimension. We use swish activation and pre-norm residual units in the feed forward module.
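The feed-forward module then amounts to the short sketch below (pre-norm, expansion factor 4, Swish, dropout, as described above); the half-step residual weight of Eq. (1) is applied by the enclosing block, not here, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class FeedForwardModule(nn.Module):
    """Conformer feed-forward module (sketch).

    Layernorm -> Linear(d, 4d) -> Swish -> Dropout -> Linear(4d, d) -> Dropout.
    The half-step (1/2) residual weight of Eq. (1) is applied by the block.
    """
    def __init__(self, d_model: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),                      # Swish activation
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        return self.net(x)
```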
2.4. Conformer Block

Our proposed Conformer block contains two Feed Forward modules sandwiching the Multi-Headed Self-Attention module and the Convolution module, as shown in Figure 1.

This sandwich structure is inspired by Macaron-Net [18], which proposes replacing the original feed-forward layer in the Transformer block with two half-step feed-forward layers, one before the attention layer and one after. As in Macaron-Net, we employ half-step residual weights in our feed-forward (FFN) modules. The second feed-forward module is followed by a final layernorm layer. Mathematically, this means that for input x_i to a Conformer block i, the output y_i of the block is:

    \tilde{x}_i = x_i + \tfrac{1}{2} \mathrm{FFN}(x_i)
    x'_i = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)
    x''_i = x'_i + \mathrm{Conv}(x'_i)
    y_i = \mathrm{Layernorm}\big(x''_i + \tfrac{1}{2} \mathrm{FFN}(x''_i)\big)        (1)

where FFN refers to the Feed Forward module, MHSA refers to the Multi-Head Self-Attention module, and Conv refers to the Convolution module as described in the preceding sections.

Our ablation study discussed in Sec 3.4.3 compares the Macaron-style half-step FFNs with the vanilla FFN as used in previous works. We find that having two Macaron-net style feed-forward layers with half-step residual connections sandwiching the attention and convolution modules provides a significant improvement over having a single feed-forward module in our Conformer architecture.

The combination of convolution and self-attention has been studied before, and one can imagine many ways to achieve that. Different options of augmenting convolutions with self-attention are studied in Sec 3.4.2. We found that the convolution module stacked after the self-attention module works best for speech recognition.
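Equation (1) maps directly onto code. The sketch below reuses the illustrative module classes from the previous sections (FeedForwardModule, MHSAModule, ConvModule); it is a minimal reading of Eq. (1), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer block implementing Eq. (1) (sketch)."""
    def __init__(self, d_model: int, num_heads: int,
                 kernel_size: int = 32, dropout: float = 0.1):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model, dropout=dropout)
        self.mhsa = MHSAModule(d_model, num_heads, dropout=dropout)
        self.conv = ConvModule(d_model, kernel_size, dropout=dropout)
        self.ffn2 = FeedForwardModule(d_model, dropout=dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ffn1(x)    # x~_i  = x_i  + 1/2 FFN(x_i)
        x = x + self.mhsa(x)          # x'_i  = x~_i + MHSA(x~_i)
        x = x + self.conv(x)          # x''_i = x'_i + Conv(x'_i)
        x = x + 0.5 * self.ffn2(x)    # second half-step FFN
        return self.norm(x)           # y_i = Layernorm(...)
```

For instance, ConformerBlock(256, 4) matches the Conformer (M) encoder dimension and attention-head count in Table 1 and maps a (batch, time, 256) tensor to the same shape.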
3. Experiments

3.1. Data

We evaluate the proposed model on the LibriSpeech [26] dataset, which consists of 970 hours of labeled speech and an additional 800M word token text-only corpus for building the language model. We extracted 80-channel filterbank features computed from a 25 ms window with a stride of 10 ms. We use SpecAugment [27, 28] with mask parameter (F = 27), and ten time masks with maximum time-mask ratio (p_S = 0.05), where the maximum size of the time mask is set to p_S times the length of the utterance.
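To make these masking parameters concrete, here is a small NumPy sketch of frequency masking and the adaptive time masking described above. It is only an approximation, not the SpecAugment implementation used in the paper; in particular, the number of frequency masks is not stated in the text, so a single mask is assumed, and masked regions are simply zeroed.

```python
import numpy as np

def specaugment(features: np.ndarray,
                freq_mask_param: int = 27,
                num_time_masks: int = 10,
                time_mask_ratio: float = 0.05) -> np.ndarray:
    """Apply one frequency mask and several adaptive time masks (sketch).

    features: (num_frames, num_mel_bins) log-mel filterbank features.
    """
    x = features.copy()
    num_frames, num_bins = x.shape

    # Frequency mask: zero out f consecutive bins, f ~ U[0, F].
    f = np.random.randint(0, freq_mask_param + 1)
    f0 = np.random.randint(0, max(1, num_bins - f + 1))
    x[:, f0:f0 + f] = 0.0

    # Adaptive time masks: maximum width is p_S * utterance length.
    max_t = int(time_mask_ratio * num_frames)
    for _ in range(num_time_masks):
        t = np.random.randint(0, max_t + 1)
        t0 = np.random.randint(0, max(1, num_frames - t + 1))
        x[t0:t0 + t, :] = 0.0
    return x
```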
3.2. Conformer Transducer

We identify three models, small, medium, and large, with 10M, 30M, and 118M params, respectively, by sweeping different combinations of network depth, model dimensions, and number of attention heads, and choosing the best performing one within the model parameter size constraints. We use a single-LSTM-layer decoder in all our models. Table 1 describes their architecture hyper-parameters.

Table 1: Model hyper-parameters for Conformer S, M, and L models, found via sweeping different combinations and choosing the best performing models within the parameter limits.

    Model               Conformer (S)   Conformer (M)   Conformer (L)
    Num Params (M)           10.3            30.7           118.8
    Encoder Layers             16              16              17
    Encoder Dim               144             256             512
    Attention Heads             4               4               8
    Conv Kernel Size           32              32              32
    Decoder Layers              1               1               1
    Decoder Dim               320             640             640

For regularization, we apply dropout [29] in each residual unit of the conformer, i.e., to the output of each module, before it is added to the module input. We use a rate of P_drop = 0.1. Variational noise [5, 30] is introduced to the model as a regularization. An ℓ2 regularization with 1e-6 weight is also added to all the trainable weights in the network. We train the models with the Adam optimizer [31] with β1 = 0.9, β2 = 0.98 and ε = 10^-9, and a transformer learning rate schedule [6] with 10k warm-up steps and peak learning rate 0.05/√d, where d is the model dimension of the conformer encoder.
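One common parameterization of this schedule that is consistent with the stated numbers (linear warm-up to a peak of 0.05/√d at 10k steps, then inverse-square-root decay, as in [6]) is sketched below; the exact form used by the authors may differ.

```python
import math

def transformer_lr(step: int, d_model: int = 512,
                   warmup_steps: int = 10_000,
                   peak_factor: float = 0.05) -> float:
    """Transformer LR schedule (sketch): linear warm-up, then 1/sqrt(step) decay.

    The peak learning rate peak_factor / sqrt(d_model) is reached at
    step == warmup_steps.  d_model = 512 is the Conformer (L) encoder
    dimension from Table 1.
    """
    step = max(step, 1)
    peak = peak_factor / math.sqrt(d_model)
    return peak * min(step / warmup_steps,
                      math.sqrt(warmup_steps / step))

# Example: with d_model = 512 the peak LR is 0.05 / sqrt(512) ≈ 2.2e-3
# at step 10k, decaying proportionally to 1/sqrt(step) afterwards.
```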
We use a 3-layer LSTM language model (LM) with width 4096 trained on the LibriSpeech language model corpus with

Table 2: Comparison of Conformer with recent published models. Our model shows improvements consistently over various model parameter size constraints. At 10.3M parameters, our model is 0.7% better on testother when compared to contemporary work, ContextNet(S) [10]. At 30.7M model parameters our model already significantly outperforms the previously published state-of-the-art results of Transformer Transducer [7] with 139M parameters.

    Method                 #Params (M)   WER Without LM         WER With LM
                                         testclean  testother   testclean  testother
    Hybrid
      Transformer [33]          -            -          -         2.26       4.85
    CTC
      QuartzNet [9]            19          3.90      11.28        2.69       7.25
    LAS
      Transformer [34]        270          2.89       6.98        2.33       5.17
      Transformer [19]          -          2.2        5.6         2.6        5.7
      LSTM                    360          2.6        6.0         2.2        5.2
    Transducer
      Transformer [7]         139          2.4        5.6         2.0        4.6
      ContextNet(S) [10]     10.8          2.9        7.0         2.3        5.5
      ContextNet(M) [10]     31.4          2.4        5.4         2.0        4.5
      ContextNet(L) [10]    112.7          2.1        4.6         1.9        4.1
    Conformer (Ours)
      Conformer(S)           10.3          2.7        6.3         2.1        5.0
      Conformer(M)           30.7          2.3        5.0         2.0        4.3
      Conformer(L)          118.8          2.1        4.3         1.9        3.9