Tarver 2019
Abstract—Digital predistortion is the process of using digital signal processing to correct nonlinearities caused by the analog RF front-end of a wireless transmitter. These nonlinearities contribute to adjacent channel leakage, degrade the error vector magnitude of transmitted signals, and often force the transmitter to reduce its transmission power into a more linear but less power-efficient region of the device. Most predistortion techniques are based on polynomial models with an indirect learning architecture, which has been shown to be overly sensitive to noise. In this work, we use neural network based predistortion with a novel neural network training method that avoids the indirect learning architecture and that shows significant improvements in both the adjacent channel leakage ratio and error vector magnitude. Moreover, we show that, by using a neural network based predistorter, we are able to achieve a 42% reduction in latency and a 9.6% increase in throughput on an FPGA accelerator with 15% fewer multiplications per sample when compared to a similarly performing memory-polynomial implementation.

Index Terms—Digital predistortion, neural networks, FPGA.

I. INTRODUCTION

Efficiently correcting nonlinearities in power amplifiers (PAs) through digital predistortion (DPD) is critical for enabling next-generation mobile broadband, where there may be multiple radio frequency (RF) transmit (TX) chains arranged to form a massive multiple-input multiple-output (MIMO) system [1], as well as new waveforms with bandwidths on the order of 100 MHz in the case of mmWave communications [2]. Traditional DPDs use variations of the Volterra series [3], such as memory polynomials [4, 5]. These models consist of sums of various-order polynomials and finite impulse response (FIR) filters to model the nonlinearities and the memory effects in a PA, respectively.

To learn the values of the parameters in a polynomial-based model, an indirect learning architecture (ILA) is typically used in conjunction with some variation of a least-squares (LS) fit of the data to the model [5]. In an ILA, a postinverse model of the predistorter is fitted based on the output of the PA [6, 7]. After learning the postinverter, the coefficients are copied to the predistorter. Although this simplifies the learning of DPD coefficients, it has been shown to converge to a biased solution due to noise in the PA output [8, 9]. Moreover, the LS problem is often poorly conditioned [4]. In [10], a mobile graphics processing unit (GPU) was used to implement the polynomial DPD with I/Q imbalance correction from [4]. This GPU implementation used floating point and was able to avoid the challenges associated with the dynamic range requirements for memory polynomials. When implemented on an FPGA, a memory polynomial can be challenging due to the bit-widths that are necessary to perform the high-order exponentiation in fixed-point precision [11].

The overall DPD challenge has strong similarities to the problems encountered in in-band full-duplex (IBFD) communications [12–14], where a transceiver simultaneously transmits and receives on the same frequency, increasing the spectral efficiency of the communication system. However, this requires (among other techniques) digitally removing the significant self-interference from the received signal, which consists not only of the intended transmission but also of the nonlinearities added by the imperfections in the transmit chain, including the PA. In [15], the author used neural networks (NNs) to perform the self-interference cancellation and found that they could achieve similar performance to polynomial-based self-interference cancellation. This work was later extended to create both FPGA and ASIC implementations of the NN-based self-interference canceller [16]. It was found that, due to the regular structure of the NN and the lower bit-width requirements, it can be implemented to have both a higher throughput and a lower resource utilization.

Inspired by the full-duplex NN work and the known problems of polynomial-based predistortion with an ILA, we recently proposed in [17] to use NNs for the forward DPD application. NNs are a natural choice for such an application as they are able to approximate any nonlinear function [18], making them a reasonable candidate for predistortion. The idea of using various NNs for predistortion has been explored in many works [19, 20]. However, the training method is unclear in [19], and their implementations require over ten thousand parameters. In [20], the training of the NN is done using an ILA, which can subject the learned predistorter to the same problems seen with all ILAs.

Contribution: In our previous work [17], we avoided the standard ILA and we improved the overall performance by

The work of C. Tarver and J. R. Cavallaro was supported in part by the U.S. NSF under grants ECCS-1408370, CNS-1717218, and CNS-1827940, for the "PAWR Platform POWDER-RENEW: A Platform for Open Wireless Data-driven Experimental Research with Massive MIMO Capabilities." The work of A. Balatsoukas-Stimming was supported by the Swiss NSF project PZ00P2 179686.
  Authorized licensed use limited to: CMU Libraries - library.cmich.edu. Downloaded on July 08,2020 at 03:53:10 UTC from IEEE Xplore. Restrictions apply.
[Figure 1. Architecture of the NN DPD system. The signal processing is done in the digital baseband and focuses on PA effects. The DAC, up/downconverters, and ADC are not shown in this figure, though their impairments are also captured.]

[Figure 2. General structure of the DPD and PA neural networks. There are two input and output neurons for the real and imaginary parts of the signal, N neurons per hidden layer, and K hidden layers. The inputs are directly added to the output neurons so that the hidden layers concentrate on the nonlinear portion of the signal.]
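The Fig. 2 structure can be made concrete with a short sketch. This is a minimal NumPy illustration of the forward pass, assuming one complex sample at a time; the weight shapes and random values are illustrative only, not the paper's trained model.

```python
import numpy as np

def relu(v):
    # ReLU(x) = max(0, x); in hardware this reduces to a single multiplexer.
    return np.maximum(0.0, v)

def nn_dpd_forward(x, weights, biases):
    """Forward pass of the Fig. 2 predistorter for one complex sample x.

    weights/biases parameterize K hidden layers of N ReLU neurons plus a
    linear output layer with 2 neurons (Re/Im). The linear bypass adds the
    raw input to the output so the hidden layers only model the nonlinearity.
    """
    v = np.array([x.real, x.imag])        # split complex sample onto 2 real neurons
    h = v
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)               # hidden layers with ReLU activation
    out = weights[-1] @ h + biases[-1]    # linear output layer (2 neurons)
    out = out + v                         # linear bypass: input added to the output
    return out[0] + 1j * out[1]

# Toy instantiation (random weights, illustrative only): K = 1, N = 6.
rng = np.random.default_rng(0)
N = 6
weights = [rng.standard_normal((N, 2)) * 0.1, rng.standard_normal((2, N)) * 0.1]
biases = [rng.standard_normal(N) * 0.1, np.zeros(2)]
x_hat = nn_dpd_forward(0.5 + 0.2j, weights, biases)
```

Note how the bypass guarantees that an all-zero network is exactly the identity, so training only has to learn the (small) nonlinear correction.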
using a novel training algorithm where we first modeled the PA with a NN and then backpropagated through it to train a DPD NN. We extend that work here to show that not only do we improve performance when compared to polynomial-based DPD, but we do so with reduced implementation complexity. Furthermore, to realize the gains of the NN DPD, we design a custom FPGA accelerator for the task and compare it to our own polynomial DPD accelerator.

Outline: The rest of the paper is organized as follows. In Section II, we give an overview of our DPD architecture and methods. In Section III, we compare performance/complexity tradeoffs for the DPD NN to polynomial-based predistorters. In Section IV, we compare FPGA implementations for memory polynomial and NN predistortion. Finally, in Section V we conclude the paper.

II. NEURAL NETWORK DPD ALGORITHM OVERVIEW

For the NN DPD system, we seek to place a NN-based predistorter inline with the PA so that the cascade of the two is a linear system, as shown in Fig. 1. However, to train a NN, it is necessary to have training data, and in this scenario the ideal NN output is unknown; only the ideal PA output is known. To overcome this problem, we train a PA NN model to emulate the PA. We then backpropagate the mean squared error (MSE) through the PA NN model to update the parameters in the NN DPD [17].

A. Neural Network Architecture

We use a feed-forward NN that is fully connected with K hidden layers and N neurons per hidden layer. The nonlinear activation applied in the hidden layers is chosen to be a rectified linear unit (ReLU), shown in (1), which can easily be implemented with a single multiplexer in hardware.

ReLU(x) = max(0, x)    (1)

The input and output data of the predistorter are complex-valued, while NNs typically operate on real-valued data. To accommodate this, we split the real and imaginary parts of each time-domain input sample, x(n), onto separate neurons.

Although PA-induced nonlinearities are present in the transmitted signal, the relationship between the input and output data is still mostly linear. Although a NN can, in principle, learn this relationship given training data, this turns out to be difficult in practice [15]. As such, we implement a linear bypass in our NN that directly passes the inputs to the output neurons, where they are added to the output from the final hidden layer, as can be seen in Fig. 2. This way, the NN focuses entirely on the nonlinear portion of the signal.

B. Training

This work primarily focuses on the implementation and running complexity of the DPD application, which consists of inference on a pre-trained NN. The training is assumed to run offline; once the model is learned, significant updates will not be necessary, and occasional offline re-training to account for long-term variations would be sufficient.

In [17], we first use input/output data of the PA to train a NN to model the PA behavior. We then connect a second DPD NN to the PA NN model and treat the combined DPD NN and PA NN as one large NN. However, during the second training phase, we only update the weights corresponding to the DPD NN. We then connect the DPD NN to the real PA and use it to predistort for the actual device.

The process of predistorting can excite a different region of the PA than when predistortion is not used. To account for this, it is not uncommon in other DPD methods to have multiple training iterations. A similar idea is adopted in [17] and in this work. Once training of the PA and the DPD is performed, we retransmit through the actual PA while using the DPD NN. Using the new batch of input/output data, we can then update the PA NN model and in turn refine the DPD NN. An example of the iterative training procedure is shown in Fig. 3, where the MSE training loss is shown for the PA NN model and for the combined DPD-PA over two training iterations.

III. COMPLEXITY COMPARISON

To evaluate the NN-based predistortion, we present the formulation of both a memory polynomial and the NN. We then derive expressions for the number of multiplications as a function of the number of parameters in the models. In most implementations, multiplications are considered to be more expensive, as they typically have higher latency and require more area and power. Additions typically have a minor impact
\[
\hat{x}(n) = \sum_{\substack{p=1,\,m=0 \\ p\ \text{odd}}}^{P,\,M} \alpha_{p,m}\, x(n-m)\,\lvert x(n-m)\rvert^{p-1} \;+\; \sum_{\substack{q=1,\,l=0 \\ q\ \text{odd}}}^{Q,\,L} \beta_{q,l}\, x^{*}(n-l)\,\lvert x^{*}(n-l)\rvert^{q-1} \;+\; c \tag{2}
\]
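As a concrete reference for (2), the model can be evaluated in a few lines. This is a sketch, not the paper's implementation: the dictionary-based coefficient layout and the use of `np.roll` with zeroed wrap-around to realize the delays x(n − m) are our own illustrative choices.

```python
import numpy as np

def memory_polynomial(x, alpha, beta, c=0.0):
    """Evaluate the memory polynomial of (2) on a complex baseband signal x.

    alpha[(p, m)] is the coefficient for odd order p and delay m; beta[(q, l)]
    weights the conjugate branch (the I/Q-imbalance terms); c is the constant.
    """
    x = np.asarray(x, dtype=complex)
    x_hat = np.full(len(x), c, dtype=complex)
    for (p, m), a in alpha.items():
        xd = np.roll(x, m)          # delayed copy x(n - m)
        xd[:m] = 0                  # zero the samples that wrapped around
        x_hat += a * xd * np.abs(xd) ** (p - 1)
    for (q, l), b in beta.items():
        xd = np.roll(x, l)
        xd[:l] = 0
        xc = np.conj(xd)            # conjugate branch x*(n - l)
        x_hat += b * xc * np.abs(xc) ** (q - 1)
    return x_hat

# Example: a 3rd-order memoryless nonlinearity plus a small conjugate term.
x = np.exp(1j * np.linspace(0, np.pi, 8))
alpha = {(1, 0): 1.0, (3, 0): -0.05}
beta = {(1, 0): 0.01}
y = memory_polynomial(x, alpha, beta)
```

With only the linear term α₁,₀ = 1 the model reduces to the identity, which makes the odd-order basis terms easy to sanity-check in isolation.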
[Figure 4. ACLR (dB) vs. number of multiplications for NN DPD (shown with diamonds) with up to K = 2 hidden layers and memory polynomial (shown with circles) with up to M = 4 memory taps. This represents the out-of-band performance of the predistorter. The stars represent design points that we implement in FPGA in the next section.]

[Figure 5. EVM (%) vs. number of real multiplications for NN DPD (shown with diamonds) with up to K = 2 hidden layers and memory polynomial (shown with circles) with up to M = 4 memory taps. This represents the in-band performance of the predistorter. The stars represent design points that we implement in FPGA in the next section.]
for the feedback on the ADC and 16 bits for the DAC. Using their MATLAB API, we test the NN predistorter using a 10 MHz OFDM signal. This signal has random data on 600 subcarriers spaced apart by 15 kHz and is similar to LTE

[Figure 6. PSD (dB) of the transmitted signal; legend: No DPD, P = 9, N = 20.]
domain, ŝ is the corresponding received vector after passing through the PA, and ‖·‖ represents the ℓ2 norm.

In Fig. 5, we see the EVM versus the number of multiplications for each of the predistorters. As the number of multiplications increases, the EVM decreases, as expected. The memoryless polynomial DPD is able to achieve a low EVM for the smallest number of multiplications. However, the complexity is only slightly higher for the NN-based DPD, which is able to achieve an overall better performance than all other examined polynomial DPDs.

3) Spectrum Comparison: The spectra for both the memory polynomial and the NN DPDs are shown in Fig. 6. Here, both predistorters have the same running complexity of 80 multiplications per time-domain input sample. However, the NN is able to provide an additional 2.8 dB of suppression at ±20 MHz.

[Figure 7. General structure of the NN FPGA implementation.]
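The in-band metric above can be computed directly from the ℓ2-norm notation: with s the intended symbol vector and ŝ the received one, a common EVM definition is 100·‖s − ŝ‖/‖s‖. This is a minimal sketch consistent with that notation; whether the paper normalizes by ‖s‖ or by an RMS reference constellation power is an assumption here.

```python
import numpy as np

def evm_percent(s, s_hat):
    """EVM in percent: 100 * ||s - s_hat|| / ||s||, with ||.|| the l2 norm
    over the vector of received constellation points."""
    s = np.asarray(s, dtype=complex)
    s_hat = np.asarray(s_hat, dtype=complex)
    return 100.0 * np.linalg.norm(s - s_hat) / np.linalg.norm(s)

# A perfectly linearized PA gives 0% EVM; a uniform 1% amplitude error gives 1%.
s = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j])
```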
IV. FPGA ARCHITECTURE OVERVIEW

In this section, we compare a NN DPD accelerator with a memory polynomial based implementation. We implement both designs in Xilinx System Generator and target the Zynq UltraScale+ RFSoC ZCU1285 evaluation board. For the sake of this architecture comparison, we implement each to be fully parallelized and pipelined so as to compare the highest-throughput implementations of each. Based on the previous analysis, we implement both with 16-bit fixed-point precision throughout.

We synthesize FPGA designs targeting two separate ACLRs. First, we target an ACLR of approximately -31.4 dB. This target is achieved with a NN with N = 6 neurons and K = 1 hidden layer and a 7th-order memoryless polynomial. Second, we target a more aggressive ACLR below -32 dB. This is done with a NN with N = 14 neurons and K = 1 hidden layer. A memory polynomial with M = 2 and P = 11 is also used to achieve this.

[Figure 8. Example structure of a PE for the ith neuron in hidden layer 1.]

to that parameter. These registers output to the corresponding multiplier or adder.

An example neuron PE is shown in Fig. 8. Each PE is implemented with a sufficient number of multipliers for performing the multiplication of the weights by the inputs in parallel. The results from each multiplier are added together along with the bias and passed to the ReLU activation function, which is implemented with a single multiplexer.
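The mux-based ReLU in the PE can be illustrated at the bit level: in two's complement, the sign bit of the accumulator directly drives a 2:1 multiplexer that selects either zero or the accumulator value. This is a behavioral sketch of that idea; the 16-bit word width matches the paper, but the binary-point position in `to_fixed` is an assumption.

```python
def relu_mux(acc, width=16):
    """ReLU as a multiplexer: the MSB (sign bit) of the two's-complement word
    selects 0 for negative values and passes the value through otherwise."""
    mask = (1 << width) - 1
    sign = (acc >> (width - 1)) & 1   # MSB of the two's-complement word
    return 0 if sign else acc & mask

def to_fixed(val, width=16, frac_bits=12):
    """Quantize a real value to a two's-complement fixed-point word
    (hypothetical Q3.12 format, for illustration only)."""
    q = int(round(val * (1 << frac_bits)))
    return q & ((1 << width) - 1)

# Negative inputs are zeroed; non-negative inputs pass through unchanged.
neg = to_fixed(-0.5)   # sign bit set in two's complement
pos = to_fixed(0.5)
```

No comparator or multiplication is needed, which is why the activation is essentially free in hardware relative to the weight multipliers.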
[Figure 9. General structure of the high-throughput, low-latency, memory polynomial FPGA implementation.]

Table I
COMPARISON OF PERFORMANCE AND FPGA UTILIZATION

                        ACLR: -31.4 dB          ACLR: -32 dB
Metric                  N=6, K=1   P=7, M=1     N=14, K=1   P=11, M=2
Num. of Params.         32         8            72          24
LUT                     379        539          688         1424
LUTRAM                  16         120          16          224
FF                      538        991          1170        2730
DSP                     24         27           56          66
Worst Neg. Slack (ns)   8.72       8.68         8.49        8.34
Max. Freq. (MHz)        783        756          661         603
Max. T/P (MS/s)         783        756          661         603
Latency (CC)            12         21           14          26

numerous advantages over the memory polynomial. Specifically, for the target of an ACLR less than -32 dB, the NN requires 48% of the lookup tables (LUTs), 42% of the flip-flops (FFs), and achieves a 15% reduction in the number of digital signal processors (DSPs). In terms of timing, there is a 9.6% increase in throughput with a 46% decrease in latency. These reductions in utilization occur while also seeing improved ACLR.

V. CONCLUSIONS

In this paper, we explored the complexity/performance tradeoffs for a novel, NN-based DPD and found that the NN could outperform memory polynomials, offering overall unrivaled ACLR and EVM performance. Furthermore, we implemented each on an FPGA and found that the regular matrix-multiply structure in the NN-based predistorter led to a lower-latency design with less hardware utilization when compared to a similarly performing polynomial-based DPD.

This work opens up many avenues for future work. It can be extended to also compare performance/complexity tradeoffs for more devices with a wider variety of signals, including different bandwidths and multiple component carriers. It is also possible to include memory cells such as recurrent neural networks (RNNs) in the NN to account for memory effects. The NN is naturally well suited for a GPU implementation, which would be interesting in software-defined radio (SDR) systems. The NN complexity could also be further reduced with pruning, and the accuracy could potentially be improved with retraining after quantization and pruning.

REFERENCES

[1] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, "Massive MIMO for next generation wireless systems," IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, Feb. 2014.
[2] W. Roh et al., "Millimeter-wave beamforming as an enabling technology for 5G cellular communications: Theoretical feasibility and prototype results," IEEE Commun. Mag., vol. 52, no. 2, pp. 106–113, Feb. 2014.
[3] A. Zhu, M. Wren, and T. J. Brazil, "An efficient Volterra-based behavioral model for wideband RF power amplifiers," in IEEE MTT-S Int. Microw. Symp. Digest, vol. 2, June 2003, pp. 787–790.
[4] L. Anttila, P. Handel, and M. Valkama, "Joint mitigation of power amplifier and I/Q modulator impairments in broadband direct-conversion transmitters," IEEE Trans. Microw. Theory Techn., vol. 58, no. 4, pp. 730–739, Apr. 2010.
[5] A. Katz, J. Wood, and D. Chokola, "The evolution of PA linearization: From classic feedforward and feedback through analog and digital predistortion," IEEE Microw. Mag., vol. 17, no. 2, pp. 32–40, Feb. 2016.
[6] A. Balatsoukas-Stimming, A. C. M. Austin, P. Belanovic, and A. Burg, "Baseband and RF hardware impairments in full-duplex wireless systems: Experimental characterisation and suppression," EURASIP J. on Wireless Commun. and Networking, vol. 2015, no. 142, 2015.
[7] D. Korpi, L. Anttila, and M. Valkama, "Nonlinear self-interference cancellation in MIMO full-duplex transceivers under crosstalk," EURASIP J. on Wireless Commun. and Networking, vol. 2017, no. 1, p. 24, Feb. 2017.
[8] D. Zhou and V. E. DeBrunner, "Novel adaptive nonlinear predistorters based on the direct learning algorithm," IEEE Trans. Signal Process., vol. 55, no. 1, pp. 120–133, Jan. 2007.
[9] R. N. Braithwaite, "A comparison of indirect learning and closed loop estimators used in digital predistortion of power amplifiers," in IEEE MTT-S Int. Microw. Symp., May 2015, pp. 1–4.
[10] K. Li et al., "Mobile GPU accelerated digital predistortion on a software-defined mobile transmitter," in IEEE Global Conf. on Signal and Inform. Process. (GlobalSIP), Dec. 2015, pp. 756–760.
[11] M. Younes, O. Hammi, A. Kwan, and F. M. Ghannouchi, "An accurate complexity-reduced 'PLUME' model for behavioral modeling and digital predistortion of RF power amplifiers," IEEE Trans. Ind. Electron., vol. 58, no. 4, pp. 1397–1405, Apr. 2011.
[12] M. Jain et al., "Practical, real-time, full duplex wireless," in Proc. Int. Conf. on Mobile Comput. and Netw. ACM, 2011, pp. 301–312.
[13] M. Duarte, C. Dick, and A. Sabharwal, "Experiment-driven characterization of full-duplex wireless systems," IEEE Trans. Wireless Commun., vol. 11, no. 12, pp. 4296–4307, Dec. 2012.
[14] D. Bharadia, E. McMilin, and S. Katti, "Full duplex radios," in ACM SIGCOMM, 2013, pp. 375–386.
[15] A. Balatsoukas-Stimming, "Non-linear digital self-interference cancellation for in-band full-duplex radios using neural networks," in IEEE Int. Workshop on Signal Process. Advances in Wireless Commun. (SPAWC), June 2018, pp. 1–5.
[16] Y. Kurzo, A. Burg, and A. Balatsoukas-Stimming, "Design and implementation of a neural network aided self-interference cancellation scheme for full-duplex radios," in Asilomar Conf. on Signals, Systems, and Comput., Oct. 2018, pp. 589–593.
[17] C. Tarver, L. Jiang, A. Sefidi, and J. Cavallaro, "Neural network DPD via backpropagation through a neural network model of the PA," in Asilomar Conf. on Signals, Systems, and Comput., (to appear).
[18] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251–257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
[19] R. Hongyo, Y. Egashira, T. M. Hone, and K. Yamaguchi, "Deep neural network-based digital predistorter for Doherty power amplifiers," IEEE Microw. and Wireless Compon. Lett., vol. 29, no. 2, pp. 146–148, Feb. 2019.
[20] M. Rawat and F. M. Ghannouchi, "Distributed spatiotemporal neural network for nonlinear dynamic transmitter modeling and adaptive digital predistortion," IEEE Trans. Instrum. Meas., vol. 61, no. 3, pp. 595–608, Mar. 2012.
[21] "RF WebLab." [Online]. Available: http://dpdcompetition.com/rfweblab/