TIPS

Challenges of FPGA-based DSP design
By Eric Cigan

In this article, Eric gives an overview of the benefits of using FPGAs in DSP design and concludes with a list of recommended design rules.

Not long ago, designers of high-performance digital signal processing (DSP) systems had two alternatives for implementation: general-purpose DSPs or ASICs. General-purpose DSPs, such as those from TI, Agere, Motorola, and Analog Devices, are special-purpose microprocessors optimized for common DSP operations. The benefit of general-purpose DSPs is that they are the fastest way to get an algorithm running because they offer a comprehensive development environment, with tools for code analysis, debugging, and rapid prototyping. The disadvantage of DSPs is that ultimately they execute instructions serially, setting an upper limit on the chip’s throughput.

ASICs offer the ability to break through these performance limitations. Custom ASIC design lets the designer employ the optimal mix of resources on a chip and place them in close physical proximity to minimize delays. Moreover, ASICs are ideal for use in portable electronics since the flexibility of ASIC design allows the use of processes and architectures optimized for lower power consumption. The drawbacks of ASICs are considerable:
■ They need to be fabricated.
■ They require more time to design.
■ They require more complex and expensive development tools.
■ A single design flaw could lead to a design respin, causing additional cost and delay.
In the past, given these two choices, most designers avoided ASICs unless absolutely necessary.

Two trends have recently changed the landscape. First, the demand for high-performance DSPs has increased dramatically, due in part to the growth in multimedia and communications systems. Products as diverse as 3G wireless base stations, medical diagnostic imaging equipment, and even driver-assist systems that will automatically park a car would be inconceivable without the use of advanced DSP algorithms. The throughput requirements of these systems have strained the abilities of general-purpose DSPs. For example, one leading manufacturer of advanced echo cancellation systems incorporated more than 25 general-purpose DSPs on a single board to meet its performance goals.

Second, a new generation of programmable chips has emerged as an alternative to standard DSPs. Platform FPGAs such as Altera’s Stratix II and Xilinx’s Virtex II incorporate arrays of dedicated multipliers, embedded memory, and high-speed I/O that make them ideal for DSP applications. The silicon resources of an FPGA lead to staggering performance gains: while the fastest general-purpose DSPs can deliver up to 5 billion MAC/s (multiply-accumulates per second), leading FPGA devices can deliver more than 500 billion MAC/s, which is more than 100x faster. What’s more, channelized applications such as those common to wireless communications naturally lend themselves to parallel implementations in FPGAs. Growth rates in processing speed requirements versus capabilities are shown in Figure 1 [1]. A comparison of general-purpose DSPs versus FPGAs is shown in Table 1 [2].

Tips for DSP design
Let’s now take a close look at how designers can navigate the challenges of using FPGAs in DSP design, avoiding prolonged design cycles and reducing the component cost of end products. It comes down to the following basic rules.

Figure 1
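As a rough sanity check on the throughput figures quoted in the text and in Table 1, peak MAC throughput can be modeled as the number of multiply-accumulate units working in parallel times the clock rate. The sketch below applies that assumed model to the article's numbers. The function names are illustrative, and the implied unit counts are back-of-envelope estimates, not vendor specifications.

```python
def implied_parallel_macs(macs_per_second: float, clock_hz: float) -> float:
    """MAC units that must run in parallel to reach a throughput at a clock rate.

    Assumes the simple model: throughput = parallel units * clock rate,
    with each unit completing one multiply-accumulate per cycle.
    """
    return macs_per_second / clock_hz


def speedup(fpga_macs_per_second: float, dsp_macs_per_second: float) -> float:
    """Ratio of FPGA to DSP peak MAC throughput."""
    return fpga_macs_per_second / dsp_macs_per_second


# Figures quoted in the article text: 500 billion vs. 5 billion MAC/s.
text_ratio = speedup(500e9, 5e9)

# Figures from the 8x8 MAC row of Table 1.
dsp_units = implied_parallel_macs(4.8e9, 600e6)
fpga_units = implied_parallel_macs(1e12, 300e6)

print(f"Text figures: FPGA is {text_ratio:.0f}x the DSP")
print(f"DSP:  about {dsp_units:.0f} MACs per clock cycle")
print(f"FPGA: about {fpga_units:.0f} MACs per clock cycle")
```

Under this model the DSP reaches its rating with about 8 MACs per cycle, while the FPGA figure implies thousands of 8x8 MACs running in parallel at half the clock rate. Parallelism, not clock speed, accounts for the gap.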
Reprinted from Embedded Computing Design / Spring 2004
Function                            Industry-leading general-purpose     Industry-leading
                                    DSP processor core                   platform FPGA

8x8 multiply-accumulate (MAC)       4.8 billion MAC/s                    1 trillion MAC/s
                                    fclk = 600 MHz                       fclk = 300 MHz

FIR filter                          9.3 Msps                             300 Msps
(256 taps, linear phase,            fclk = 600 MHz                       fclk = 300 MHz
16-bit data/coefficients)

Complex FFT                         10 µs                                1 µs
(1024 point, 16-bit data)           fclk = 600 MHz                       fclk = 150 MHz

Viterbi decoding throughput         500 channels at 7.95 Kbps            155 Mbps (OC-3 rates)
                                    for a total of 3.9 Mbps

Reed-Solomon decoding throughput    4.1 Mbps                             10 Gbps (OC-192 rates)
                                    fclk = 600 MHz                       fclk = 85 MHz

Turbo convolutional decoder         Six 2 Mbps data streams              5.4 Mbps
throughput                          (6 iterations)                       (6 iterations)

Table 1

Rule #1
Start at the beginning.
Complex DSP designs start with an algorithm developer who creates the initial design based on existing designs and experience. According to the DSP market research firm Forward Concepts, the leading tool for algorithm design is MATLAB from MathWorks. Using the MATLAB language, algorithm developers can create designs in a natural and productive form and may tap into an immense wealth of designs, scripts, and engineering know-how available only in the MATLAB language. Though designers can choose from other options, including block-level environments such as Simulink or SPW and languages based on C/C++, these environments are less widely used and there may not be as many designs available for them. Moreover, many constructs used in DSP designs, such as looping, repeated structures, and 2- or 3-dimensional data arrays, are much easier to represent in MATLAB than in block-level environments. Once the algorithm is created in MATLAB, it can be readily shared or partitioned across a design team and reused over time.

Rule #2
Avoid recopying your work (or alternatively, “Don’t get lost in translation”).
Once the algorithm is available, the rest of the design team, including hardware designers, software developers, and the system designers who integrate the design components, swings into motion. In the past, the completed algorithm in MATLAB became the executable specification, which meant that the hardware designers and software developers needed to recreate the design.

Many embedded systems developers are accustomed to implementing DSP algorithms on general-purpose DSPs in C or assembly language. This puts hardware and software engineers into the role of translating designs from one language to another, creating many opportunities for inserting errors, with the attendant debugging process. To avoid this process altogether, companies are looking to architectural synthesis tools that use the MATLAB M-file as the golden source for downstream design, automatically synthesizing the design at the Register Transfer Level (RTL). Coupled with traditional RTL synthesis tools that can synthesize RTL to gate-level implementation, this establishes an unbroken design flow from algorithmic creation to hardware implementation. The top-down design process is shown in Figure 2.

Figure 2

Rule #3
Always check your work: use a verification flow that is complete.
In any design process, it’s essential to be able to verify that the design meets the higher-level specifications. If the design starts as a floating-point algorithm in MATLAB (also called an M-file), the fixed-point M-file should behave within an acceptable range of the floating-point M-file. The RTL implementation should then conform precisely to the fixed-point M-file; in other words, the RTL should be bit-true to the fixed-point M-file.

But how does the engineer determine that the design matches all the way through? The proven method for verifying designs in top-down flows is a testbench that feeds the design under test with its input stimulus and monitors whether its outputs match the expected results. Designers should create testbenches that allow this methodology to be employed using HDL simulators. Better yet, they should seek out tools that automatically generate an RTL testbench from the original design and use an HDL simulator to simulate and functionally verify the design, including bit-true operation.

Rule #4
Don’t reinvent the wheel.
DSP systems incorporate a number of building blocks that are common to most designs: FIR and IIR filters, fast Fourier transforms and discrete cosine transforms, channel coding, etc. Developing these functions from scratch comes at great expense; in fact, according to Berkeley Design Technology, Inc., developing a new FFT in silicon can consume up to six months. Designers need to adopt tools and techniques that provide access to a large and growing variety of intellectual property (IP) geared towards DSP design. While there are many sources for IP in hard form (laid out on target silicon) or in soft form (delivered in synthesizable RTL), typically there is no corresponding simulation model available in MATLAB. This breaks the verification flow, making it difficult for the algorithm developer and the hardware designer to validate that the algorithm is faithfully represented in silicon.
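The two checks described under Rule #3, and the kind of reference model whose absence Rule #4 warns about, can be sketched in a few lines. The example below uses Python rather than MATLAB purely for illustration. The filter coefficients, the 12-bit fractional format, and the `rtl_output` list (a stand-in for values captured from an HDL simulation) are all assumptions made for this sketch, not part of any particular tool flow.

```python
def to_fixed(x: float, frac_bits: int) -> int:
    """Quantize a real value to a signed integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def fir_float(samples, coeffs):
    """Floating-point FIR filter (the algorithm-level reference)."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def fir_fixed(samples_q, coeffs_q):
    """Fixed-point FIR using integer arithmetic and a full-precision accumulator."""
    out = []
    for n in range(len(samples_q)):
        acc = 0
        for k, c in enumerate(coeffs_q):
            if n - k >= 0:
                acc += c * samples_q[n - k]
        out.append(acc)
    return out


def first_mismatch(expected, actual):
    """Bit-true comparison: index of the first differing sample, or None."""
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    return None


FRAC_BITS = 12  # assumed word format for this sketch
coeffs = [0.2, 0.6, 0.2]
samples = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25]

coeffs_q = [to_fixed(c, FRAC_BITS) for c in coeffs]
samples_q = [to_fixed(s, FRAC_BITS) for s in samples]
golden = fir_fixed(samples_q, coeffs_q)

# Stand-in for values captured from an HDL simulation (hypothetical data).
rtl_output = list(golden)

# Check 1: the fixed-point model stays within a tolerance of the floating-point model.
scale = float(1 << (2 * FRAC_BITS))  # products carry 2 * FRAC_BITS fractional bits
reference = fir_float(samples, coeffs)
max_error = max(abs(g / scale - f) for g, f in zip(golden, reference))

# Check 2: the RTL output must match the fixed-point model exactly.
mismatch = first_mismatch(golden, rtl_output)
print(f"max float-vs-fixed error: {max_error:.2e}; first RTL mismatch: {mismatch}")
```

The first check uses a tolerance because quantization changes the result slightly. The second uses exact equality, so any difference, even in the least significant bit, points to a divergence between the model and the hardware.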
Rule #5
Use your budget wisely.
DSP algorithms are almost always developed using floating-point arithmetic, giving the algorithm developer the ability to evaluate a design in its best-case scenario. While general-purpose DSPs are typically designed to perform 16- or 32-bit arithmetic, implementing algorithms in FPGAs or ASICs gives the designer the ability to independently control the number of bits used to represent each number in the algorithm. Using too many bits can be costly (a 40% increase in the number of bits in a multiplier can double its area in silicon), but using too few can lead to overflows or instabilities. When choosing tools for implementing DSP algorithms in silicon, designers should evaluate tools that help automate this floating-point to fixed-point conversion process.

Figure 3
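The word-length tradeoff in Rule #5 can be explored numerically. The sketch below quantizes a few example coefficients at several word lengths and reports the worst-case rounding error alongside a relative multiplier area. It assumes multiplier area grows roughly with the square of the operand width, which is consistent with the 40% figure quoted in the rule, since 1.4 squared is about 1.96. The coefficient values and function names are illustrative only.

```python
def quantize(x: float, word_bits: int, frac_bits: int) -> float:
    """Round x to a signed fixed-point grid and saturate on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))       # most negative integer code
    hi = (1 << (word_bits - 1)) - 1    # most positive integer code
    code = max(lo, min(hi, round(x * scale)))
    return code / scale


def relative_multiplier_area(word_bits: int, baseline_bits: int) -> float:
    """Area relative to a baseline, assuming area scales with bits squared."""
    return (word_bits / baseline_bits) ** 2


coeffs = [0.7071, -0.3827, 0.1951, 0.9239]  # illustrative values in (-1, 1)

for word_bits in (8, 10, 12, 14, 16):
    frac_bits = word_bits - 1  # one sign bit, the rest fractional
    worst = max(abs(c - quantize(c, word_bits, frac_bits)) for c in coeffs)
    area = relative_multiplier_area(word_bits, baseline_bits=8)
    print(f"{word_bits:2d} bits: worst error {worst:.6f}, relative area {area:.2f}")
```

Under this model, going from 10 to 14 bits (a 40% increase) nearly doubles the multiplier area while tightening the rounding-error bound by a factor of 16. The saturation in `quantize` shows the other side of the tradeoff: with too few bits, values outside the representable range are clipped rather than represented.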
Rule #6
Keep your options open with a vendor-independent, technology-independent flow.
As designers, we are all increasingly under pressure to cut costs, which often means selecting the supplier who can provide the lowest price, the best availability, etc. The tools available in the market fall into two categories.

Vendor-supplied tools are available from companies offering FPGA devices for DSP and provide integrated environments spanning graphical design entry, IP block libraries, and RTL simulation and synthesis tools. But these tools offer libraries of DSP functions that can only target a single vendor’s devices. Converting a design from one vendor’s tools to another is at best a time-consuming and error-prone process, leaving the designer at the mercy of the vendor in terms of cost and availability.

Vendor-independent tools provide a more flexible alternative. Once the design is captured, it can easily be retargeted from one device family to another from the same vendor, and can even be retargeted to an entirely different family of FPGAs. Yet another advantage of vendor-independent flows involves the need to retarget the same design to different silicon technologies. Companies find that they can meet their need for first silicon using FPGAs, and then incorporate structured ASICs or ASICs as they become available from product lines. While vendor-supplied tools provide IP that is only available for FPGA devices, vendor-independent tools allow designers to retarget the designs without changing the golden design source.

Rule #7
Given the time, you can always make a design better: use design exploration.
Almost invariably, getting the functionality of the design correct is just the beginning. Then begins the pursuit of improving performance to make specs and trying to shrink to a smaller device or go to a slower speed grade to cut costs. Hardware engineers have an arsenal of tools and tricks at their disposal, but working at gate level, or even at RTL level, has its limits. Inserting intermediate registers can be difficult. Optimizing quantization throughout a design is particularly tedious and error-prone when the changes are made at the RTL level. And if the algorithm developer comes up with a brilliant new idea two weeks into the hardware design, chances are that the project manager will decide to stick with the old design rather than risk the entire project schedule.

The greatest benefits can be realized by keeping the original floating-point MATLAB source file as the golden source for all design and using algorithmic synthesis tools that synthesize from MATLAB to RTL. Using architectural synthesis tools such as AccelChip DSP Synthesis, the algorithm designer can make changes to the design well into the flow, resynthesize to RTL, and work with the hardware designer to determine whether the design performance and device utilization have improved. Possible trade-offs given different levels of abstraction are shown in Figure 3.

Wrapping up
Implementing DSP algorithms in silicon used to be a task reserved for only the most highly skilled and best equipped design teams. The demand for a more efficient path from algorithm to an ASIC or FPGA has given rise to a new breed of EDA companies, such as AccelChip, that bridge the gap between DSP algorithm development and silicon. Architectural synthesis tools such as these accelerate design and implementation by automatically synthesizing algorithms written as floating-point MATLAB models into synthesizable VHDL or Verilog models suitable for standard ASIC and FPGA design flows. AccelChip’s toolset also enables rapid design exploration targeting fidelity, performance, area, and cost tradeoffs for optimal results while using MATLAB as a golden source.

Eric Cigan is the Product Marketing Manager for AccelChip Inc., and is responsible for product planning and promotion for the AccelChip product family. He has more than fourteen years’ experience in the EDA industry.

References:
1. Xilinx, June 2003.
2. Xilinx website.

For further information, contact Eric at:
AccelChip Inc.
1900 McCarthy Blvd., Suite 204
Milpitas, CA 95035
Tel: 408-943-0700 • Fax: 408-943-0661
E-mail: eric.cigan@accelchip.com
Website: www.accelchip.com