Note 64 X
S. Simion
Nevis Labs, Columbia University, 13 September 2001
Abstract
There has been increasing interest within the Liquid Argon ROD group in the
latest generation of DSPs from TI as a possible solution for the LArg ROD system.
This note describes the status of an ongoing study of the C64x DSP.
1 Introduction
For the LAr ROD, the DSP performs two main computing tasks: a) running the optimal
filtering code to produce the energy, time, and pulse quality of the signal; and b) filling
histograms for online data monitoring.
These two tasks are qualitatively different in the way they exercise the DSP architecture,
so both need to be considered when choosing the DSP.
However, only the first task (a) is currently very well defined. We aim to design a
system flexible enough to cover some uncertainty in (b), and possibly to allow
other tasks to be added, such as lossless data compression.
The implementation of the optimal filtering algorithm which has been most extensively
tested and studied is the C6203 implementation (developed by S. Böttcher), also known
as the loop26 program [1]. This code was written specifically for the C6203 architecture.
When studying the C64x, it is important to carefully consider the substantial differences
in architecture between these DSPs.
Table 1 shows the key features of the C64x core as compared to the C6203 core.

Table 1: Key features of the C64x core versus the C6203 core.

                              C6203    C64x
General-purpose registers        32      64
2 Performance of the C62x code on the C64x platform
The C64x platform supports the complete C6203 instruction set, so in principle the
loop26 program can run on the C64x with no modification. However, due to the
architecture differences, we do expect a loss in efficiency. Reasons for this include:
• Cross-path forwarding constraints, described in Section 5.6.2 of the Instruction Set
Reference [2].
• Banking structure of the L1D memory, described in Section 3.4.4 of the Peripherals
Reference [3].
Unlike the C6203, the C64x does not have two separate memory blocks; therefore,
accessing contiguous L1D memory means accessing all the banks in turn. Bank
conflicts can still be avoided by suitable pointer arithmetic.
• L1D and L1P cache architecture, described in Section 3.4 of the Peripherals Reference.
The C64x has only a 16 kByte L1 data cache (L1D) running at CPU speed. The bulk of
the memory (L2 SRAM) is slower.
It is not possible to lock any data in the L1D cache.
It is reasonable to expect that the calibration constants will not stay in the L1D cache
from one event to the next, since we will be doing some histogramming in between.
The price to pay for an L1D miss (i.e. an attempt to read data which is not cached) is
anything between 2 and 8 clock cycles, as shown in Table 3-8 of the Peripherals Reference [3].
By examination of the loop26 code, we identified 10 cross-path stalls which would occur
for every pair of channels (i.e. per loop iteration), should the code run on the C64x
platform.
We also identified 5 potential memory banking conflicts for every pair of channels, i.e.
5 execute packets with two word loads (LDW) in parallel. We disregarded writes, since
on the C64x these would most likely go directly to the L2 (remember that the L1D is a
write-through cache). Therefore, assuming that memory banking conflicts occur
randomly, and given that there are eight 32-bit-wide banks, we predict 0.625 stalls on
average per pair of channels.
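As a quick check of that estimate: each of the 5 execute packets issues two parallel word loads, which stall only when both loads hit the same one of the eight banks, so under the random-access assumption

\[
5 \times \frac{1}{8} = 0.625 \ \text{stalls per pair of channels.}
\]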
It is difficult, if not impossible, to predict the L1D cache performance without using a
simulator. For the remainder of this document, all performance results have been
obtained using the Code Composer Studio¹ (CCS) version 2 Cache Simulator.
Simulated performance results for the loop26 code running on the C64x platform are
shown in Table 2 below. The effective cycle count has roughly doubled (53 to 62 clock
cycles for two channels), but this is compensated by the C64x clock frequency of
600 MHz, compared to 300 MHz for the current² C6203. The results are in good
agreement with the predicted performance, and show that loop26 is not suitable for the
C64x DSP. We conclude that the optimal filtering code, and most of the existing DSP
software, must be completely rewritten for the C64x.
1. Part number TMDU324685C-07.
2. We are told that C6203 400-MHz devices may be available in 1Q 2002.
Table 2: Simulated performance results for loop26 running on the C64x platform. All numbers are
for 128 channels.
L1P misses 74 0
3 Porting the optimal filtering code to the C64x

The purpose of this exercise was to write the same algorithm for the C64x DSP in order
to:
• Comply with the new resource constraints: eliminate the L1D bank conflicts and
cross-path stalls;
• Take advantage of some new opcodes which are not present in the C6203;
• Optimize the data accesses.
3.1 Simplified analysis of the algorithm in view of porting to the C64x platform
The loop26 code implements the calculation of E, t, and χ² in the following way:

\[
E = \Bigl( \sum_{i=0}^{4} a_i s_i - a_5 \Bigr) \cdot 2^{n} \tag{1}
\]

\[
T = \Bigl( \sum_{i=0}^{4} b_i s_i - b_5 \Bigr) \cdot 2^{m} \cdot T_1/E \tag{2}
\]
In these formulae, the s_i are the five 12-bit ADC values, and a_0 to a_4 and b_0 to b_4 are 16-bit
signed calibration coefficients; a_5 and b_5 are 32-bit pedestals, also part of the calibration
coefficients. The correct energy scale, which accounts for the gain (high, medium, or
low), is obtained after multiplication by 2^n, where n is a gain-dependent³ parameter, also
part of the calibration coefficients. This allows use of the full precision of the 16-bit
multiplication coefficients (a_0 to a_4) even in high gain.
For the time calculation, the most significant bits of the energy fraction⁴ are used as an
index into the division table, to retrieve the 16-bit value T_1/E. The multiplication by 2^m is
necessary to account for the energy magnitude⁵.
The formula used to define the pulse shape χ² is:

\[
\chi^2 = \sum_i s_i^2 - 2p \sum_i s_i + 5p^2
+ \frac{E}{2^{14}} \cdot
\frac{\dfrac{E}{2^{14}} \cdot \dfrac{\sum_i h_i^2}{2^{12}} + 2p \sum_i h_i - 2 \sum_i h_i s_i}{2^{12}}
\tag{3}
\]
Here the 16-bit values −2h_i are the calibration coefficients describing the pulse shape.
They are normalized to satisfy Σ_i a_i h_i = 2^26. Energy multiplications are performed with
bits 14 to 29 of the energy sum result. 2p Σ_i h_i and p² are 32-bit constants, and (Σ_i h_i²)/2^12 is
stored in 16 bits.
In the 6203 implementation, all multiplications are 16-bit by 16-bit. Multiplications by
32-bit operands are actually multiplications by the upper 16 bits of these operands; the
lower 16 bits are lost.
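In other words (a one-line plain-C illustration with a hypothetical function name; on the C62x this corresponds to the MPYH/MPYHL-style opcodes):

```c
/* Plain-C sketch of what the C62x-era code does when it "multiplies by
 * a 32-bit operand": only the operand's upper 16 bits take part, so the
 * lower 16 bits are discarded. */
static inline int mpy_by_hi16(short coeff, int operand32)
{
    return coeff * (short)(operand32 >> 16);   /* 16x16 -> 32-bit product */
}
```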
The 6203 code which computes the energy uses 5 MPY instructions and 6 arithmetic⁶
instructions (including the shift instruction used to scale by 2^n). The time computation
uses 6 MPY instructions and 6 arithmetic instructions. In addition, to compute the
index into the division table, 4 more arithmetic instructions are needed for every channel.
The χ² calculation uses 13 MPY instructions and 17 arithmetic instructions for each
channel, including the shift instruction used to compute E/2^14. A few more arithmetic
instructions are needed to pack the time and χ² information into a single 32-bit word.
Finally, more instructions are employed in pointer arithmetic (computation of the
calibration constants' address as a function of the gain, etc.). The corresponding
assembly code is not discussed here, since this part is not likely to benefit from the new
C64x instruction set. Only a global instruction count will be given.
Let us now turn towards the newer C64x instruction set. The most common operation
needed in our application is the scalar product of length 5 followed by pedestal
subtraction, which requires 5 ADD and 5 MPY instructions on the 6203. On the C64x,
3. The value n is also channel-dependent.
4. Any non-zero number may be written as 00001xxxxx... or 1.f, where f is the fraction.
5. The word magnitude has its full meaning here, since m is directly related to log₂E.
6. Instructions which are not loads, stores, or MPY are referred to as arithmetic instructions.
the same can be done with 2 DOTP2, one MPY, and 3 ADD instructions (provided the
input data are packed appropriately).
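As an illustration of this instruction count, here is a minimal sketch in C using the TI compiler intrinsics that map onto these opcodes. The function and variable names are ours, not from loop26, and the sketch assumes the samples and coefficients are already packed two per 32-bit word:

```c
#include <c6x.h>  /* TI C6000 intrinsics header in recent TI toolchains;
                     older compilers declare _dotp2(), _mpy() implicitly */

/* Length-5 scalar product with pedestal subtraction:
 * 2 DOTP2 + 1 MPY + 3 ADD-class instructions, as counted above.
 * a01 = a1:a0 and a23 = a3:a2 packed as 16-bit halves; s01, s23
 * hold the samples packed the same way; ped is the 32-bit a5. */
static inline int raw_energy(int a01, int a23, int a4,
                             int s01, int s23, int s4, int ped)
{
    int acc = _dotp2(s01, a01)    /* a0*s0 + a1*s1          (DOTP2)      */
            + _dotp2(s23, a23);   /* + a2*s2 + a3*s3        (DOTP2, ADD) */
    acc += _mpy(s4, a4);          /* + a4*s4 (16 LSBs only) (MPY, ADD)   */
    return acc - ped;             /* pedestal subtraction   (SUB)        */
}
```

If the FEB words were not already paired this way, two _pack2() operations per channel would rearrange the halfwords first, which is the PACK2 cost discussed below.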
Furthermore, 32-bit by 16-bit multiplications (with right-shift and rounding to a 32-bit
result) are available on the C64x. This type of multiplication will be used as follows (a
plain-C sketch of the operation follows this list):
• In the energy calculation, the final scaling by 2^n becomes a scaling by an arbitrary
16-bit coefficient. This coefficient becomes part of the calibration constants, in place of
n. The energy scale precision is improved, and we save an arithmetic operation.
• In the time calculation, the final 16-bit by 16-bit multiplication by T_1/E becomes a
32-bit by 16-bit multiplication. The division table remains in 16-bit precision, but we
no longer throw away the lower 16 bits of the E·t sum. The shift operation is
still needed to take the energy magnitude into account.
• In the χ² calculation, the two 16-bit multiplications by E/2^14 will be replaced with
two 32-bit multiplications by E. Once again, the improvement in precision comes for
free. The shift instruction used to compute E/2^14 is eliminated.
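The following portable-C fragment shows what such a 32-bit by 16-bit rounded multiplication computes; on the C64x itself it would map onto the MPYxIR-family opcodes (which round by adding 2^14 and then shift right by 15). The function name and the 'scale' coefficient are illustrative assumptions:

```c
/* 32x16 multiply with right-shift and rounding, written in plain C.
 * 'scale' stands for the hypothetical 16-bit calibration coefficient
 * that replaces the 2^n scaling in the energy calculation. */
static inline int scale_round(int x32, short scale)
{
    long long p = (long long)x32 * scale;   /* full 48-bit product         */
    return (int)((p + (1 << 14)) >> 15);    /* round to nearest, shift 15  */
}
```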
Finally, if we wish to preserve the current packing of FEB data (done in hardware), then
two PACK2 instructions must be added per channel in order to make use of the DOTP2
multiplications.
Also new in the C64x are the MPY2 instructions. Although one such instruction could
replace two single MPY instructions, it usually requires special preparation (PACKing
of data) and does not save an ADD. Therefore MPY2 instructions are only justified if the
application is limited by MPYxx throughput.
Table 3 summarizes the instruction count needed to compute E, t, and χ², for both the
C6203 and C64x implementations. The instructions used in the core calculation are
counted exactly. The table also shows the total instruction count in loop26.

In conclusion, it should in principle be possible to write a C64x implementation of the
optimal filtering algorithm using 15 to 16 execute packets per pair of channels.
Anticipating a little, we also list in Table 3 the total instruction count in the new
loop16 (C64x) implementation. However, fully understanding the loop16 implementation
requires reading through the remainder of this document.
In the C64x architecture, all input data must go through the L1D cache. The FEB data
for 128 channels, 5 samples each, occupy 20 L1D cache lines. The calibration constants for
128 channels, one gain, occupy 104 cache lines. The division table occupies 128 cache
lines. The L1D size is 256 cache lines in 128 sets.
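A quick arithmetic check of these line counts, using the 64-byte line size implied by the quoted figures (16 kByte / 256 lines = 64 bytes per line; 256 lines in 128 sets means a two-way set-associative cache) and assuming the FEB samples are packed two channels per 32-bit word, as in the x0y0 notation of the figures below:

\[
\text{FEB data: } \frac{(128 \cdot 5/2) \cdot 4\,\mathrm{B}}{64\,\mathrm{B}} = 20\ \text{lines}, \quad
\text{calibration: } \frac{128 \cdot 13 \cdot 4\,\mathrm{B}}{64\,\mathrm{B}} = 104\ \text{lines}, \quad
\text{division table: } \frac{4096 \cdot 2\,\mathrm{B}}{64\,\mathrm{B}} = 128\ \text{lines}.
\]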
The algorithm needs 13 calibration data words per channel and per gain: four words
for the energy calculation, 3.5 for the time, and 5.5 for χ². The memory layout of the
calibration constants may impact the L1D performance, and a few different layouts may
be envisaged.

Table 3: Minimum number of instructions to compute the E, t, and χ² calculation for one channel,
according to the formulae presented above.

                C6203    C64x
Total memory       19      11

Various possible organizations of calibration constants in L1D cache lines are
shown in Figure 1.
One possibility (A) is to use only the Type 0 arrangement, in which one cache line holds
the 13 calibration words for a single channel (for one gain). In this case, a total of 128
cache lines are needed for 128 channels (for one gain). This arrangement is the most
tolerant to gain variation: since no channels share the same cache line, there will be no
additional L1D misses in case various channels come in different gains. However, the
disadvantages are: a) 3 out of 16 words are lost⁷; and b) L1D misses cannot be pipelined,
since there will be at most one L1D miss per channel.
The second possibility is to spread the calibration constants for one channel among a
few cache lines, using a combination of Type 1, Type 2, and Type 3 cache lines.
7. Precisely, for every L1D cache miss, a full cache line will be read from L2, of which we use only 13 words.
Figure 1: Possible organizations of the calibration constants in L1D cache lines. A Type 0 line
holds the 13 calibration words of a single channel (1 channel per L1D cache line); a Type 3 line
holds one word for each of 16 channels, ch.0 to ch.f (16 channels per L1D cache line).
One possibility (B) is to use one Type 1, one Type 2, and one Type 3 cache line for each
channel. Another possibility (C) is to use three Type 2 and one Type 3 cache lines for each
channel. In both cases, only 104 cache lines are needed for 128 channels. In case (B) we
have a total of 64 Type 1, 32 Type 2, and 8 Type 3 lines for 128 channels; in case (C) we
have 96 Type 2 and 8 Type 3 lines.
In case (B) we could arrange for one L1D miss every 4 channels for Type 1, one every 8
channels for Type 2, and one every 32 channels for Type 3. In case (C) we could arrange
for three L1D misses every 4 channels for Type 2, and one every 32 channels for Type 3.
Arrangement (C) allows the best L1D miss pipelining, but is the most sensitive to gain
variation.
When loading the FEB input data for a pair of channels, we must perform two
double-word loads (referred to as "FEB 1") and one single-word load (referred to as
"FEB 2"), for a total of five 32-bit words holding the five samples.
Table 4 shows the L1D miss pipelining which could be achieved. The different types of
calibration data are listed in decreasing order of L1D miss rate, assuming that all
channels come in high gain. The incoming FEB data are listed in increasing order of
L1D miss rate. For the best L1D pipelining, the data should be read in this same order.
Ordering matters because we do not wish to isolate L1D misses occurring, for example,
every 8 channels, by inserting a memory access which causes an L1D miss only every 32
channels.
Following the same argument, we do not wish to interleave FEB input data reads with
calibration constant reads. Reading FEB data is guaranteed to cause L1D misses, while
calibration constants might remain in the L1D cache from one event to the next. Even
though the latter is unlikely, it is not desirable to leave the possibility of gaps between
accesses to FEB data.
A tentative implementation of scheme (A) is shown in Figure 2, scheme (B) in Figure 3,
and scheme (C) in Figure 4.
Table 4: L1D pipelining for the E-t-χ² algorithm.
Type 1 0 64 0 0 2 every 4 0
LDW.D1 x4y4 ; type “2” FEB data for 2 channels, miss every 32 channels
LDDW.D1 x0y0:x1y1 ; type “1” FEB data for 2 channels, miss every 16 channels
|| LDDW.D2 x2y2:x3y3 ; type “1” FEB data for 2 channels, miss every 16 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “0” calibration data, miss every 2 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “0” calibration data, miss every 2 channels
...
{loading the remaining 22 words of calibration coefficients, no L1D miss}
Figure 2: C64x implementation of scheme A.
LDW.D1 x4y4 ; type “2” FEB data for 2 channels, miss every 32 channels
LDDW.D1 x0y0:x1y1 ; type “1” FEB data for 2 channels, miss every 16 channels
|| LDDW.D2 x2y2:x3y3 ; type “1” FEB data for 2 channels, miss every 16 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “1” calibration data, miss every 4 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “1” calibration data, miss every 4 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDW.D1 {calibration coefficients for channel “x”} ; type “3” calibration data, miss every 32 channels
|| LDW.D2 {calibration coefficients for channel “y”} ; type “3” calibration data, miss every 32 channels
...
{loading the remaining 16 words of calibration coefficients (12 type “1” and 4 type “2”), no L1D miss}
Figure 3: C64x implementation of scheme B.
The L1D performance has been simulated for each case, using the Code Composer
Cache Simulator. As explained above, we expect a performance degradation in cases (B)
and (C) if all 128 channels do not come in the same gain, since Type 1, Type 2, and Type 3
cache lines all contain calibration data for different channels but for the same gain.
Therefore the following scenarios have been simulated: a) all channels in high gain;
b) some fraction of the channels (5%, 10%, 20%) in medium gain, with all other
channels in high gain.
The simulation has been performed with a dummy C64x code performing only the data
accesses, not the real optimal filtering and χ² computation. However, the spacing
between every pair of channels was 17 cycles, as estimated in Section 3.1.
LDW.D1 x4y4 ; type “2” FEB data for 2 channels, miss every 32 channels
LDDW.D1 x0y0:x1y1 ; type “1” FEB data for 2 channels, miss every 16 channels
|| LDDW.D2 x2y2:x3y3 ; type “1” FEB data for 2 channels, miss every 16 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDW.D1 {calibration coefficients for channel “x”} ; type “3” calibration data, miss every 32 channels
|| LDW.D2 {calibration coefficients for channel “y”} ; type “3” calibration data, miss every 32 channels
...
{loading the remaining 12 words of (type “2”) calibration coefficients, no L1D miss}
Figure 4: C64x implementation of scheme C.
The main simulation result is shown in Table 5. Although scheme (C) is the most sensitive
to gain variation, it outperforms schemes (A) and (B) even with 20% of the channels in
medium gain. This is of course due to the excellent L1D miss pipelining achieved in this
configuration. We can see from Table 6 that the average penalty for an L1D miss in
scheme (C) can be as low as 2.5 CPU clocks, compared to 4.1 clocks for scheme (A). Also
note (from Table 2) that loop26 had an average L1D miss penalty of 5.4 clocks.
Table 5: L1D stall cycles per FEB, for a cold start
Gains A B C
Table 6: L1D stall cycles per L1D read miss, for a cold start
Gains A B C
Table 7: L1D read misses per FEB, for a cold start
Gains A B C
For the sake of completeness⁸, Table 7 shows the total number of L1D misses per event.
The same results are shown graphically in Figure 5.
Note that we have not yet considered accesses to the division table; only the reading of
input FEB data and calibration constants has been considered so far. Nevertheless,
scheme (C) has been selected for the C64x loop16 implementation.
Currently the algorithm uses a division table with 2^12 = 4096 entries of 16 bits each.
To achieve the desired precision, the most significant 12 bits of the energy
fraction are used as an index into the division table. For example, the same table entry
is used to divide by 3, 6, 12, 24, etc. (all these numbers have their fraction
equal to 0x800).
The same entry in the division table is used for a large range of energies. Conversely,
small energies may require division table entries which are far apart. More precisely,
all odd positive integers have different fractions; therefore, all odd positive integer
energies smaller than 2^13 will use different division table entries.
On the C64x, the L1D cache set index is given by bits 6 to 12 of the physical address. For
the division table, this means the most significant 7 bits of the energy fraction. As a
consequence, all odd positive integer energies smaller than 2^8 will use different division
table L1D cache lines. The number of division table cache lines needed thus increases
linearly, as E/2, until it reaches the maximum of 128.
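The following is an illustrative reconstruction of this indexing scheme: normalize E to the form 1.f and use the 12 most significant fraction bits as the table index. __builtin_clz is the GCC/Clang builtin; TI's compiler would use an equivalent such as _lmbd(). The function name is ours:

```c
/* Hypothetical sketch of the division-table index computation.
 * Requires e > 0. */
static unsigned div_index(unsigned e)
{
    int msb = 31 - __builtin_clz(e);             /* position of leading 1 */
    /* bring the 12 bits just below the leading 1 down to bits 11..0     */
    unsigned frac = (msb >= 12) ? (e >> (msb - 12))
                                : (e << (12 - msb));
    return frac & 0xFFF;                         /* 12-bit fraction       */
}
```

For example, div_index() returns 0x800 for 3, 6, 12, 24, as stated above. Since each table entry occupies 2 bytes, address bits 6 to 12 correspond to bits 5 to 11 of this index, i.e. to the 7 most significant fraction bits, which is the cache-line statement made in the text.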
Division table read misses will occur randomly. However, in the C64x implementation,
we tried to place these reads close to the other reads which are known to generate L1D
misses.
Since reading from the division table is expensive, in the loop16 implementation we
allow a slight deviation from the loop26 philosophy: an energy cut may be specified,
which determines for which channels the time is to be computed. Effectively, only the
retrieval of 1/E, and the subsequent storage of time and χ², are suppressed for channels
with E < E_cut. This feature can of course be bypassed, in which case the program
computes and stores the time for all channels.

8. Table 5, Table 6, and Table 7 are not independent.
Figure 5: C64x L1D performance for reading FEB data and calibration coefficients, for schemes
A, B, and C, as a function of the fraction of channels in medium gain (0 to 25%). Top to bottom:
L1D stall cycles per FEB; L1D stalls per read miss; L1D misses per event.
Table 8: Simulated performance results for loop16 running on the C64x platform. All numbers are
for 128 channels. All channels are in high gain. The program is in the L1P cache.
Cross-path stalls 0 0
Table 8 shows the simulated performance of the loop16 code. Although the number of
execute packets per pair of channels has been reduced from 26 to 16, the effective cycle
count to compute E, t, and χ² for all channels is almost double, mostly due to the L1D
cache inefficiency. The average penalty per L1D miss is 3.3 cycles, slightly worse than
the 2.54 cycles achieved with scheme (C). This is of course due to the accesses to the
division table, which occur randomly, so miss pipelining is less efficient.
If a very large energy cut is specified, the accesses to the division table are absent,
and the L1D miss penalty becomes 2.9 cycles. Again, this is slightly higher than the 2.54
cycles achieved with scheme (C). In fact, we had to deviate slightly from scheme (C) to
avoid increasing the loop latency, which would have resulted in much higher register
pressure.
4 Conclusion
This note shows the first steps in the evaluation of the C64x DSP for the ROD. The core
calculation of the energy, time, and χ² has been investigated. We determined that it is
possible to perform this calculation in 15 or 16 cycles (execute packets) per pair of
channels by taking advantage of the newer C64x instruction set.
However, it became clear that a thorough understanding of the C64x memory
architecture, and not just of the instruction set, is needed to write efficient code for the
C64x. The organization of data in memory, and the proper scheduling of data accesses,
are critical to achieving good performance.
Due to the C64x cache architecture, the actual performance can only be determined
accurately by using the cache simulator. As a result, many versions of the code had to
actually be implemented to test the various hypotheses. This makes code development
for the C64x more difficult, and less predictable, than for the C6203.
In the particular case of loop16, effectively 26.9 CPU clock cycles are needed to compute
the energy, time, and χ² for each pair of channels, provided all channels are in high gain.
This corresponds to an execution time of 2.9 µs on a 600 MHz DSP for one FEB (128
channels).
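As a cross-check of these numbers: 128 channels are 64 channel pairs, so

\[
26.9 \times 64 \approx 1722\ \text{cycles}, \qquad
\frac{1722\ \text{cycles}}{600\ \mathrm{MHz}} \approx 2.9\ \mu\mathrm{s\ per\ FEB}.
\]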
The cycle count can be reduced by not performing the time computation for channels
below a given energy threshold.
The C64x performance on other tasks, such as histogramming, has not yet been fully
investigated. Filling histograms is expected to exacerbate the C64x memory access
inefficiency: even with excellent miss pipelining, any access to a histogram bin would
cost at least 2 CPU stall cycles. In addition, histogramming, or any other task requiring
a non-negligible amount of memory (compared to the L1D size), is likely to push the
calibration constants and/or division table entries out of the L1D, thus causing
worst-case timings for the main optimal filtering computation itself.
References

[1] S. Böttcher, the loop26 optimal filtering program for the C6203.
[2] Texas Instruments, TMS320C6000 CPU and Instruction Set Reference Guide.
[3] Texas Instruments, TMS320C6000 Peripherals Reference Guide.