Note 64 X
S. Simion
Nevis Labs, Columbia University, 13 September 2001
Abstract
There has been increasing interest within the Liquid Argon ROD group in the
latest generation of DSPs from TI as a possible solution for the LArg ROD system.
This note describes the status of an ongoing study of the C64x DSP.
1 Introduction
For the LAr ROD, the DSP performs two main computing tasks: a) running the optimal
filtering code to produce the energy, time, and pulse quality of the signal; and b) filling
histograms for online data monitoring.
These two tasks are qualitatively different in the way they exercise the DSP architecture,
so both need to be considered when choosing the DSP.
However, only the first task (a) is currently very well defined. We aim to design a
system flexible enough to cover some uncertainty in (b), and possibly to allow
other tasks to be added, such as lossless data compression.
The implementation of the optimal filtering algorithm which has been most extensively
tested and studied is the C6203 implementation (developed by S. Böttcher), also known
as the loop26 program [1]. This code was written specifically for the C6203 architecture.
When studying the C64x, it is important to carefully consider the substantial differences
in architecture between these DSPs.
Table 1 shows the key features of the C64x core as compared to the C6203 core.

Table 1: Key features of the C64x core versus the C6203 core.

                              C6203    C64x
General-purpose registers        32      64
2 Performance of the C62x code on the C64x platform
The C64x platform supports the complete C6203 instruction set, so in principle the
loop26 program can run on the C64x with no modification. However, due to the
architecture differences, we do expect a loss in efficiency. Reasons for this include:
• Cross-path forwarding constraints, described in Section 5.6.2 of the Instruction Set
Reference [2].
• Banking structure of the L1D memory, described in Section 3.4.4 of the Peripherals
Reference [3].
Unlike the C6203, the C64x does not have two separate memory blocks; therefore,
accessing contiguous L1D memory means accessing all the banks in turn. Bank
conflicts can still be avoided by suitable pointer arithmetic.
• L1D and L1P cache architecture, described in Section 3.4 of the Peripherals Reference.
The C64x has only a 16 kByte L1 data cache (L1D) running at CPU speed. The bulk of
the memory (L2 SRAM) is slower.
It is not possible to lock any data in the L1D cache.
It is reasonable to expect that the calibration constants will not stay in the L1D cache
from one event to the next, since we will be doing some histogramming in between.
The price to pay for an L1D miss (i.e. an attempt to read data which is not cached) is
anything between 2 and 8 clock cycles, as shown in Table 3-8 of the Peripherals Reference [3].
By examination of the loop26 code, we identified 10 cross-path stalls which would occur
for every pair of channels (i.e. per loop iteration), should the code run on the C64x
platform.
We also identified 5 potential memory banking conflicts for every pair of channels, i.e.
5 execute packets with two word loads (LDW) in parallel. We disregarded writes, since
on the C64x these would most likely go directly to the L2 (remember that the L1D is a
write-through cache). Therefore, assuming that memory banking conflicts occur
randomly, and given that there are eight 32-bit-wide banks, we predict 0.625 stalls on
average per pair of channels.
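As a quick check of that estimate: each of the 5 execute packets issues two parallel word loads, which stall only when both loads hit the same one of the eight banks, so under the random-access assumption

\[
5 \times \frac{1}{8} = 0.625 \ \text{stalls per pair of channels.}
\]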
It is difficult, if not impossible, to predict the L1D cache performance without using a
simulator. For the remainder of this document, all performance results have been
obtained using the Code Composer Studio¹ (CCS) version 2 Cache Simulator.
Simulated performance results for the loop26 code running on the C64x platform are
shown in Table 2 below. The effective cycle count has roughly doubled (53 to 62 clock
cycles for two channels), but this is compensated by the C64x clock frequency of
600 MHz, compared to 300 MHz for the current² C6203. The results are in good
agreement with the predicted performance, and show that loop26 is not suitable for the
C64x DSP. We conclude that the optimal filtering code, and most of the existing DSP
software, must be completely rewritten for the C64x.
1. Part number TMDU324685C-07.
2. We are told that C6203 400-MHz devices may be available in 1Q 2002.
Table 2: Simulated performance results for loop26 running on the C64x platform. All numbers are
for 128 channels.
L1P misses 74 0
3 Porting the optimal filtering code to the C64x

The purpose of this exercise was to write the same algorithm for the C64x DSP in order
to:
• Comply with the new resource constraints: eliminate the L1D bank conflicts and
cross-path stalls;
• Take advantage of some new opcodes which are not present in the C6203;
• Optimize the data accesses.
3.1 Simplified analysis of the algorithm in view of porting to the C64x platform
The loop26 code implements the calculation of E, t, and χ² in the following way:

\[
E = \Bigl( \sum_{i=0}^{4} a_i s_i - a_5 \Bigr) \cdot 2^{n} \tag{1}
\]

\[
T = \Bigl( \sum_{i=0}^{4} b_i s_i - b_5 \Bigr) \cdot 2^{m} \cdot T_1/E \tag{2}
\]
In these formulae, the s_i are the five 12-bit ADC values, and a_0 to a_4 and b_0 to b_4 are 16-bit
signed calibration coefficients; a_5 and b_5 are 32-bit pedestals, also part of the calibration
coefficients. The correct energy scale, which accounts for the gain (high, medium, or
low), is obtained after multiplication by 2^n, where n is a gain-dependent³ parameter, also
part of the calibration coefficients. This allows use of the full precision of the 16-bit
multiplication coefficients (a_0 to a_4) even in high gain.
For the time calculation, the most significant bits of the energy fraction⁴ are used as an
index into the division table, to retrieve the 16-bit value T_1/E. The multiplication by 2^m is
necessary to account for the energy magnitude⁵.
The formula used to define the pulse shape χ² is:

\[
\chi^2 = \sum_i s_i^2 - 2p \sum_i s_i + 5p^2
+ \frac{E}{2^{14}} \cdot
\frac{\dfrac{E}{2^{14}} \cdot \dfrac{\sum_i h_i^2}{2^{12}} + 2p \sum_i h_i - 2 \sum_i h_i s_i}{2^{12}}
\tag{3}
\]
Here the 16-bit values −2h_i are the calibration coefficients describing the pulse shape.
They are normalized to satisfy Σ_i a_i h_i = 2^26. Energy multiplications are performed with
bits 14 to 29 of the energy sum result. 2p Σ_i h_i and p² are 32-bit constants, and (Σ_i h_i²)/2^12 is
stored in 16 bits.
In the 6203 implementation, all multiplications are 16-bit by 16-bit. Multiplications by
32-bit operands are actually multiplications by the upper 16 bits of these operands; the
lower 16 bits are lost.
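In other words (a one-line plain-C illustration with a hypothetical function name; on the C62x this corresponds to the MPYH/MPYHL-style opcodes):

```c
/* Plain-C sketch of what the C62x-era code does when it "multiplies by
 * a 32-bit operand": only the operand's upper 16 bits take part, so the
 * lower 16 bits are discarded. */
static inline int mpy_by_hi16(short coeff, int operand32)
{
    return coeff * (short)(operand32 >> 16);   /* 16x16 -> 32-bit product */
}
```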
The 6203 code which computes the energy uses 5 MPY instructions and 6 arithmetic⁶
instructions (including the shift instruction used to scale by 2^n). The time computation
uses 6 MPY instructions and 6 arithmetic instructions. In addition, to compute the
index into the division table, 4 more arithmetic instructions are needed for every channel.
The χ² calculation uses 13 MPY instructions and 17 arithmetic instructions for each
channel, including the shift instruction used to compute E/2^14. A few more arithmetic
instructions are needed to pack the time and χ² information into a single 32-bit word.
Finally, more instructions are employed in pointer arithmetic (computation of the
calibration constants' address as a function of the gain, etc.). The corresponding
assembly code is not discussed here, since this part is not likely to benefit from the new
C64x instruction set. Only a global instruction count will be given.
Let us now turn towards the newer C64x instruction set. The most common operation
needed in our application is the scalar product of length 5 followed by pedestal
subtraction, which requires 5 ADD and 5 MPY instructions on the 6203. On the C64x,
3. The value n is also channel-dependent.
4. Any non-zero number may be written as 00001xxxxx... or 1.f, where f is the fraction.
5. The word magnitude has its full meaning here, since m is directly related to log₂E.
6. Instructions which are not loads, stores, or MPY are referred to as arithmetic instructions.
the same can be done with 2 DOTP2, one MPY, and 3 ADD instructions (provided the
input data are packed appropriately).
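As an illustration of this instruction count, here is a minimal sketch in C using the TI compiler intrinsics that map onto these opcodes. The function and variable names are ours, not from loop26, and the sketch assumes the samples and coefficients are already packed two per 32-bit word:

```c
#include <c6x.h>  /* TI C6000 intrinsics header in recent TI toolchains;
                     older compilers declare _dotp2(), _mpy() implicitly */

/* Length-5 scalar product with pedestal subtraction:
 * 2 DOTP2 + 1 MPY + 3 ADD-class instructions, as counted above.
 * a01 = a1:a0 and a23 = a3:a2 packed as 16-bit halves; s01, s23
 * hold the samples packed the same way; ped is the 32-bit a5. */
static inline int raw_energy(int a01, int a23, int a4,
                             int s01, int s23, int s4, int ped)
{
    int acc = _dotp2(s01, a01)    /* a0*s0 + a1*s1          (DOTP2)      */
            + _dotp2(s23, a23);   /* + a2*s2 + a3*s3        (DOTP2, ADD) */
    acc += _mpy(s4, a4);          /* + a4*s4 (16 LSBs only) (MPY, ADD)   */
    return acc - ped;             /* pedestal subtraction   (SUB)        */
}
```

If the FEB words were not already paired this way, two _pack2() operations per channel would rearrange the halfwords first, which is the PACK2 cost discussed below.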
Furthermore, 32-bit by 16-bit multiplications (with right-shift and rounding to a 32-bit
result) are available on the C64x. This type of multiplication will be used as follows (a
plain-C sketch of the operation follows this list):
• In the energy calculation, the final scaling by 2^n becomes a scaling by an arbitrary
16-bit coefficient. This coefficient becomes part of the calibration constants, in place of
n. The energy scale precision is improved, and we save an arithmetic operation.
• In the time calculation, the final 16-bit by 16-bit multiplication by T_1/E becomes a
32-bit by 16-bit multiplication. The division table remains in 16-bit precision, but we
no longer throw away the lower 16 bits of the E·t sum. The shift operation is
still needed to take the energy magnitude into account.
• In the χ² calculation, the two 16-bit multiplications by E/2^14 will be replaced with
two 32-bit multiplications by E. Once again, the improvement in precision comes for
free. The shift instruction used to compute E/2^14 is eliminated.
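The following portable-C fragment shows what such a 32-bit by 16-bit rounded multiplication computes; on the C64x itself it would map onto the MPYxIR-family opcodes (which round by adding 2^14 and then shift right by 15). The function name and the 'scale' coefficient are illustrative assumptions:

```c
/* 32x16 multiply with right-shift and rounding, written in plain C.
 * 'scale' stands for the hypothetical 16-bit calibration coefficient
 * that replaces the 2^n scaling in the energy calculation. */
static inline int scale_round(int x32, short scale)
{
    long long p = (long long)x32 * scale;   /* full 48-bit product         */
    return (int)((p + (1 << 14)) >> 15);    /* round to nearest, shift 15  */
}
```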
Finally, if we wish to preserve the current packing of FEB data (done in hardware), then
two PACK2 instructions must be added per channel in order to make use of the DOTP2
multiplications.
Also new in the C64x are the MPY2 instructions. Although one such instruction could
replace two single MPY instructions, it usually requires special preparation (PACKing
of data) and does not save an ADD. Therefore MPY2 instructions are only justified if the
application is limited by MPYxx throughput.
Table 3 summarizes the instruction count needed to compute E, t, and χ², for both the
C6203 and C64x implementations. The instructions used in the core calculation are
counted exactly. The table also shows the total instruction count in loop26.

In conclusion, it should in principle be possible to write a C64x implementation of the
optimal filtering algorithm using 15 to 16 execute packets per pair of channels.
Anticipating a little, we also list in Table 3 the total instruction count in the new
loop16 (C64x) implementation. However, fully understanding the loop16 implementation
requires reading through the remainder of this document.
In the C64x architecture, all input data must go through the L1D cache. The FEB data
for 128 channels, 5 samples each, occupy 20 L1D cache lines. The calibration constants for
128 channels, one gain, occupy 104 cache lines. The division table occupies 128 cache
lines. The L1D size is 256 cache lines in 128 sets.
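A quick arithmetic check of these line counts, using the 64-byte line size implied by the quoted figures (16 kByte / 256 lines = 64 bytes per line; 256 lines in 128 sets means a two-way set-associative cache) and assuming the FEB samples are packed two channels per 32-bit word, as in the x0y0 notation of the figures below:

\[
\text{FEB data: } \frac{(128 \cdot 5/2) \cdot 4\,\mathrm{B}}{64\,\mathrm{B}} = 20\ \text{lines}, \quad
\text{calibration: } \frac{128 \cdot 13 \cdot 4\,\mathrm{B}}{64\,\mathrm{B}} = 104\ \text{lines}, \quad
\text{division table: } \frac{4096 \cdot 2\,\mathrm{B}}{64\,\mathrm{B}} = 128\ \text{lines}.
\]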
The algorithm needs 13 calibration data words per channel and per gain: four words
for the energy calculation, 3.5 for the time, and 5.5 for χ². The memory layout of the
calibration constants may impact the L1D performance, and a few different layouts may
be envisaged.

Table 3: Minimum number of instructions to compute the E, t, and χ² calculation for one channel,
according to the formulae presented above.

                C6203    C64x
Total memory       19      11

Various possible organizations of calibration constants in L1D cache lines are
shown in Figure 1.
One possibility (A) is to use only the Type 0 arrangement, in which one cache line holds
the 13 calibration words for a single channel (for one gain). In this case, a total of 128
cache lines are needed for 128 channels (for one gain). This arrangement is the most
tolerant to gain variation: since no channels share the same cache line, there will be no
additional L1D misses in case various channels come in different gains. However, the
disadvantages are: a) 3 out of 16 words are lost⁷; and b) L1D misses cannot be pipelined,
since there will be at most one L1D miss per channel.
The second possibility is to spread the calibration constants for one channel among a
few cache lines, using a combination of Type 1, Type 2, and Type 3 cache lines.
7. Precisely, for every L1D cache miss, a full cache line will be read from L2, of which we use only 13 words.
Figure 1: Possible organizations of the calibration constants in L1D cache lines. A Type 0 line
holds the 13 calibration words of a single channel (1 channel per L1D cache line); a Type 3 line
holds one word for each of 16 channels, ch.0 to ch.f (16 channels per L1D cache line).
One possibility (B) is to use one Type 1, one Type 2, and one Type 3 cache line for each
channel. Another possibility (C) is to use three Type 2 and one Type 3 cache lines for each
channel. In both cases, only 104 cache lines are needed for 128 channels. In case (B) we
have a total of 64 Type 1, 32 Type 2, and 8 Type 3 lines for 128 channels; in case (C) we
have 96 Type 2 and 8 Type 3 lines.
In case (B) we could arrange for one L1D miss every 4 channels for Type 1, one every 8
channels for Type 2, and one every 32 channels for Type 3. In case (C) we could arrange
for three L1D misses every 4 channels for Type 2, and one every 32 channels for Type 3.
Arrangement (C) allows the best L1D miss pipelining, but is the most sensitive to gain
variation.
When loading the FEB input data for a pair of channels, we must perform two
double-word loads (referred to as "FEB 1") and one single-word load (referred to as
"FEB 2"), for a total of five 32-bit words holding the five samples.
Table 4 shows the L1D miss pipelining which could be achieved. The different types of
calibration data are listed in decreasing order of L1D miss rate, assuming that all
channels come in high gain. The incoming FEB data are listed in increasing order of
L1D miss rate. For the best L1D pipelining, the data should be read in this same order.
Ordering matters because we do not wish to isolate L1D misses occurring, for example,
every 8 channels, by inserting a memory access which causes an L1D miss only every 32
channels.
Following the same argument, we do not wish to interleave FEB input data reads with
calibration constant reads. Reading FEB data is guaranteed to cause L1D misses, while
calibration constants might remain in the L1D cache from one event to the next. Even
though the latter is unlikely, it is not desirable to leave the possibility of gaps between
accesses to FEB data.
A tentative implementation of scheme (A) is shown in Figure 2, scheme (B) in Figure 3,
and scheme (C) in Figure 4.
Table 4: L1D pipelining for the E-t-χ² algorithm.
Type 1 0 64 0 0 2 every 4 0
LDW.D1 x4y4 ; type “2” FEB data for 2 channels, miss every 32 channels
LDDW.D1 x0y0:x1y1 ; type “1” FEB data for 2 channels, miss every 16 channels
|| LDDW.D2 x2y2:x3y3 ; type “1” FEB data for 2 channels, miss every 16 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “0” calibration data, miss every 2 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “0” calibration data, miss every 2 channels
...
{loading the remaining 22 words of calibration coefficients, no L1D miss}
Figure 2: C64x implementation of scheme A.
LDW.D1 x4y4 ; type “2” FEB data for 2 channels, miss every 32 channels
LDDW.D1 x0y0:x1y1 ; type “1” FEB data for 2 channels, miss every 16 channels
|| LDDW.D2 x2y2:x3y3 ; type “1” FEB data for 2 channels, miss every 16 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “1” calibration data, miss every 4 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “1” calibration data, miss every 4 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDW.D1 {calibration coefficients for channel “x”} ; type “3” calibration data, miss every 32 channels
|| LDW.D2 {calibration coefficients for channel “y”} ; type “3” calibration data, miss every 32 channels
...
{loading the remaining 16 words of calibration coefficients (12 type “1” and 4 type “2”), no L1D miss}
Figure 3: C64x implementation of scheme B.
The L1D performance has been simulated for each case, using the Code Composer
Cache Simulator. As explained above, we expect a performance degradation in cases (B)
and (C) if all 128 channels do not come in the same gain, since Type 1, Type 2, and Type 3
cache lines all contain calibration data for different channels but for the same gain.
Therefore the following scenarios have been simulated: a) all channels in high gain;
b) some fraction of the channels (5%, 10%, 20%) in medium gain, with all other
channels in high gain.
The simulation has been performed with a dummy C64x code performing only the data
accesses, not the real optimal filtering and χ² computation. However, the spacing
between every pair of channels was 17 cycles, as estimated in Section 3.1.
LDW.D1 x4y4 ; type “2” FEB data for 2 channels, miss every 32 channels
LDDW.D1 x0y0:x1y1 ; type “1” FEB data for 2 channels, miss every 16 channels
|| LDDW.D2 x2y2:x3y3 ; type “1” FEB data for 2 channels, miss every 16 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDDW.D1 {calibration coefficients for channel “x”} ; type “2” calibration data, miss every 8 channels
|| LDDW.D2 {calibration coefficients for channel “y”} ; type “2” calibration data, miss every 8 channels
LDW.D1 {calibration coefficients for channel “x”} ; type “3” calibration data, miss every 32 channels
|| LDW.D2 {calibration coefficients for channel “y”} ; type “3” calibration data, miss every 32 channels
...
{loading the remaining 12 words of (type “2”) calibration coefficients, no L1D miss}
Figure 4: C64x implementation of scheme C.
The main simulation result is shown in Table 5. Although scheme (C) is the most sensitive
to gain variation, it outperforms schemes (A) and (B) even with 20% of the channels in
medium gain. This is of course due to the excellent L1D miss pipelining achieved in this
configuration. We can see from Table 6 that the average penalty for an L1D miss in
scheme (C) can be as low as 2.5 CPU clocks, compared to 4.1 clocks for scheme (A). Also
note (from Table 2) that loop26 had an average L1D miss penalty of 5.4 clocks.
Table 5: L1D stall cycles per FEB, for a cold start
Gains A B C
Table 6: L1D stall cycles per L1D read miss, for a cold start
Gains A B C
Table 7: L1D read misses per FEB, for a cold start
Gains A B C
For the sake of completeness⁸, Table 7 shows the total number of L1D misses per event.
The same results are shown graphically in Figure 5.
Note that we have not yet considered accesses to the division table; only the reading of
input FEB data and calibration constants has been considered so far. Nevertheless,
scheme (C) has been selected for the C64x loop16 implementation.
Currently the algorithm uses a division table with 2^12 = 4096 entries of 16 bits each.
To achieve the desired precision, the most significant 12 bits of the energy
fraction are used as an index into the division table. For example, the same table entry
is used to divide by 3, 6, 12, 24, etc. (all these numbers have their fraction
equal to 0x800).
The same entry in the division table is used for a large range of energies. Conversely,
small energies may require division table entries which are far apart. More precisely,
all odd positive integers have different fractions; therefore, all odd positive integer
energies smaller than 2^13 will use different division table entries.
On the C64x, the L1D cache set index is given by bits 6 to 12 of the physical address. For
the division table, this means the most significant 7 bits of the energy fraction. As a
consequence, all odd positive integer energies smaller than 2^8 will use different division
table L1D cache lines. The number of division table cache lines needed thus increases
linearly, as E/2, until it reaches the maximum of 128.
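The following is an illustrative reconstruction of this indexing scheme: normalize E to the form 1.f and use the 12 most significant fraction bits as the table index. __builtin_clz is the GCC/Clang builtin; TI's compiler would use an equivalent such as _lmbd(). The function name is ours:

```c
/* Hypothetical sketch of the division-table index computation.
 * Requires e > 0. */
static unsigned div_index(unsigned e)
{
    int msb = 31 - __builtin_clz(e);             /* position of leading 1 */
    /* bring the 12 bits just below the leading 1 down to bits 11..0     */
    unsigned frac = (msb >= 12) ? (e >> (msb - 12))
                                : (e << (12 - msb));
    return frac & 0xFFF;                         /* 12-bit fraction       */
}
```

For example, div_index() returns 0x800 for 3, 6, 12, 24, as stated above. Since each table entry occupies 2 bytes, address bits 6 to 12 correspond to bits 5 to 11 of this index, i.e. to the 7 most significant fraction bits, which is the cache-line statement made in the text.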
Division table read misses will occur randomly. However, in the C64x implementation,
we tried to place these reads close to the other reads which are known to generate L1D
misses.
Since reading from the division table is expensive, in the loop16 implementation we
allow a slight deviation from the loop26 philosophy: an energy cut may be specified,
which determines for which channels the time is to be computed. Effectively, only the
retrieval of 1/E, and the subsequent storage of time and χ², are suppressed for channels
with E < E_cut. This feature can of course be bypassed, in which case the program
computes and stores the time for all channels.

8. Table 5, Table 6, and Table 7 are not independent.
Figure 5: C64x L1D performance for reading FEB data and calibration coefficients, for schemes
A, B, and C, as a function of the fraction of channels in medium gain (0 to 25%). Top to bottom:
L1D stall cycles per FEB; L1D stalls per read miss; L1D misses per event.
Table 8: Simulated performance results for loop16 running on the C64x platform. All numbers are
for 128 channels. All channels are in high gain. The program is in the L1P cache.
Cross-path stalls 0 0
Table 8 shows the simulated performance of the loop16 code. Although the number of
execute packets per pair of channels has been reduced from 26 to 16, the effective cycle
count to compute E, t, and χ² for all channels is almost double, mostly due to the L1D
cache inefficiency. The average penalty per L1D miss is 3.3 cycles, slightly worse than
the 2.54 cycles achieved with scheme (C). This is of course due to the accesses to the
division table, which occur randomly, so miss pipelining is less efficient.
If a very large energy cut is specified, the accesses to the division table are absent,
and the L1D miss penalty becomes 2.9 cycles. Again, this is slightly higher than the 2.54
cycles achieved with scheme (C). In fact, we had to deviate slightly from scheme (C) to
avoid increasing the loop latency, which would have resulted in much higher register
pressure.
4 Conclusion
This note shows the first steps in the evaluation of the C64x DSP for the ROD. The core
calculation of the energy, time, and χ² has been investigated. We determined that it is
possible to perform this calculation in 15 or 16 cycles (execute packets) per pair of
channels by taking advantage of the newer C64x instruction set.
However, it became clear that a thorough understanding of the C64x memory
architecture, and not just of the instruction set, is needed to write efficient code for the
C64x. The organization of data in memory, and the proper scheduling of data accesses,
are critical to achieving good performance.
Due to the C64x cache architecture, the actual performance can only be determined
accurately by using the cache simulator. As a result, many versions of the code had to
actually be implemented to test the various hypotheses. This makes code development
for the C64x more difficult, and less predictable, than for the C6203.
In the particular case of loop16, effectively 26.9 CPU clock cycles are needed to compute
the energy, time, and χ² for each pair of channels, provided all channels are in high gain.
This corresponds to an execution time of 2.9 µs on a 600 MHz DSP for one FEB (128
channels).
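As a cross-check of these numbers: 128 channels are 64 channel pairs, so

\[
26.9 \times 64 \approx 1722\ \text{cycles}, \qquad
\frac{1722\ \text{cycles}}{600\ \mathrm{MHz}} \approx 2.9\ \mu\mathrm{s\ per\ FEB}.
\]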
The cycle count can be reduced by not performing the time computation for channels
below a given energy threshold.
The C64x performance on other tasks, such as histogramming, has not yet been fully
investigated. Filling histograms is expected to exacerbate the C64x memory access
inefficiency: even with excellent miss pipelining, any access to a histogram bin would
cost at least 2 CPU stall cycles. In addition, histogramming, or any other task requiring
a non-negligible amount of memory (compared to the L1D size), is likely to push the
calibration constants and/or division table entries out of the L1D, thus causing
worst-case timings for the main optimal filtering computation itself.
References

[1] S. Böttcher, the loop26 optimal filtering program for the C6203.
[2] Texas Instruments, TMS320C6000 CPU and Instruction Set Reference Guide.
[3] Texas Instruments, TMS320C6000 Peripherals Reference Guide.