A Switch Design for Multi-Processor System-on-Chip

Master's Thesis
Department of Electronics Engineering and Institute of Electronics
National Chiao Tung University

Student: Pao-Jui Huang
Advisor: Dr. Jing-Yang Jou

July 2004
A Thesis
Submitted to the Department of Electronics Engineering
College of Electrical Engineering and Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of
Master of Science
in
Electronics Engineering

July 2004
A Switch Design for Multi-Processor System-on-Chip

Student: Pao-Jui Huang          Advisor: Dr. Jing-Yang Jou

Institute of Electronics
National Chiao Tung University

ABSTRACT
Driven by the advance of semiconductor technology, it will be possible to integrate hundreds of processing elements on a single chip in the next decade. At that point, communication between the components will become the limiting factor for system performance, and IC designers will need a system design methodology that takes communication performance into account. In this thesis, we propose a communication architecture for multi-processor systems-on-chip. With proper configuration, the architecture can provide different data exchange mechanisms, such as circuit switching, packet switching, and dedicated bus. System designers can also benefit from our framework to analyze the system performance and make better decisions at higher levels, because our platform exhibits predictable performance. Experiments on performance evaluation show that the communication fabrics can efficiently transfer data within the system.
                     Acknowledgements
I would like to express my sincere gratitude to my advisor, Professor Jing-Yang Jou, for his suggestions and guidance throughout the course of this thesis. I am also indebted to Cheng Yeh Wang and Lin Yu Ling for their great help on my research. Special thanks to all members of the EDA lab and my friends in the Mountain Club for their friendship. Finally, I would like to show my appreciation to my family and Mei Hsuan Chen for their love and support.
Contents

Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Preliminaries
    2.1 Topology
Chapter 3 Platform and Switch Design
    3.3 Transaction
    3.6 Performance
Chapter 4 Experimental Results
    4.1 Definition
Chapter 5 Conclusion and Future Work
Reference
Vita
List of Tables

Table 1: Histograms of normalized latency under different injection rates
Table 2: Histograms of normalized latency under different buffer sizes of virtual channels
List of Figures

Figure 1: Protocol stack of inter-network
Figure 21: Output arbitration
Chapter 1
    Introduction
As semiconductor technology advances, SoCs in the next decade are expected to integrate hundreds of processing elements on a single chip to deliver enormous computing power. However, designers encounter some new problems: wire delay becomes the limiting factor for signal delay, communication between computing components becomes the bottleneck of system performance, and system design becomes more complex. It has been observed that the traditional design flow is incapable of solving these problems. Designers need not only new system architectures but also a new design methodology to conquer them.

Advances in semiconductor technology will make it possible to integrate multi-billion transistors on a single chip within ten years [9]. A
chip fabricated with 50nm technology can work at around 10GHz or faster. At that point, wire delay will dominate the signal delay [16]. The gate delay of a transistor scales down linearly, whereas wire delay remains roughly constant with scaling. Although larger wire delay can be managed with wire-pipelining techniques, it is unavoidable for designers to account for wire delay throughout the design flow.
Clock synchronization in future systems is another problem for system designers. Because clock skew is no longer negligible, synchronizing all components on the chip with a single clock will become almost impossible. A global wire that spans the whole chip, like the clock signal, may conduct signals with a latency that exceeds the clock cycle, which makes the synchronization problem even more serious.

The shared-bus architecture, such as the AMBA bus [13], is the most convenient architecture in current SoC integration. Such architectures support broadcast transmission at low implementation cost, but simultaneous requests degrade the system performance and cause extra power consumption. These problems make designers discard the shared-bus architecture and search for new communication architectures. Some studies propose different solutions, like routing packets instead of wires [2] or using different topologies for specific applications [10]. Selecting a suitable architecture for future systems is still an open problem.
Another important issue is time to market. Designing a system becomes more complex as more components are integrated, and the traditional design flow is not sufficient to conquer this problem. The design trend is toward system-level design, in which the impact of the communication architecture on system performance must also be considered.
It seems feasible to solve these new problems by applying concepts that are maturely developed in other fields. Some researchers adapt the layered method from the traditional inter-network to manage on-chip communication [4][8]. They view the layered protocol stack as a systematic method to build the on-chip communication infrastructure. Figure 1 shows the protocol stack paradigm adapted from the inter-network; with bottom-up construction, the layers span increasing design abstraction levels [4]. System designers and architecture designers can work together to implement a system at different abstraction levels. This method also maximizes the reusability of components.
However, there are some differences between the communication infrastructure of an SoC and a wide area network, such as the limited memory space of an on-chip system and the physical issues of fabrication.
Driven by the advances of semiconductor technology, future SoCs will integrate more functional components within the same die size to obtain more computing power. At that point, the communication among computing components becomes the most critical factor and makes it more complex to design the system and to predict the system performance. It is an urgent issue to balance the communication and computation power over the whole system.
In this flow, we first separate the system design into two parts: functional modeling and architecture modeling. In functional modeling, designers analyze the application and perform task partition and job scheduling; designers can collect rough system information at this stage. In architecture modeling, designers build the candidate platform from a library that contains the simulation models of various computation and memory components of the platform.

After function and architecture modeling, system designers map and allocate the scheduled tasks onto the platform they chose in architecture modeling. By simulating the mapped system, designers can obtain detailed information about the whole system and make better design trade-offs. With this method, system designers can refine the implementation of the design at a higher level. Traditional design flows follow after system designers decide the details of the implementation.
There have been various studies in this field. We present some of them that are related to our work.

1.4.1 Protocol stack

Some studies partition on-chip communication into layers to maximize reuse and provide programmers with an abstraction of the underlying communication framework. The OSI Reference Model is adapted to the layered design of the on-chip network [4][8].

1.4.2 Network architecture

An analysis of why the shared bus, which dominates system integration now, will not meet the performance requirements of future systems is presented in [1], and a generic network architecture is proposed as a solution. Methodologies for analyzing system performance are studied in [14] and [15]. In their experiments, the authors discuss the relative strengths and weaknesses of the considered architectures for system design.
Some studies address the network fabric design. A circuit-switching architecture, SoCBUS, is proposed in [5]. SoCBUS has very good properties in providing guaranteed bandwidth and is suitable for building real-time systems. However, it is not suitable for general-purpose computing that exhibits random traffic patterns. A hybrid router design that supports both guaranteed and best-effort traffic is proposed in [6]; it inherits the properties of both switching techniques but still suffers from low channel utilization.
Based on the premise that communication will become the bottleneck of future systems, we propose a novel network platform and related infrastructure for on-chip communication in this thesis. System designers can also benefit from our framework to analyze the system performance and make better decisions at higher levels because our platform exhibits predictable performance. In Chapter 2, we introduce preliminaries of constructing networks. In Chapter 3, we present details of our platform design and a novel switch design for on-chip communication. We verify the correctness of our platform and study some design space explorations in Chapter 4. Finally, we give the conclusion and future work in Chapter 5.
Chapter 2
    Preliminaries
In this chapter, we introduce basic concepts and issues related to constructing a network. This chapter provides the background knowledge for our platform and switch design.
2.1 Topology
The word “topology” defines how the nodes are interconnected by channels and is usually modeled by a graph [7]. The nodes include communication fabrics, bridges, and processors. Major network topologies can be categorized as direct networks and indirect networks. In a direct network, nodes are connected directly with each other by the network. In an indirect network, nodes are connected through one or more intermediate switch nodes, which perform the routing and arbitration operations. Because of different performance requirements and cost trade-offs, many different network topologies have been designed for specific applications [11]. We give a brief description of some popular topologies below.
1. Orthogonal topologies

In an orthogonal topology, nodes are arranged in an orthogonal n-dimensional space. The most popular direct networks are the k-ary n-dimensional mesh, the k-ary n-dimensional torus (k-ary n-cube), and the hypercube, as shown in Figure 3.

Figure 3: 4-ary 2-dim mesh, 4-ary 2-dim torus, and 2-ary 4-cube (hypercube)
2. Other direct network topologies

In addition to the topologies defined above, many other topologies have been proposed, including hierarchical topologies that reduce the degree of each node. The tree topology provides the advantage of low implementation cost; each node on the topology is in turn connected to a parent node, which keeps the routing algorithm simple.
1. Crossbar networks

Crossbar networks allow any node in the system to communicate with any other node simultaneously. The main disadvantage of the crossbar is its cost, so it has traditionally been used in small-scale systems [11].
2. Multi-stage interconnection networks

Multi-stage interconnection networks (MIN) connect input nodes to output nodes through switch stages, each of which is a crossbar network. The number of stages and the connections between switch stages determine the routing capability of the network. The fat-tree is one classical MIN topology; a fat-tree network can provide multiple data paths from a source node to a destination node depending on the path usage.

Among these topologies, the 2-D mesh is considered the most suitable topology for an on-chip network because it has an acceptable wire cost and reasonably high bandwidth, and it makes it easy to group components on a plane.
2.2 Switching strategy

The switching strategy is the method used to exchange data between network nodes.

1. Circuit switching

Connection-oriented switching is also named circuit switching because the connection from source to destination is built before data transmission. Once the connection is established, data from source to destination can be transmitted with guaranteed bandwidth and will be delivered without any contention. With this advantage, we can employ it to build a real-time system.

2. Packet switching

This strategy partitions data into several packets before transmission. The routing and arbitration decisions are made in each intermediate switch before a packet is forwarded to the next node. The header information of each packet is extracted by the intermediate switch to determine the output over which the packet is to be forwarded. Different buffering policies distinguish the packet switching techniques. Store-and-forward switching buffers the whole packet in each switch before forwarding it, which works well when messages are short and frequent. However, the implementation of store-and-forward switching is expensive because a switch should have enough buffer space to hold a whole packet.
3. Virtual-cut-through switching

Unlike store-and-forward switching, in which a switch must hold the whole packet before forwarding it, virtual-cut-through switching starts the transmission as soon as the routing decision for the packet is determined and the output channel is free. The packet does not even have to be stored at the output buffer and can cut through to the input of the next switch before the complete packet has been received. When the packet is blocked by a busy output channel, virtual-cut-through switching will hold the complete message in the switch.
4. Wormhole switching

The requirement to buffer a whole packet in the switches makes it difficult to construct fast and small switches. In wormhole switching, packets are pipelined through the network in small flow-control units, so the buffer requirements of the switches are reduced compared with virtual-cut-through switching. If a packet is blocked in the network, the buffers in a single switch cannot hold the whole packet, so the blocked packet occupies buffers in several switches. This degrades the network performance because a packet blocked by other packets will occupy buffers in the switches on part of its transmission path, similarly blocking other packets.
Deadlock is the network state in which some packets cannot advance toward their destinations because the buffers they request are full. As shown in Figure 6, all the packets involved in a deadlocked configuration are blocked forever. Virtual channels are proposed to solve the deadlock problem in wormhole switching. The key idea is to multiplex the physical channel to support several virtual channels. By adding a virtual channel to the output channel at each switch, as shown in Figure 7 (virtual channels), all the packets blocked in the switches continue to make progress with half the channel bandwidth, as shown in Figure 7 (virtual channels solve the deadlock problem). This technique can not only solve the deadlock problem but also improve network throughput.
2.3 Routing algorithm

Routing algorithms determine the path followed by each packet. Figure 8 presents a taxonomy of routing algorithms classified according to several criteria [7]. Routing algorithms can first be classified according to the number of destinations: packets may be sent to a single destination (unicast) or to multiple destinations (multicast). They can also be classified according to the place where the routing decisions are made: at the source node (source routing), in a distributed manner across the network (distributed routing), or by hybrid schemes. Moreover, routing algorithms can be classified according to the way they are implemented. The most popular ways consist of either looking up a routing table or executing a routing algorithm in software or hardware based on a finite state machine. In both cases, the algorithm can be either deterministic or adaptive according to whether packets transmitted between a given source/destination pair always follow the same path.
Adaptive routing algorithms can be further classified by their progressiveness into progressive and backtracking. Progressive routing moves the header forward, reserving a new channel at each routing operation. Backtracking allows the header to backtrack when it is blocked; backtracking routing algorithms are mainly used for fault tolerance. Within the scope of adaptive routing, algorithms can also be classified by minimality: profitable algorithms only move the packet closer to the destination, while misrouting algorithms may send a packet away from its destination. The last criterion is the number of alternative paths the algorithm may use, distinguishing completely adaptive from partially adaptive routing.
                     Figure 8 : A taxonomy for routing algorithms [7]
2.4 Protocol

A protocol is a set of rules for transmitting data between two devices. A protocol determines the type of handshaking convention between the sending device and the receiving device. It not only defines how senders and receivers execute communication transactions, but also determines how data flows across the network. For on-chip communication, different protocol options greatly influence reliability and power consumption.
Chapter 3

Platform and Switch Design

In this chapter, we give a complete description of our platform and switch design. We also present, through illustrations, how to use the switch to transmit messages between components. After that, we review how our platform meets the requirements of constructing a future on-chip network.
3.1 What network we need
When it comes to constructing the network infrastructure for a future system on chip, we must consider the requirements of different application domains. Some applications, such as Software Defined Radio and MPEG codecs, can be processed with parallel threads and need only local, fixed communication bandwidth. For other applications, there may be irregular traffic load among communication channels. Here we summarize some basic requirements that a future on-chip network must meet within the system design flow.
1. Efficient communication

Inefficient communication is a serious problem: it not only leaves processing elements idle, degrading the whole system performance, but also causes extra power consumption while processing elements wait for data. Thus, we must provide a network that delivers data efficiently.
2. Guaranteed throughput

For some applications, the real-time requirement is the critical issue. Circuit switching can provide guaranteed bandwidth for such applications.

3. Fault tolerance

No foundry has promised to fabricate a perfect chip without any error on it. This problem becomes more serious as more components are integrated: with hundreds of processing elements on a chip, there is a better chance that some of them are faulty. Such faults can be rare but are unavoidable. A future network infrastructure should provide mechanisms such that the whole system still works smoothly with faulty components on it.
In this chapter, we propose a novel platform and switch design as a feasible solution to these network requirements and as the network infrastructure for future multi-processor systems-on-chip.
3.2 Switch architecture
Our platform uses a 2-D mesh topology to organize on-chip components, as shown in Figure 9. The main reason for selecting the two-dimensional mesh is its acceptable wire cost and the ease of grouping components on a plane [5][11]. In our platform, each processing element is attached to a switch, and the switches are connected to form the mesh.

The architecture of the five-port switch is shown in Figure 10. The switch has four ports connecting to neighboring switches and one port connecting to the local processing element. Each port is composed of an input stage and an output stage, which are shown in Figure 13 and Figure 14.
                               Figure 10 : Switch architecture
The basic transmission procedure is illustrated in Figure 11. Suppose that a packet is sent into the current switch from the neighboring switch in the west direction and will be forwarded to the neighboring switch in the east direction. In the current switch, the packet is first received by the input stage of the west port and stored in the memory of the output stage of the east port. Once the output channel of the east port, which connects to the neighboring switch, is available, the output stage of the east port sends the packet to that neighboring switch.
                         Figure 11 : Basic transmission procedure
The interface of the switch is composed of input and output channels. Each channel contains an Address-line, a Data-line, and an Ack-line, as shown in Figure 12. The Address-line delivers the input or output address of the packet. The Data-line delivers the transmitted data. The Ack-line feeds an acknowledgement back to the source switch or processing element to report the result of the transmission. The output channel of each switch is connected to the input channel of its neighbor.

3.2.1.1 Input stage

The input stage performs three operations:

(1) The address controller extracts the packet address from the input Address-line to decide which buffer will store the data.

(2) It dispatches the input data on the input Data-line to the buffer that will store it.

(3) It collects the output acknowledgements from the ack controllers of the output stages and feeds them back to the source through the Ack-line.
3.2.1.2 Output stage
(1) There are four memory modules in the output stage of each port, called RAM in Figure 14. They all use a one-read/one-write memory architecture. The four memory modules separately store the input data received by the input stages of the other four ports. Take Figure 15 as an example: the data of packets that come from the north direction and make an east turn in the current switch will always be received by the input stage of the north port and stored in the corresponding memory module of the east port.
                             Figure 15 : Memory duty diagram
(2) The buffer controller records the size and status of the buffers in the switch and updates this information on every transaction.
(3) The ack controller checks the status of the buffer indicated by the input address and responds with an acknowledgement according to that status. For example, in Figure 16, a packet from the south direction is transmitted across the current switch to the east direction. The neighboring switch in the south direction first notifies the ack controller of the east port of the current switch to check whether there is available buffer space to store the data. According to the returned acknowledgement, the neighboring switch in the south direction knows whether the data was accepted.
                                 Figure 16 : Ack controller
Virtual channels are implemented by partitioning the memory modules; we present the implementation details in this section. Each memory module, called RAM in Figure 14, can be partitioned into several buffers to provide the necessary virtual channels. As illustrated in Figure 17, we partition each memory module into 4 buffers, resulting in a total of 16 data buffers in this port, which means a capacity of 16 virtual channels for routing packets. The size of the memory influences the flexibility of the partition. For example, a memory module with 32 words can be partitioned into two buffers of 16 words, four buffers of 8 words, or even eight buffers of 4 words.
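To make the partition arithmetic concrete, the following minimal C++ sketch (our own illustration, not code from the thesis; the type and field names are hypothetical) computes buffer sizes and base addresses for the 32-word module discussed above.

```cpp
// Sketch: partitioning one 32-word memory module of a switch port into
// equal-sized buffers, as in the 2x16 / 4x8 / 8x4 configurations above.
#include <cstdio>
#include <initializer_list>

struct MemoryModule {
    static const int kWords = 32;  // total capacity of the RAM
    int num_buffers;               // 2, 4, or 8 in the examples above

    int buffer_size() const { return kWords / num_buffers; }
    // First word of buffer `id` (0-based) inside the module.
    int base_address(int id) const { return id * buffer_size(); }
};

int main() {
    for (int n : {2, 4, 8}) {
        MemoryModule m{n};
        std::printf("%d buffers of %d words, buffer 1 starts at word %d\n",
                    n, m.buffer_size(), m.base_address(1));
    }
}
```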
The partition can be configured differently not only in each switch but also in each port, even after fabrication. The memory partition can thus be adapted to the traffic requirements of the platform: it becomes difficult to route all the packets if we provide only a few virtual channels in each switch, while, on the contrary, a smaller buffer size causes a higher failure rate for each transmission. To identify the buffers, we give each buffer in the switch a buffer-id. Figure 18 illustrates the meaning of the buffer-id expression.
For example, the buffer-id S3.E-S{3} identifies the buffer that is the third buffer of the memory module storing data from the south direction at the east port of switch 3.

In addition to the space each buffer uses to store input data, there is another memory space for each buffer, called the routing table, which records a unique buffer-id as the output address. The data stored in the current buffer will be sent to the buffer identified by this buffer-id at the next successful transaction. As one can see, the buffer with this buffer-id must be one of the buffers of the neighboring switch connected to the current buffer. By configuring the routing tables of the switches in our platform, we can set up arbitrary transmission paths.
3.3 Transaction
In the system design flow introduced in Section 1.3, after mapping the tasks onto the platform, the system designers can derive the transmission paths that the applications need. Assume that the system designers have decided all the transmission paths for routing packets. The next step is to configure our platform to form these paths. For each transmission path, we look for one available buffer in each switch along the path and reserve these buffers to form a dedicated virtual channel. Among the buffers that form this dedicated virtual channel, we repeatedly assign the buffer-id of the succeeding buffer as the output address of the current buffer. By configuring the routing tables of the buffers, we set up the transmission path.
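As a sketch of this configuration step (our own illustration; `P1.NI` and `P2.NI` stand for the processors' network interfaces and are hypothetical names), the routing tables along a path can be viewed as a linked chain of buffer-ids:

```cpp
// Each reserved buffer's routing-table entry points at the next buffer in
// the chain: source interface -> switch buffers -> sink interface.
// Buffer-ids are strings here purely for readability.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // One reserved buffer per hop, following the example in Figure 19.
    std::vector<std::string> chain = {
        "P1.NI", "S1.S-L{2}", "S2.L-N{1}", "P2.NI"};

    // routing_table[b] = output address of buffer b.
    std::map<std::string, std::string> routing_table;
    for (size_t i = 0; i + 1 < chain.size(); ++i)
        routing_table[chain[i]] = chain[i + 1];

    for (const auto& [src, dst] : routing_table)
        std::printf("%s -> %s\n", src.c_str(), dst.c_str());
}
```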
Unlike packet switching, in which a switch must decode the packet address, search for space to store the data, and compute the routing path of the packet, we simplify the duty of the switch to a lookup in the pre-configured routing table.
In Figure 19, assume that each memory in the switch is partitioned into four buffers of eight words. Suppose that one of the applications on our platform delivers data from processor-1 to processor-2 through switch-1 and switch-2. The transmission path is established by configuring the routing tables of the buffers in these two switches. First, we assign buffer-id S1.S-L{2} as the source buffer-id for processor-1 when it wants to send data to processor-2. We then assign buffer-id S2.L-N{1} as the output address of buffer S1.S-L{2}, and assign a memory address of processor-2's network interface as the output address of buffer S2.L-N{1}. This buffer chain, composed of the network interface of processor-1, the buffer S1.S-L{2}, the buffer S2.L-N{1}, and the network interface of processor-2, forms a dedicated virtual channel for this transmission.
                          Figure 19 : Path transaction procedure
We illustrate the transmission procedure in Figure 19. Assume that processor-1 sends a packet of 2 words to processor-2; we mark the words as grey circles {a} and {b} in Figure 19. First, processor-1 sends word {a} to the input stage of the local port of switch-1, and switch-1 stores word {a} in buffer S1.S-L{2}, as shown in Figure 19(a). In Figure 19(b), the output stage of the south port of switch-1 sends word {a} to the input stage of the north port of switch-2, and switch-2 stores word {a} in buffer S2.L-N{1}. At the same time, processor-1 can send word {b} to switch-1, and switch-1 stores word {b} in buffer S1.S-L{2} as it did for word {a}; this shows that we allow pipelined transactions. In the last step, the output stage of the local port of switch-2 sends word {a} to processor-2 and finishes the transmission of word {a}. Word {b} arrives at processor-2 by the same procedure.
3.4 Transactions between neighboring switches

In Figure 20, we show the interface diagram between two neighboring switches. The routing table is a mapping between the buffers of the east port of switch-1 and the buffers of switch-2, and we assign an output address for each buffer in the east port of switch-1. For example, the output address of S1.E-N{1} is S2.E-W{1}. Assume that at this moment the buffer S1.E-N{1} in switch-1 stores data that will be transmitted to switch-2. We describe the transaction procedure between these two switches below. This is a simple example, but it clearly shows our procedure, which is divided into four steps: channel privilege arbitration, output address delivery, data transmission, and acknowledgement. Each step is finished in one clock cycle, and these transactions can be executed in a pipelined manner to increase throughput.
Cycle 1: buffer S1.E-N{1} notifies the controller that it wants to access the output channel, and the arbiter grants it the channel privilege.

Cycle 2: switch-1 drives the output address of S1.E-N{1} on the Address-line; the address indicates that this transaction tries to send data into buffer S2.E-W{1}.

Cycle 3: switch-1 sends the data on the Data-line; buffer S2.E-W{1} stores the data on the input Data-line if it still has memory space, or else discards the data.

Cycle 4: switch-2 feeds the acknowledgement back on the Ack-line, and buffer S1.E-N{1} decides whether to keep or erase the data it stores. If the acknowledgement is true, the transmission was successful and buffer S1.E-N{1} erases the data; otherwise it retries the transaction later.
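The following C++ sketch mirrors this four-step handshake in the spirit of the thesis's cycle-accurate C++ model; all names and the fixed-capacity queue are our own assumptions, and for clarity the steps are shown sequentially rather than pipelined.

```cpp
// Minimal cycle-level sketch of one transaction: arbitration, address,
// data, acknowledgement. A false ack leaves the word in the source
// buffer, modeling the retry behavior described above.
#include <cstdio>
#include <deque>

struct Buffer {
    std::deque<int> words;   // stored data words
    int capacity = 2;        // e.g. a 2-word virtual-channel buffer
    bool has_space() const { return (int)words.size() < capacity; }
};

int main() {
    Buffer src, dst;            // e.g. S1.E-N{1} and S2.E-W{1}
    src.words = {0xA, 0xB};     // two words waiting in the source buffer

    int cycle = 0;
    while (!src.words.empty()) {
        // Cycle 1: request and win the output-channel privilege.
        std::printf("cycle %d: arbitration won\n", ++cycle);
        // Cycle 2: drive the output address (the destination buffer-id).
        std::printf("cycle %d: address -> S2.E-W{1}\n", ++cycle);
        // Cycle 3: drive the data word on the Data-line.
        int word = src.words.front();
        bool accepted = dst.has_space();
        if (accepted) dst.words.push_back(word);
        std::printf("cycle %d: data 0x%X %s\n", ++cycle, word,
                    accepted ? "stored" : "discarded");
        // Cycle 4: the acknowledgement decides keep-or-erase at the source.
        if (accepted) src.words.pop_front();  // success: erase the word
        std::printf("cycle %d: ack = %s\n", ++cycle,
                    accepted ? "true" : "false");
    }
}
```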
3.5 Round robin scheduling
At the output stage of each switch port, we implement the arbiter with a round-robin scheduling technique to decide which virtual channel gets the privilege to access the output channel and transmit data [7]. In this way, the transmission paths that deliver data through the same port of a switch use equal shares of the output channel bandwidth, and no path starves while waiting for the channel. Note that only the virtual channels that actually have data to transmit are scheduled; we do not grant the channel privilege to buffers that are not transmitting data. This increases the channel utilization.
                         Figure 22 : Round robin scheduling
In Figure 21, there are three messages delivered by three separate virtual channels through the same output port. Without round-robin scheduling, a short message may be postponed for a very long time behind a long one, as shown in Figure 22 (without round-robin scheduling). With the round-robin scheduling technique, each message equally shares the channel bandwidth, as shown in Figure 22 (with round-robin scheduling). We can also assign weights to the virtual channels; in this example, buffer A gets twice the bandwidth of buffer B and buffer C.
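A sketch of such an arbiter follows (our illustration, not the thesis RTL; weighting is omitted). The key property matching the text above is that only virtual channels that currently have data compete, so idle channels cost no bandwidth.

```cpp
// Round-robin arbiter for one port's output stage: grant rotates over the
// active virtual channels only, skipping idle ones.
#include <array>
#include <cstdio>

struct RoundRobinArbiter {
    static const int kChannels = 4;
    int last = kChannels - 1;  // most recently granted channel

    // Returns the granted channel index, or -1 if none is active.
    int grant(const std::array<bool, kChannels>& active) {
        for (int i = 1; i <= kChannels; ++i) {
            int c = (last + i) % kChannels;
            if (active[c]) { last = c; return c; }
        }
        return -1;
    }
};

int main() {
    RoundRobinArbiter arb;
    std::array<bool, 4> active = {true, false, true, true};  // VC1 is idle
    for (int cycle = 0; cycle < 6; ++cycle)
        std::printf("cycle %d: grant VC%d\n", cycle, arb.grant(active));
    // Grants rotate over VC0, VC2, VC3 only; the idle VC1 is skipped.
}
```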
3.6 Performance
For each transmission between on-chip components, we reserve buffers in the switches on the transmission path to form a virtual-channel connection. With such a dedicated channel and round-robin scheduling, we can guarantee the minimum bandwidth of each transmission. For example, consider the east port of the switch in Figure 23, where three paths share the bandwidth and we assign them the same weight. Each transmission path is then guaranteed at least one third of the channel bandwidth:

$$\text{local guaranteed minimum bandwidth} = \frac{\text{channel bandwidth}}{\text{sum of weights of paths using this channel}} \tag{3-1}$$
If the transmission intervals of these three paths do not overlap, the switch provides higher bandwidth to the paths that are transmitting data.
The example in Figure 23 and equation (3-1) describe only the locally guaranteed bandwidth in one switch. To obtain the guaranteed bandwidth of a whole transmission path, we trace all the locally guaranteed bandwidths in the switches along the path and choose the smallest one as the minimum bandwidth of the transmission path. Equations (3-2) and (3-3) describe the expression, where $LBW_s(p)$ is the locally guaranteed bandwidth of path $p$ in switch $s$, $w_q$ is the weight of path $q$, and $GBW(p)$ is the guaranteed bandwidth of the whole path:

$$LBW_s(p) = \frac{\text{channel bandwidth} \times w_p}{\sum_{q \,\in\, \text{paths sharing the channel}} w_q} \tag{3-2}$$

$$GBW(p) = \min_{s \,\in\, \text{switches on } p} LBW_s(p) \tag{3-3}$$
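As a quick numeric check of (3-2) and (3-3), the sketch below (ours; the per-switch weight sums are invented for illustration) computes the guaranteed bandwidth of a path such as the P1-to-P6 example discussed next.

```cpp
// Guaranteed bandwidth of a path = minimum of the local guarantees at the
// switches it crosses.
#include <algorithm>
#include <cstdio>
#include <vector>

// Local guarantee at one switch, eq. (3-2): the path's weighted share of
// the channel among all paths using that channel.
double lbw(double channel_bw, double path_weight, double sum_weights) {
    return channel_bw * path_weight / sum_weights;
}

int main() {
    const double scb = 1.0;  // standard channel bandwidth
    // P1 -> P6 through S1, S2, S3, S6; weight sums per switch are made up.
    std::vector<double> sums = {2.0, 3.0, 2.0, 1.0};
    double gbw = scb;
    for (double s : sums)
        gbw = std::min(gbw, lbw(scb, 1.0, s));  // eq. (3-3)
    std::printf("guaranteed bandwidth = %.2f scb\n", gbw);  // 0.33 scb
}
```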
Assume we provide all the channels of our platform with the same bandwidth, called the standard channel bandwidth (scb). In Figure 24, there is a transmission path from P1 to P6 through switches S1, S2, S3, and S6. The related LBW values of these switches are shown in the figure, and the guaranteed bandwidth of this path is equal to the smallest local guaranteed bandwidth among them.
                    Figure 24 : Bandwidth guaranteed transmission path
In Section 3.1, we listed the requirements of efficient communication, guaranteed throughput, and fault tolerance capability. We now explain how our platform meets these requirements.

1. Efficient communication

After mapping the tasks, we can estimate the traffic of our system. Moreover, we will know the constraints and requirements of the transmission paths. For transmission paths that need larger bandwidth, we can alleviate the network load by assigning different paths to them. Figure 25 is an example: in (a), if both transmission paths use the channel connection S2 - S3 - S6, they will contend on these channels, so we assign different paths to them in (b) to avoid the overlapping of transmission paths and obtain better performance.
2. Guaranteed throughput

With dedicated virtual channels and round-robin scheduling, we can provide a guaranteed minimum bandwidth for each transmission path. We can therefore predict the worst case of a transmission and estimate the system performance at higher levels.
3. Fault tolerance

With different path assignments, we can easily avoid using faulty components. Suppose, for example, that the original transmission path from P1 to P6 would use the faulty switch S3. By assigning another transmission path, we can still transmit data from P1 to P6 correctly and solve this problem. This example shows the fault tolerance capability of our platform.
Chapter 4
    Experimental Results
To verify the functionality and evaluate the communication performance of our platform, we use a 2-D mesh topology with 4-by-4 nodes as our platform. With the routing tables configured for the evaluated traffic, we measure the communication performance of the platform; each processing element only provides the function of generating random traffic here. We wrote a random-pattern-generating model to replace the original processing element. This pattern generator can generate packets of random length from a random source to a random destination at random intervals.
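A sketch of such a generator follows (our own reconstruction, not the thesis model; all names and parameter ranges are assumptions):

```cpp
// Random-traffic pattern generator: packets of random length between
// random node pairs of the 4x4 mesh, separated by random idle intervals.
#include <cstdio>
#include <random>

struct Packet {
    int src, dst;   // node indices in the 4x4 mesh (0..15)
    int length;     // words
    int gap;        // idle cycles before injection
};

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> node(0, 15);
    std::uniform_int_distribution<int> len(1, 8);
    std::uniform_int_distribution<int> gap(0, 4);

    for (int i = 0; i < 5; ++i) {
        Packet p{node(rng), node(rng), len(rng), gap(rng)};
        if (p.src == p.dst) continue;  // skip self-traffic
        std::printf("packet: %d -> %d, %d words, after %d idle cycles\n",
                    p.src, p.dst, p.length, p.gap);
    }
}
```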
We implemented our switch design in both Verilog HDL and a cycle-accurate C++ model. The Verilog version targets the traditional cell-based design flow, and the C++ model is used for fast simulation.
4.1 Definition
Latency: the time elapsed from when the packet transmission is initiated until the packet is completely received at the destination.

Maximum latency: the predicted worst-case latency under the guaranteed bandwidth.
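The experiments below report a normalized latency; judging from the discussion in Section 4.2 (packets arriving within the maximum latency have normalized latency < 1), it is the measured latency scaled by the predicted worst case:

$$\text{normalized latency} = \frac{\text{latency}}{\text{maximum latency}}$$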
4.2 Experiment
4.2.1 Different injection rates

We verify that our platform can guarantee the minimum bandwidth for each transmission. In this experiment, we provide each buffer in the intermediate switches with 2 words and evaluate the performance of our platform under different injection rates. Table 1 shows the resulting histograms of the normalized latency.
First, we observe that even under a high injection rate (injection rate = 1), packets in the network are still delivered to their destinations within the maximum latency (normalized latency < 1). The result shows that our platform guarantees the performance even in the worst case.
Secondly, the histogram of the normalized latency shifts lower as the injection rate decreases, indicating that the average latency decreases with the injection rate. This trend implies that if we can control the communication well enough to reduce contention, the transmission latency will decrease accordingly.
The histogram also shows that the packet proportions at high and low injection rates are more concentrated than those at medium injection rates. At a very high injection rate, almost all of the packets are transmitted with the guaranteed bandwidth, and the transmission paths can hardly use any extra bandwidth; this makes the normalized latency at high injection rates close to the maximum latency. The reason at low injection rates is similar: almost all of the packets are delivered without contention, and the transmission paths can use almost the full channel bandwidth. Only a few packet transmissions overlap and share the same channel bandwidth, which makes the average latency of transmissions at low injection rates close to the minimum latency. In contrast, transmissions at medium injection rates sometimes encounter contention and are sometimes transmitted without blocking. The actual bandwidths of these transmission paths therefore vary between the guaranteed bandwidth and the full channel bandwidth, which makes the latency of the transmissions uncertain. That is why the packet proportions at medium injection rates are widely distributed.
4.2.2 Communication quality under different buffer sizes of the virtual channel
In Table 2, we show the histograms of the normalized latency under different buffer sizes of the virtual channels. A clear trend is observed: the average normalized latency decreases as the buffer size increases. This means that if the system on our platform needs transmission paths with large channel bandwidth, we can provide them with larger buffers.

4.3 Implementation result

We implement the queues in the switch with 32 words, where each word is four bytes.
We use TSMC 0.25um (1P4M) technology. The synthesis report shows that our switch design can work at 185MHz (clock cycle = 5.4ns). The area of our switch design is about 3.5 mm^2; the main part is used by the memory modules, which occupy about 2.7 mm^2.
Chapter 5

Conclusion and Future Work

In this thesis, we proposed a novel network platform for multi-processor systems-on-chip and, at the same time, a novel switch design for on-chip communication. The platform can be configured to provide different data exchange mechanisms for different applications. With dedicated virtual channels and round-robin scheduling, we can guarantee the minimum channel bandwidth for transmission paths. By using the pipelined bus as the basic communication mechanism, the system can transfer data in a pipelined fashion to increase performance. The experimental results indicate that our platform can transfer data efficiently and predictably.

As for future work, the memory size of the switches is a trade-off between flexibility and cost; it depends on application features and needs a top-down design flow to optimize it. The selection of transmission paths is another issue that deserves further study.
Reference

[1] Pierre Guerrier and Alain Greiner, “A generic architecture for on-chip packet-switched interconnections,” Proceedings of Design, Automation and Test in Europe (DATE), 2000.
[2] William J. Dally and Brian Towles, “Route packets, not wires: on-chip interconnection networks,” Proceedings of the Design Automation Conference (DAC), 2001.
[4] Luca Benini and Giovanni De Micheli, “Networks on chips: a new SoC paradigm,” IEEE Computer, January 2002.
[5] Daniel Wiklund and Dake Liu, “SoCBUS: switched network on chip for hard real time embedded systems,” Proceedings of the International Parallel and Distributed Processing Symposium, 2003.
[6] E. Rijpkema et al., “Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip,” Proceedings of Design, Automation and Test in Europe (DATE), 2003.
[7] Jose Duato, Sudhakar Yalamanchili, and Lionel Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers, 2003.
[9] International Technology Roadmap for Semiconductors (ITRS), http://public.itrs.net/
[11] Terry Tao Ye, On-Chip Multiprocessor Communication Network Design and Analysis, Ph.D. dissertation, Stanford University, 2003.
[12] Fred Halsall, Data Communications, Computer Networks, and Open Systems, Addison-Wesley, 1992.
[13] ARM Ltd., http://www.arm.com
[14] Kanishka Lahiri, Anand Raghunathan, and Sujit Dey, “Efficient Exploration of the SoC Communication Architecture Design Space,” Proceedings of the 14th International Conference on VLSI Design, 2001.
[17] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press, 1998.
                                        Vita
Pao-Jui Huang was born in Changhua on May 30, 1980. He received the B.S. degree in Electronics Engineering from National Tsing Hua University in June 2002. From September 2002 to July 2004, he was a graduate student of Professor Jing-Yang Jou in the Institute of Electronics, National Chiao Tung University. His research was related to the design of on-chip communication networks.