c2c-gem5: Full-System Simulation of Cache-Coherent Chip-to-Chip Interconnects
Luis Bertran Alvarez, Ghassan Chehaibar, Stephen Busch, Pascal Benoit,
David Novo
Abstract—High-Performance Computing (HPC) is shifting toward chiplet-based System-on-Chip (SoC) architectures, necessitating advanced simulation tools for design and optimization. In this work, we extend the gem5 simulator to support cache-coherent multi-chip systems by introducing a new chip-to-chip interconnect model within the Ruby framework. Our implementation is adaptable to various coherence protocols, such as Arm CHI. Calibrated with real hardware, our model is evaluated using PARSEC workloads, demonstrating its accuracy in simulating coherent chip-to-chip interactions and its effectiveness in capturing key performance metrics early in the design flow.

Index Terms—computer architecture simulation, HPC, cache coherency, chiplet, chip-to-chip interconnect.

I. INTRODUCTION

Driven by the need for cost-efficiency and improved yields, High-Performance Computing (HPC) is transitioning from traditional monolithic chip designs to chiplet-based System-on-Chip (SoC) architectures [1]. Accordingly, adequate simulation technology is required to tackle the new challenges in designing, evaluating, and optimizing these systems.

A particularly interesting chiplet-based design approach includes multiple cache-coherent multi-core chips that, when connected, maintain coherency across all interconnected chips. CCIX [2], UCIe [3], and OpenCAPI [4] are examples of open standards proposed to enable such systems. In this approach, a single-chiplet CPU serves edge applications, while multiple-chiplet CPUs deliver many-core cloud performance. Unfortunately, existing architecture simulators fall short in modeling such systems. They lack flexibility, detailed modeling, and OS control at the node level, limiting accurate simulation of modern coherent interconnect protocols [5].

Our goal in this work is to extend the gem5 simulator [6], a widely-used and open-source architecture simulator, to model cache-coherent multi-chip multi-processor systems. To this end, we design a new cache-coherent chip-to-chip (C2C) interconnect model within Ruby [7], a highly configurable cache coherence protocol simulator used within the gem5 simulation framework. Our interconnect model is designed to connect multiple chips that implement the Arm Coherent Hub Interface (CHI) coherence protocol [8], a state-of-the-art solution for high-performance, scalable, multi-core systems. Given the uncertainty about which chip-to-chip interconnect standard will become dominant, our goal is not to implement a specific protocol (e.g., CCIX). Instead, we aim to provide a reference implementation that can be adapted to any protocol.

We use a real hardware platform made up of two Arm N1SDP boards [9] connected via CCIX to calibrate our simulation model. We design a microbenchmark to measure the latency of local and remote main memory accesses. We evaluate our model running PARSEC [10] workloads on two CHI chips operating under the same OS, with a distributed, non-interleaved shared memory system. Our results include key metrics of a C2C scenario, such as the distribution of traffic types and the bandwidth at the C2C link. We also show the execution overhead of using a coherent C2C link with real-life applications. Using hardware, we calibrate the model and show closely matching results, showcasing the flexibility of our simulator.

We make the following contributions in this paper:
• We identify the key challenges to extending CHI cache coherency to multiple chips.
• We introduce c2c-gem5, a new cache-coherent chip-to-chip interconnect model in gem5, and make our model freely available as open source [11].
• We calibrate c2c-gem5 using a real hardware platform.
• We show results of c2c-gem5 running PARSEC workloads.

II. BACKGROUND

In multi-core systems, cache coherency is maintained by protocols that ensure all cores have a consistent view of memory. Arm CHI is a modern cache coherency protocol used in Arm [12] and RISC-V [13], [14] HPC architectures that supports large many-core systems using a directory-based coherence mechanism. CHI supports the MESI and MOESI cache coherence models and comprises three main components:
• The request node (RN) initiates transactions and sends memory requests. A fully coherent request node (RNF) caches data locally and must respond to snoop requests.
• The interconnect (ICN) serves as the responder for request nodes and encapsulates fully coherent home nodes (HNF), which act as points of coherency (PoC) and serialization (PoS) for specific address ranges. HNFs manage snoop requests to RNFs and memory access requests to SNFs.
• The subordinate nodes (SNF) interface with the memory controllers.

Figure 1 shows a simplified diagram of two chips, each including the necessary components to support local CHI coherency.

The CHI protocol includes four message types: (1) request messages initiate transactions such as read and write operations, (2) snoop messages maintain cache coherence by querying other caches for the state of specific data, (3) response messages provide the status or data needed to complete a transaction in response to requests and snoops, and (4) data messages carry the actual data between components.
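To make the directory-based mechanism concrete, the following Python sketch (our own simplification, not code from the CHI specification or from gem5) shows how a home node might serve a shared read: if a cache already holds the block, the HNF snoops it; otherwise, it fetches the data from main memory through the SNF.

    # Illustrative sketch of a directory-based home node (HNF).
    # Names and structure are our own simplification of CHI behavior.
    class HomeNode:
        def __init__(self, snf):
            self.snf = snf        # subordinate node (memory side)
            self.sharers = {}     # block address -> set of RNF ids

        def read_shared(self, block, requester):
            owners = self.sharers.get(block, set())
            if owners:
                # A cache holds the block: snoop it so that it forwards
                # the most recent data to the requester (cf. SnpSharedFwd).
                data = self.snoop(next(iter(owners)), block)
            else:
                # No sharer: fetch the block from main memory via the SNF.
                data = self.snf.read(block)
            self.sharers.setdefault(block, set()).add(requester)
            return data

        def snoop(self, owner, block):
            # Placeholder for a snoop transaction to a fully coherent RNF.
            return ("data", owner, block)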
Fig. 1: High-level C2C architecture (left) and reference architecture (right) with CHI components and C2CIs.

III. COHERENT CHIP-TO-CHIP INTERCONNECT

To enable global coherence across multiple CHI chips, we propose a new Chip-to-Chip Interface (C2CI) component, along with minimal modifications to the HNF component and additional metadata in the messages.

A. On-chip vs. chip-to-chip requests

To demonstrate the key differences between a standard on-chip local memory request and one that requires traversing the chip-to-chip interface to access memory on another chip, we refer to Figure 1 and the following examples.
When Chip-0 is not connected to Chip-1, it operates following the standard CHI protocol. Thus, when RNF-0 initiates a local memory request, NoC-0 routes the request to HNF-0, which is responsible for managing memory coherency within Chip-0. HNF-0 monitors all caches that store a copy of the requested memory block (i.e., sharers). If the data is not found in any cache, HNF-0 requests SNF-0 to retrieve the data from Chip-0's main memory and to send it to RNF-0. Instead, if the data is present in a local cache, HNF-0 issues a snoop request (e.g., SnpSharedFwd) to that cache, which provides RNF-0 with the most up-to-date version of the data. Finally, HNF-0 updates the state of the block after receiving the corresponding acknowledgment messages.
When Chip-0 is connected to Chip-1 and RNF-0 initiates a memory request to a block allocated to Chip-1's main memory, new challenges arise. First, NoC-0 should check the address range and route all remote memory accesses to the chip-to-chip interface C2CI-0 instead of the local HNF. C2CI-0 then forwards the request to C2CI-1, as it belongs to the chip housing the HNF managing the address of the request. C2CI-1 initiates a new request that NoC-1 routes to HNF-1. In this case, HNF-1 must monitor all caches that may hold a copy of the requested block, both locally on Chip-1 and remotely on Chip-0. When sharers are remote, HNF-1 must issue snoops that need to be forwarded by C2CI-1 to Chip-0, and vice versa for the corresponding acknowledgments.
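The address-range check that steers a request either to the local HNF or to the C2CI can be illustrated with a short Python sketch. The range boundary matches the non-interleaved memory map used later in Section VI; the component names stand in for NoC-0 routing-table entries and the function is our own illustration, not simulator code.

    # Hypothetical routing decision in NoC-0: requests to the local
    # address range go to HNF-0, everything else to C2CI-0.
    CHIP0_RANGE = range(0x0000_0000, 0x6000_0000)   # [0, 1.5 GB)

    def route(addr):
        return "HNF-0" if addr in CHIP0_RANGE else "C2CI-0"

    assert route(0x1000_0000) == "HNF-0"   # local access, stays on-chip
    assert route(0x7000_0000) == "C2CI-0"  # remote access, crosses the C2C link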
B. Main design challenges

To extend CHI coherency across multiple chips, four main challenges must be addressed.

Routing across chips. The CHI NoC only routes messages to local components. Thus, for remote accesses, two C2CIs must coordinate to function as a gateway. In the previous example, C2CI-0 serves as the destination of the request in the local NoC-0, while C2CI-1 is the initiator of the forwarded request in the remote NoC-1. To achieve this, NoC routing tables must be updated to direct remote accesses to the local C2CI, and C2CIs need to incorporate both responder and initiator functionality.

Tracking remote sharers. The CHI HNF must track all caches storing a copy of a specific memory block to maintain data consistency across the system. When a cache modifies a shared memory block, the HNF identifies all caches holding copies of that block and issues snoop requests to either invalidate or update their copies. To extend CHI coherency across multiple chips, the HNF needs to be extended to track sharers located off-chip.

Interactions between C2CIs and local chips. For certain transactions, the behavior of the C2CI must differ based on whether the initiator is local or remote. For example, when a SnpSharedFwd request reaches the C2CI, it should expect a snoop response (SnpResp) if the request was initiated by a remote RNF, but an additional data message (Data) if the request was initiated by a local RNF. To enable this differentiation, new metadata must be added to the CHI messages to distinguish local from remote initiators.

Chip-to-chip link management. The C2CI must keep track of the multiple in-flight transactions it is currently processing by storing the necessary metadata (e.g., source, destination, type of transaction) locally in Transaction Buffer Entries (TBEs). This is straightforward when transactions operate on different addresses. However, it becomes significantly more complex when transactions operate on the same address. In such cases, the address is no longer sufficient to identify individual transactions, and more complex mechanisms are needed. One alternative, sketched below, is to limit the number of in-flight C2C transactions to the same address.
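The following Python sketch is our own simplification of that alternative: TBEs are keyed by a transaction identifier, and allocation fails when another in-flight transaction already targets the same address, i.e., at most one C2C transaction per address.

    # Simplified TBE table for the C2C interface with a per-address
    # in-flight limit of one; a failed allocation means the caller must
    # stall the request and retry later.
    class TBETable:
        def __init__(self):
            self.entries = {}    # transaction id -> metadata

        def allocate(self, txn_id, addr, src, dst, kind):
            if any(e["addr"] == addr for e in self.entries.values()):
                return False
            self.entries[txn_id] = {"addr": addr, "src": src,
                                    "dst": dst, "kind": kind}
            return True

        def deallocate(self, txn_id):
            del self.entries[txn_id]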
IV. GEM5 IMPLEMENTATION CHALLENGES

We implement a new cache-coherent chip-to-chip interconnect model called c2c-gem5 within gem5 [6], [15], a highly modular, cycle-accurate computer architecture simulator widely used in industry and academia. The gem5 simulator provides detailed models for CPUs, caches, memory, and full-system simulation. It includes two different cache systems: Ruby, which models cache coherence protocols with high fidelity, and the "classic" caches, which lack cache coherence fidelity and flexibility. We build c2c-gem5 within Ruby using a domain-specific language called SLICC, which allows users to define the specific architecture and behavior of each controller and message type within a coherence protocol. Conveniently, Ruby includes an implementation of the CHI protocol developed by Arm [16].

However, implementing c2c-gem5 within Ruby is not straightforward, as Ruby does not natively support multiple interconnected NoCs. Ruby was originally designed to model monolithic chips including a single NoC model, but modeling multiple chips in Ruby requires multiple NoCs (see NoC-0 and NoC-1 in Figure 1). Recent work [17], [18] removed the single-NoC limitation, enabling the use of multiple NoCs to model separate processor and GPU networks in the same Ruby environment. However, connecting multiple NoCs and extending their coherence is still problematic: Ruby makes implicit assumptions about how its controllers are connected to the NoC.
Fig. 2: Port connection problem when using Ruby ports between two controllers of different networks.

Figure 2 illustrates the problem. The top subfigure shows the expected configuration when connecting two Ruby controllers of different networks (e.g., C2CI-0 from NoC-0 and C2CI-1 from NoC-1). However, as shown in the middle subfigure, Ruby automatically connects all the ports of a controller to its corresponding network, including the C2C-side ports that we intend to connect to the C2C link. To address this issue, we use gem5 ports instead of Ruby ports to connect the C2CI to the C2C link, as shown in the bottom subfigure. In gem5, controllers outside of the Ruby environment use gem5 ports (i.e., classic ports) and gem5 packets for communication. For example, MemCtrl-0 in Figure 1 communicates with SNF-0 using a classic gem5 port. These bi-directional gem5 ports are fully controlled by the designer, avoiding the connectivity issues encountered with Ruby ports. Models outside of Ruby are addressed through gem5 classic ports, but only one such port, called mem_out_port, is available to Ruby controllers. This port, defined in the AbstractController class, is inherited by all Ruby controllers. Consequently, in order to declare a new gem5 classic port, we need to extend the AbstractController class, as well as the SLICC compiler.
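The wiring idea can be summarized with a self-contained Python sketch. The classes below illustrate the explicit pairing discipline only; they are not the actual gem5 classes, and the port names are hypothetical. In the real implementation, the C2C-side ports are gem5 classic ports declared by extending AbstractController.

    # Classic-style ports are paired explicitly, one-to-one, so the
    # designer keeps full control of the C2C-side connections and the
    # Ruby network cannot capture them.
    class Port:
        def __init__(self, name):
            self.name, self.peer = name, None

        def connect(self, other):
            self.peer, other.peer = other, self

    class C2CI:
        def __init__(self, chip_id):
            # Ruby-side ports (not shown) stay bound to the local NoC;
            # the ports below are exposed to the C2C link only.
            self.c2c_out = Port(f"chip{chip_id}.c2c_out")
            self.c2c_in = Port(f"chip{chip_id}.c2c_in")

    c2ci0, c2ci1 = C2CI(0), C2CI(1)
    c2ci0.c2c_out.connect(c2ci1.c2c_in)   # requests Chip-0 -> Chip-1
    c2ci1.c2c_out.connect(c2ci0.c2c_in)   # requests Chip-1 -> Chip-0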
V. C2CI IMPLEMENTATION

Figure 3 shows a high-level overview of c2c-gem5. It features a new Ruby controller named C2C-Interface (C2CI) (1), minimal modifications to the HNF component (2), and changes to the CHI message format (3). These elements work together to enable inter-chip communication within a single Ruby system and extend cache coherency across the chips.

C2C Interface Controller. The C2CI has multiple key responsibilities. First, it acts as an HNF for all local requesters; in Figure 1, C2CI-0 is seen as HNF-1 by RNF-0. The C2CI is a proxy for all off-chip address ranges. Second, it acts as a requester, injecting CHI request messages into the network in order to service requests initiated off-chip. Third, it sends chip-to-chip requests via the C2C link. Fourth, it receives chip-to-chip requests directed to the address space allocated to its on-chip HNF.

The C2CI must support 73 different CHI transactions that can transit the C2C link. In order to keep the code to a reasonable length (approximately 2700 lines), we use two main mechanisms, sketched below. First, we use generalized actions to handle different types of requests. For example, sendToNet_Data sends a data message to the network, independently of the type of data (e.g., CompData_UC, SnpRespData_SC, etc.); the fields of each message are filled with information from the TBE. Second, we sort the requests into different FSMs. Some requests are treated the same way and can be executed using the same FSM; for example, ReadShared is treated the same way as ReadUnique.
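The sketch below paraphrases these two mechanisms in Python; the actual implementation is written in SLICC, and the names here are our own.

    # Mechanism 1: one generic send action for every data message type;
    # all message fields are filled from the transaction's TBE.
    def send_to_net_data(net, tbe, data_type):
        # data_type could be "CompData_UC", "SnpRespData_SC", etc.
        net.enqueue({"type": data_type, "addr": tbe["addr"],
                     "dest": tbe["dst"], "data": tbe["data"]})

    # Mechanism 2: requests that are handled identically are dispatched
    # to the same finite state machine.
    FSM_OF_REQUEST = {
        "ReadShared": "read_fsm",
        "ReadUnique": "read_fsm",       # same handling as ReadShared
        "WriteBackFull": "writeback_fsm",
    }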
Fig. 3: High-level overview of c2c-gem5, including the C2CI controller internal structure.

The C2CI handles two different types of Ruby messages: the classic CHI message and a new C2C message (C2cMsg), as shown in Figure 3. We define C2cMsg using SLICC to carry requests, snoops, data, and responses over the C2C link. C2cMsg contains metadata serving two primary purposes: first, to facilitate transmission through the C2C link, and second, to enable the generation of appropriate CHI messages for injection into the destination NoC. This includes CHI-specific fields such as RetToSrc, which determines whether a snoop response returns data, and AllowRetry, which indicates whether the request can be retried. Additionally, C2cMsg includes C2C-specific fields such as C2c sharers, which indicates the off-chip sharer associated with an address for snoop routing. A Priority field is also included for C2C link access control, to indicate that the CHI message is a priority snoop and should be serviced immediately, putting the ongoing transaction on hold. This link access control is used to manage the number of in-flight transactions, as indicated in Section III-B. Specifically, we implement a handshake mechanism that enables two controllers to acknowledge their readiness to process a request for a specific address.
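A Python paraphrase of the C2cMsg metadata is shown below; the real message type is defined in SLICC, and the field names here merely mirror the ones described above.

    from dataclasses import dataclass, field

    # Sketch of the C2cMsg metadata (illustrative field names).
    @dataclass
    class C2cMsg:
        addr: int                   # cache-line address
        kind: str                   # request, snoop, response, or data
        chi_opcode: str             # used to regenerate the CHI message
        ret_to_src: bool = False    # CHI: snoop response must return data
        allow_retry: bool = False   # CHI: request may be retried
        c2c_sharers: set = field(default_factory=set)  # off-chip sharers
        priority: bool = False      # link access control: priority snoop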
Fig. 4: C2C request handshake mechanism flow diagram.

Figure 4 shows a flow diagram of a message exchange for request acknowledgment between two C2C-Interface controllers. In this example, C2CI-0 handles the address range containing the address of the request, which means that it has priority and does not need an acknowledgment. In step 1, C2CI-0 receives a CHI request. The controller then determines whether the request is granted priority; this happens in the Ack block of Figure 3. In this example, the blue request has priority, so it is sent to the C2C link. In step 2, two things happen. First, C2CI-1 checks whether the yellow request has priority. It does not, so C2CI-1 goes through the handshake process: the yellow request is sent to the C2C link with a special flag, and the initial request is enqueued in an internal queue to wait for the acknowledgment. This queue is labeled stalled in Figure 3. Second, the blue request is received by C2CI-1. The controller determines how the request is treated; in this example, the request has priority and is sent to NoC-1. In step 3,
C2CI-0 receives the yellow request. The request carries the acknowledgment flag, and since C2CI-0 is still servicing the blue request, the yellow request is not processed directly; instead, a TBE flag for the pending request is set. In step 4, the blue request is finalized. C2CI-0 checks whether there are pending requests and sends an acknowledgment message accordingly. In step 5, C2CI-1 receives the acknowledgment, dequeues the pending request from the stalled queue, and processes it.
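Our reading of this flow can be condensed into the following Python sketch (illustrative structure and names, not the SLICC implementation):

    # Handshake decision on receiving a CHI request. A C2CI that owns
    # the address range has priority and forwards directly; otherwise
    # the request crosses the link flagged as needing an acknowledgment
    # and waits in the 'stalled' queue.
    def on_chi_request(c2ci, msg):
        if c2ci.owns_range(msg.addr):
            c2ci.send_over_link(msg)
        else:
            msg.needs_ack = True
            c2ci.send_over_link(msg)
            c2ci.stalled.append(msg)

    # Acknowledgment received: the stalled request can now be serviced.
    def on_req_ack(c2ci, addr):
        for msg in list(c2ci.stalled):
            if msg.addr == addr:
                c2ci.stalled.remove(msg)
                c2ci.process(msg)
                break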
Finally, the C2CI controller has a different port structure for the network side and the C2C side. In Figure 3, the network-side ports (on the left of C2CI-0) are labeled as Ruby ports. These ports allow the C2CI to receive and send CHI messages to and from the local controllers (i.e., RNFs, HNFs). The C2C-side ports are also Ruby ports but are connected to gem5 ports. As indicated in Section IV, this architecture allows the C2CI to have available ports, not connected to the NoC, with which to receive and send C2C messages via the C2C link.
Tracking off-chip sharers and initiators. In addition to the C2CI, we introduce minor modifications to the HNF and CHI message format of the default gem5 CHI protocol implementation. The HNF sends various messages to RNFs to maintain cache coherence. Since the CHI implementation of gem5 is intended to work on monolithic chips, it lacks functionality to track the status of a cache line shared with off-chip RNFs. As such, the HNF will consider the C2CI, the controller initiating requests in the local NoC, as the sharer. This means that the HNF would only know that off-chip sharers have a copy of the cache line address, but it would be unable to identify the individual off-chip sharers. Consequently, the HNF would be unable to issue the necessary snoop requests to maintain coherence with off-chip sharers.

To address this issue, we simply add a new metadata field called c2c sharers to the directory data structure managed by the HNF. It allows the HNF to store the off-chip sharers of the cache line. Additionally, we add an off-chip sharer field to the metadata of the CHI and C2C messages to carry this information all the way to the HNF.

When the RNF receives the expected messages to complete a transaction, it sends an acknowledgment to the responder. Upon receiving this acknowledgment message, the C2CI-0 in Figure 3 forwards the message to the C2CI-1, populating a field of the message with the corresponding remote sharer of the line. The right side of the figure shows how the C2CI-1 will use this field to inject the CHI message into NoC-1 with the right off-chip sharer field. When the HNF-1 receives the acknowledgment and updates the directory entry, the field is also used to update the c2c sharer field. Later, when the HNF-1 sends a snoop to the C2CI-1, it will populate a new field of the CHI message with the off-chip sharer, which will be used to route the snoop request to the appropriate chip.

Finally, we add a second field to the CHI message metadata to track remote initiators. This information is required by the FSM of the C2CI to properly handle certain incoming requests.

VI. EXPERIMENTAL SETUP

Modeled architectures. We execute gem5 22.1 on a server with an Intel Xeon Silver 4214R processor (2.4 GHz) running CentOS Linux 7. We configure c2c-gem5 to model a two-chip architecture, as illustrated in Figure 1a. Each chip features a CPU with a 64 KB 4-way set-associative L1 data cache and a 1 MB 8-way set-associative L2 cache, one HNF, one SNF, and one C2C interface. Each chip also includes a memory controller connected to the SNF. Memory ranges are non-interleaved as follows: [0, 1.5 GB) for Chip-0 and [1.5 GB, 3 GB) for Chip-1. Each memory controller includes one channel with 2 ranks, each rank containing 8 banks of DDR3-3200 MT/s. We also model a single-chip reference architecture that has the same characteristics but without the C2C link (Figure 1b); all its controllers are assembled around a unique NoC. Both architectures are simulated in full-system mode, running Ubuntu Linux 18.04. The OS boots using a non-caching simple CPU model to ensure fully deterministic simulation while reducing boot time by bypassing the cache subsystem. Upon entering the Region-of-Interest (ROI), we switch to the timing CPU model, which is more precise and compatible with the Ruby subsystem.
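The key parameters of this configuration can be summarized as follows. This is a plain Python description for reference only; the actual gem5 configuration script sets the equivalent options on the Ruby/CHI objects.

    GiB = 1024 ** 3

    # Per-chip resources of the modeled two-chip system.
    chip = {
        "l1d": {"size": "64KB", "assoc": 4},
        "l2": {"size": "1MB", "assoc": 8},
        "controllers": ["HNF", "SNF", "C2CI"],
        "dram": {"type": "DDR3-3200", "channels": 1,
                 "ranks": 2, "banks_per_rank": 8},
    }

    # Non-interleaved address ranges, one per chip.
    mem_ranges = {
        "chip0": (0, int(1.5 * GiB)),        # [0, 1.5 GB)
        "chip1": (int(1.5 * GiB), 3 * GiB),  # [1.5 GB, 3 GB)
    }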
Real hardware platform. We use a real hardware testbed to calibrate the C2C model. It is built around two Quad-core N1 (Arm v8.2A) Software Development Platforms (N1SDP [9]). These two platforms use the Arm AMBA CMN-600 NoC protocol and are interconnected through a specialized PCIe riser card using the CCIX protocol.

Workloads. We evaluate c2c-gem5 using programs from the PARSEC [10] benchmark suite, selecting those that complete the simulation within 15 hours on the reference architecture. PARSEC primarily represents multi-threaded applications that demand large amounts of shared memory.

VII. RESULTS
OS integration. We validate the programmer's view of our multi-chip model. Using full-system simulation, we can boot the OS and run commands like the lscpu Linux command to gain insight into the underlying hardware. We execute this command on the simulated platforms shown in Figure 1, and the OS detects a 2-core processor in both the reference and the C2C architecture. When running the same command on the real hardware platform, we observe a consistent result: the OS detects an 8-core processor, while each of the two interconnected platforms consists of a 4-core processor.
Microbenchmarking. We use the real hardware platform described in Section VI to calibrate the latency of the chip-to-chip link in our model. We design a microbenchmark to measure DRAM access time using the following steps: (1) allocate a large array of 24 GiB (larger than the local DRAM capacity, to force memory allocation into the remote memory), (2) randomly select a position in the array, (3) read data from the selected position, (4) flush the read data, (5) go back to step 3 and repeat multiple times to reduce measurement noise, (6) measure the accumulated execution time and divide it by the number of reads to obtain the DRAM access time, (7) go to step 2 and repeat for a new address. A structural sketch of this procedure is shown below.
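The sketch illustrates the measurement structure only. The real microbenchmark is a native program: Python cannot flush cache lines, so flush() below is a placeholder for an instruction such as Arm's DC CIVAC.

    import random, time

    def measure(array, reads_per_addr=1000, addresses=100):
        samples = []
        for _ in range(addresses):                 # step 7: new address
            i = random.randrange(len(array))       # step 2
            start = time.perf_counter()
            for _ in range(reads_per_addr):        # steps 3-5
                _ = array[i]                       # step 3: read
                flush(array, i)                    # step 4: evict the line
            elapsed = time.perf_counter() - start
            samples.append(elapsed / reads_per_addr)   # step 6
        return samples                             # histogram these values

    def flush(array, i):
        pass   # placeholder: no portable cache-line flush in Python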
Running this microbenchmark on the real hardware platform allows us to generate a histogram distribution of DRAM access times. The top graph in Figure 5 shows the resulting distribution. Two distinct populations are identifiable: one centered around 110 ns and the other around 810 ns. The first population corresponds to local DRAM accesses, while the second represents off-chip DRAM accesses. We validated this assumption using the Linux pagemap interface to inspect the corresponding physical addresses. We observe an 8× difference between local accesses (approximately 100 ns) and remote off-chip accesses (approximately 800 ns).

Fig. 5: Time distribution of local and remote accesses on the hardware testbed (top) and the calibrated simulation (bottom).

Fig. 6: Access flow diagrams for local (green) and remote (yellow) accesses.

We use trace injection to reproduce the same experiment with the simulated C2C architecture. Using the gem5 traffic generator component, we can inject custom traffic into the memory subsystem; it allows us to target physical addresses, providing full control over the access pattern (see the sketch below). Figure 6 shows two flow diagrams illustrating the transactions exchanged between Ruby controllers for local and remote DRAM accesses. On the left side, local accesses do not go through the C2C link. On the right side, we see the request being routed to Chip-1, and the data is then routed back. The figure also highlights in red the delay added to calibrate the chip-to-chip link, emulating the behavior of the real hardware platform. The bottom graph of Figure 5 shows the distribution of DRAM access times in the simulated model after calibration. By comparing the two graphs, we conclude that the simulation model can easily be calibrated to precisely reproduce real hardware access times.
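A minimal configuration-script sketch using gem5's PyTrafficGen component is shown below; the parameter values are illustrative, and the port wiring to the memory subsystem is omitted.

    from m5.objects import PyTrafficGen

    tgen = PyTrafficGen()

    def remote_reads():
        # Random 64 B reads into Chip-1's range, [1.5 GB, 3 GB), issued
        # from Chip-0 so that every access crosses the C2C link.
        yield tgen.createRandom(
            100_000_000_000,   # duration in ticks
            0x6000_0000,       # start_addr: 1.5 GB
            0xC000_0000,       # end_addr: 3 GB
            64,                # block_size in bytes
            1000, 1000,        # min/max period between requests (ticks)
            100,               # rd_perc: 100% reads
            0)                 # data_limit: unlimited
        yield tgen.createExit(0)

    # Started after the ROI switch: tgen.start(remote_reads())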
Simulating multithreaded applications. Figure 7 shows a range of results obtained from running PARSEC workloads on both the reference and chip-to-chip gem5 architectures described in Section VI.

Figure 7a shows the execution times for the reference and the C2C architectures before and after the chip-to-chip link calibration described in the microbenchmarking subsection. We make two key observations. First, the reference and baseline C2C architectures achieve very similar execution times. The minor differences are due to variations in request ordering caused by slight runtime variations. Second, the calibrated C2C architecture shows a noticeable slowdown in some programs, while others remain unaffected. In particular, canneal shows the largest execution time difference, with a 3× increase. Additionally, freqmine exhibits a 25% increase in execution time. This result highlights the importance of using calibrated models in simulation for accurate performance evaluation.

Fig. 7: Comparison of (a) execution times, (b) main memory and C2C link bandwidth, and (c) normalized traffic types for different PARSEC programs across various gem5 architectures: reference single-chip, default chip-to-chip, and calibrated chip-to-chip.
Figure 7b shows the bandwidths for the C2C link and the main memory (combined for both local and remote main memory controllers) in the calibrated C2C architecture. We make two observations. First, the applications with the highest slowdown also show the highest C2C bandwidth utilization, with the slowdown being directly proportional to the bandwidth utilization. Second, the C2C link bandwidth is typically lower than the aggregated main memory bandwidth but varies depending on the application. For example, raytrace shows a C2C link bandwidth comparable to the main memory bandwidth, whereas bodytrack achieves only a third of the main memory bandwidth.

Figure 7c shows the normalized distribution of message types traversing the C2C link. We observe a similar distribution for the applications exhibiting the highest slowdown: canneal, ferret, freqmine, and raytrace. However, the low proportion of snoop requests suggests that the slowdown is not caused by coherence traffic, but rather by the limited C2C link bandwidth. Thus, moving from the modeled PCIe-based interconnect to a more integrated chiplet-based interconnect could reduce the performance overheads observed in our experiments.

VIII. RELATED WORKS

To our knowledge, this is the first work to propose a full-system software simulation of multi-chip designs connected using a cache-coherent chip-to-chip interconnect. However, we identify two areas of related work that are particularly relevant.

On the one hand, cycle-accurate NoC simulators such as BookSim2 [19], Noxim [20], Nostrum [21], and Garnet [22], originally designed for NoC simulations in monolithic chips, have been extended for multi-chiplet designs [23]. These extensions are useful for conducting simulations and examining various network topologies and routing algorithms for given traffic patterns. Our work is complementary, as we focus on providing the functionality to extend coherency across multiple coherent chips and thus produce realistic traffic that could be used by these NoC simulators.

On the other hand, cycle-accurate full-system simulators like gem5, with its Ruby CHI memory system, can accurately model monolithic many-core architectures [24]. Some methods coordinate multiple gem5 instances to simulate interconnected chips [25]–[27], but this setup requires each instance to run its own OS. Other approaches, such as OpenPiton [28], use a similar strategy but incorporate RTL for faster FPGA emulation. In contrast, our work uniquely enables the simulation of systems consisting of multiple coherent chips running under a single OS.

IX. CONCLUSION

We introduce c2c-gem5, a new cache-coherent chip-to-chip interconnect model to accurately simulate multi-chip architectures in gem5. To extend single-chip coherence to the multi-chip system, we propose a new chip-to-chip interface controller and minimal modifications to the CHI directory and message format to enable tracking of remote sharers and initiators. We use a real hardware platform to calibrate our model and run PARSEC workloads on two CHI chips operating under the same OS.

In future work, we will study the scalability of our model across many chips and improve the accuracy of the local NoC and chip-to-chip network by integrating our simulator with a NoC simulator, such as Garnet. We believe that c2c-gem5 takes an important step forward in enabling accurate simulation of cache-coherent multi-chip architectures and hope to inspire future work exploring optimizations in such architectures.

X. ACKNOWLEDGMENTS

This work was partially supported by ANRT-CIFRE Grant No. 2021/1201 and the European Processor Initiative (EPI) project, supported by the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No. 800928 and Specific Grant Agreement No. 101036168 (EPI SGA2).
REFERENCES

[1] S. Naffziger, N. Beck, T. Burd, K. Lepak, G. H. Loh, M. Subramony, and S. White, "Pioneering chiplet technology and design for the AMD EPYC and RYZEN processor families: Industrial product," in Proceedings of the Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 57–70.
[2] CCIX Consortium, "CCIX Consortium: Cache Coherent Interconnect for Accelerators (CCIX)," 2024, accessed: 2025-01-17. [Online]. Available: https://www.ccixconsortium.com/
[3] UCIe Consortium, "UCIe Consortium: Universal Chiplet Interconnect Express (UCIe)," 2024, accessed: 2025-01-17. [Online]. Available: https://www.uciexpress.org/
[4] J. Stuecheli, W. J. Starke, J. D. Irish, L. B. Arimilli, D. Dreps, B. Blaner, C. Wollbrink, and B. Allison, "IBM POWER9 opens up a new era of acceleration enablement: OpenCAPI," IBM Journal of Research and Development, vol. 62, no. 4/5, pp. 8–1, 2018.
[5] C. Chen, J. Yin, Y. Peng, M. Palesi, W. Cao, L. Huang, A. K. Singh, H. Zhi, and X. Wang, "Design challenges of intra- and inter-chiplet interconnection," IEEE Design and Test, vol. 39, no. 6, pp. 99–109, 2022.
[6] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, B. Beckmann, S. Bharadwaj et al., "The gem5 simulator: Version 20.0+," arXiv preprint arXiv:2007.03152, 2020.
[7] M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset," ACM SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92–99, 2005.
[8] ARM Limited, "Arm AMBA CHI architecture specification," accessed: 2025-01-17. [Online]. Available: https://developer.arm.com/documentation/ihi0050/latest/
[9] ——, "Arm N1SDP specifications," accessed: 2025-01-17. [Online]. Available: https://developer.arm.com/Tools%20and%20Software/Neoverse%20N1%20SDP
[10] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008, pp. 72–81.
[11] L. Bertran Alvarez and D. Novo, "c2c-gem5: Full system simulation of cache-coherent chip-to-chip interconnects," 2025, accessed: 2025-01-17. [Online]. Available: https://gite.lirmm.fr/adac/c2c-gem5#
[12] ARM Limited, Arm Neoverse N1 Core Technical Reference Manual, accessed: 2025-01-17. [Online]. Available: https://developer.arm.com/documentation/100616/0401/
[13] SiFive, Inc., "SiFive Performance P870 Core Series," 2024, accessed: 2025-01-17. [Online]. Available: https://www.sifive.com/cores/performance-p870d
[14] Semidynamics Technology Services, S.L., "Tensor unit technology," 2024, accessed: 2025-01-17. [Online]. Available: https://semidynamics.com/en/technology/tensor-unit
[15] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[16] J. Randall, P. Benedicte, and T. Ta, "CHI-based Ruby protocol," https://github.com/gem5/gem5/commit/b13b4850951b4507cabee27a8c2, 2021, accessed: 2025-01-17.
[17] B. M. Beckmann and A. Gutierrez, "The AMD gem5 APU simulator: Modeling heterogeneous systems in gem5," in Tutorial at the International Symposium on Microarchitecture (MICRO), 2015.
[18] A. Gutierrez, B. M. Beckmann, A. Dutu, J. Gross, M. LeBeane, J. Kalamatianos, O. Kayiran, M. Poremba, B. Potter, S. Puthoor et al., "Lost in abstraction: Pitfalls of analyzing GPUs at the intermediate language level," in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 608–619.
[19] N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. J. Dally, "A detailed and flexible cycle-accurate network-on-chip simulator," in Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2013, pp. 86–96.
[20] V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, "Noxim: An open, extensible and cycle-accurate network on chip simulator," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2015, pp. 162–163.
[21] Z. Lu, R. Thid, M. Millberg, E. Nilsson, and A. Jantsch, "NNSE: Nostrum network-on-chip simulation environment," in Swedish System-on-Chip Conference (SSoCC), 2005.
[22] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, "GARNET: A detailed on-chip network model inside a full-system simulator," in Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009, pp. 33–42.
[23] H. Zhi, X. Xu, W. Han, Z. Gao, X. Wang, M. Palesi, A. K. Singh, and L. Huang, "A methodology for simulating multi-chiplet systems using open-source simulators," in Proceedings of the International Conference on Nanoscale Computing and Communication (NanoCom), 2021, pp. 1–6.
[24] A. Portero, C. Falquez, N. Ho, P. Petrakis, S. Nassyr, M. Marazakis, R. Dolbeau, J. A. N. Cifuentes, L. B. Alvarez, D. Pleiter et al., "Compesce: A co-design approach for memory subsystem performance analysis in HPC many-cores," in Proceedings of the International Conference on Architecture of Computing Systems (ARCS), 2023, pp. 105–119.
[25] A. Brokalakis, N. Tampouratzis, A. Nikitakis, I. Papaefstathiou, S. Andrianakis, D. Pau, E. Plebani, M. Paracchini, M. Marcon, I. Sourdis et al., "COSSIM: An open-source integrated solution to address the simulator gap for systems of systems," in Proceedings of the 21st Euromicro Conference on Digital System Design (DSD), 2018, pp. 115–120.
[26] A. Mohammad, U. Darbaz, G. Dozsa, S. Diestelhorst, D. Kim, and N. S. Kim, "dist-gem5: Distributed simulation of computer clusters," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017, pp. 153–162.
[27] F. Schätzle, C. Falquez, S. Heinen, N. Ho, A. Portero, E. Suarez, J. Van Den Boom, and S. Van Waasen, "Modeling methodology for multi-die chip design based on gem5/SystemC co-simulation," in Proceedings of the 16th Workshop on Rapid Simulation and Performance Evaluation for Design (RAPIDO), 2024, pp. 35–41.
[28] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov, M. Shahrad, A. Fuchs, S. Payne, X. Liang et al., "OpenPiton: An open source manycore research framework," ACM SIGPLAN Notices, vol. 51, no. 4, pp. 217–232, 2016.