0% found this document useful (0 votes)
32 views81 pages

Rapp 1016

This thesis explores the design of an offload engine that connects the AXI protocol with the emerging UCIe standard for Die-to-Die communications, addressing the need for seamless integration of heterogeneous chiplet technologies. The work discusses the design process, challenges faced, and potential improvements, aiming to create a compatible interface that can enhance interconnection technologies. This research is significant for designers and researchers looking to integrate AXI-based architectures into heterogeneous packages.

Uploaded by

danialkrishtofer
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views81 pages

Rapp 1016

This thesis explores the design of an offload engine that connects the AXI protocol with the emerging UCIe standard for Die-to-Die communications, addressing the need for seamless integration of heterogeneous chiplet technologies. The work discusses the design process, challenges faced, and potential improvements, aiming to create a compatible interface that can enhance interconnection technologies. This research is significant for designers and researchers looking to integrate AXI-based architectures into heterogeneous packages.

Uploaded by

danialkrishtofer
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 81

Architecture Exploration

for Die-to-Die Communications

Anshul Rao & Pinelopi Georgiou

Master of Embedded Electronics Engineering


Department of Electrical and Information Technology
Faculty of Engineering, LTH, Lund University
Host company: Ericsson
August 2024
Architecture Exploration
for Die-to-Die Communications

Anshul Rao & Pinelopi Georgiou

Master of Embedded Electronics Engineering


Department of Electrical and Information Technology
Faculty of Engineering, LTH, Lund University
Host company: Ericsson
August 2024
Abstract

Novel circuit design and manufacturing methodologies are emerging into the market. Het-
erogeneous Integration allows for chiplet technologies of different process nodes to be combined
in one large package. To allow for seamless communication between them, a definition of stan-
dards is necessary, and so is the construction of interconnects that follow them. In this thesis,
we present an offload engine which bridges the popular System-On-Chip (SoC) communica-
tions standard, Advanced eXtensible Interface (AXI), with the evolving Universal Chiplet
Interconnect Express (UCIe). We dive into the details of the offload engine design process, the
challenges, and the novice methods used to connect the AXI protocol to the UCIe interface.
The resulting architecture is meant to be mounted in the Protocol Layer of the UCIe model.
Balancing complexity, latency, and size of the design we justify each decision that was taken
along the process. In this work, we also present future improvements which can be made to
the design. The result is an interface which may be used as part of a Die-to-Die interconnec-
tion, fully compatible with the UCIe standard. It may be the start of a commercial product
or provide insights into interconnection technologies. This work is relevant for designers and
researchers alike who wish to integrate their AXI-based architectures into a heterogeneous
package.
Popular Science Summary

In the past decade, novel design and manufacturing methods have been introduced to
the field of integrated circuits. As society’s needs for design complexity grows and current
architectures, methodologies, packaging and fabrication techniques which are widespread in
the industry require constant upgrades. Miniaturization, or the shrinking of circuit components
is pushing technology to its physical limit. Heterogeneous Integration combines circuits with
different origins of fabrication and process technology for a more versatile production flow.
Meanwhile, there is also a need to bridge existing higher-level pre-existing standards with
these new techniques. In this study, we present our very own, AXI to Die-to-Die interface.
First, we introduce the background knowledge necessary to understand the context of our work,
as well as the standards we chose to follow, and why we follow them. We dive into the details
of the AXI protocol, which is widely used in SoC architectures, as well as UCIe, an emerging
die-to-die communication standard. Our work aims to bridge these two popular standards in
our very own Offload Engine. We will explain the details of our architecture, the challenges
we face and how we altered our work to overcome them, as well as the verification methods we
used to test our Engine. UCIe is an open, multiinstitutional effort to solve the heterogeneous
communication problem, in which multiple partners (Intel, Qualcomm, AMD, Arm, TSMC
and more) are actively involved. As the standard is developed, tested and integrated into more
and more designs, the relevance of our work comes to light.

i
Acknowledgements

We wish to express our deepest gratitude to our supervisors, Professor Liang Liu from
Lund University and Faruk Sande from the Ericsson BE team, for the knowledge and technical
support they have provided us throughout the development of this thesis. Their insights and
assistance have been key in our design.
We would also like to thank Ericsson Lund for supporting this thesis topic and providing
us with the environment to explore our ideas and expand our technical background. Our
supervisor Faruk Sande has guided us throughout the whole process, aided us in exploring our
ideas and providing invaluable feedback. We would also like to express our gratitude towards
Lund University, Faculty of Engineering (LTH) for the enlightening academic experience and
skills required for this thesis topic.
Beyond that, we would also like to thank the open source community of the UCIe Con-
sortium as well as the ARM community for providing us with the materials needed as well
as a space for discussion on the topic. Their support throughout the design phase has been
substantial.
To finish, we are also grateful to our friends and family during this time who stood by us
with unwavering support and motivated us to do our best. In addition, we are thankful to
each other for the natural collaboration and mutual drive to achieve our greatest work.

ii
Abbreviations

• SoC: System on Chip


• SiP: System in Package
• ASIC: Application Specific Integrated Circuits
• OSI: Open Systems Interconnection
• CMOS: Complementary Metal Oxide Semiconductor
• AXI: Advanced EXtensible Interface
• CHIPS: Common Heterogeneous Integration and IP Re-use Strategies
• AMBA: Advanced MicroController Bus Architecture
• AIB: Advanced Interface Bus
• APB: Advanced Peripheral Bus
• AHB: Advanced High Performance Bus
• FSM: Finite State Machine
• LSM: Link State Machine
• PCIe: Peripheral Component Interconnect Express
• ACK/NAK: Acknowledged/Not Acknowledged
• CXL: Compute EXpress Link
• QoS: Quality Of Service
• CRC: Cyclic Redundancy Check
• UCIe: Universal Chiplet Interconnect Express
• D2D : Die-to-Die
• RDI: Raw Die-to-Die Interface
• FDI: Flit-Aware Die-to-Die Interface
• FIFO: First-In-First-Out
• FD: Flit Deconstruction
• FC: Flit Construction
• PSM: Protocol State Machine

iii
Contents

1 Introduction 1
1.1 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Scientific Background 4
2.1 The dawn of Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 The OSI Reference Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Another world: TCP/IP Reference Model . . . . . . . . . . . . . . . . . . . . . 9
2.4 Die to Die Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Universal Chiplet Interconnect Express . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 UCIe Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.2 The Protocol Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.3 The Die-to-Die Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.4 The Physical Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 AXI Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.1 AXI Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.2 AXI Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.3 AXI Basic Transaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7 Pre-Existing Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 Design Implementation 33
3.1 Offload Engine Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 AXI Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Flit Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.3 Protocol Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Architecture Development . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Flit Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4 Results 58
4.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.1 Block Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.2 System Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5 Conclusion 67
5.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Possible Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

iv
Chapter 1

Introduction

In the late 1960s, Frank Wanlass’s work towards developing Complementary Metal-Oxide-
Semiconductor (CMOS) technology laid the foundation for an exponential rise in the semi-
conductor Industry. CMOS is the basic framework for most electronic devices with notable
advantages over its Transistor–transistor logic (TTL) counterparts in terms of low consump-
tion in idea state, speed, and wider voltage range. The demand for faster and more efficient
chips has sparked immense developments.
However, despite the extensive work on minimizing the CMOS chips for smaller and faster
processing, several challenges and limitations have been identified. As the transistors decrease
in size and get more densely packed, notably 3nm chips used in the new Apple iPhones.
Power efficiency particularly becomes challenging, with increased power density and leakage
currents leading to higher power consumption per chip. Shrinking feature size leads to yet
another challenge, Interconnect scaling. The increase in resistance-capacitance (RC) delays
due to Interconnect scaling leads to limited signal propagation speeds hampering the overall
performance. Moreover, semiconductors are becoming susceptible to process variation and
reliability issues caused by electromigration and ageing effects on the chips, further depleting
on-chip performance. Finally, increased complexity in both design and fabrication processes
has increased manufacturing costs for the major players in the semiconductor industry.
Enter chiplets – modular semiconductor components that offer a novel solution to the
challenges of traditional monolithic chip design. By breaking down complex integrated chipts
(ICs) into smaller, specialized modules, chiplets empower greater design flexibility. The study
found that chiplet-based design approaches can achieve up to a 20% reduction in design time
compared to monolithic designs. Additionally, chiplets improve yield rates [1]. By isolating
functionalities into separate chiplets, defects in one chiplet can be discarded without impacting
the entire chip, leading to yield improvements of up to 30% according to Zhang et al. [2].
Chiplets also enhance customization options, allowing designers to mix and match chiplets
from various vendors to create application-specific SoCs.
The emergence of chiplets marks a paradigm shift in semiconductor technology, present-
ing a viable path to surmount the limitations of conventional monolithic chips and unlock
unprecedented levels of performance, scalability, and energy efficiency.

1
Chiplets offer potential solutions to the limitations of traditional chip design, yet their adop-
tion faces challenges. Integrating and coordinating communication between chiplets within
larger systems pose engineering hurdles. Concerns about yield, reliability, and testing strate-
gies remain areas of active research. Despite these challenges, chiplets enable incremental
upgrades, reduce time-to-market, and promote the reuse of proven designs and intellectual
property
The focus of this thesis lies in designing an offload engine that integrates a Die-to-Die
communication protocol like UCIe(Universal Chiplet Interconnect Express) to an established
internal SoC (System on Chip) communication protocol the AXI(Advanced eXtensible Inter-
face). With acquiring a working understanding of similar physical communication interface IPs
like PCIe(Peripheral Component Interface express), in the simulation environment provided
by Ericsson.
This thesis is divided into four parts. We start with a scientific background which aims
to provide a deeper understanding of the key concepts of AXI, UCIe, the protocol engine
and how they all work together. Next, we proceed with the implementation of the offload
engine through the designing and scripting of the architecture. Following the implementation,
the results will be presented to provide the evaluation of the potential and limitations of
the offload engine for Die-to-Die Communication integration. Finally, we will examine the
limitations, possible improvements, and other important considerations for the offload engine
in the discussion chapter.

2
1.1 Stakeholders

This thesis project was proposed and hosted by the ASIC SoC Integration and BE team
at Ericsson, Lund, Sweden. The company proposed the research topic, which focuses on
evaluating the Die-to-Die communication protocols and designing an offload engine to integrate
Die-to-Die communication in the SoC. Ericsson also provided essential resources and support
for the successful completion of this project

Table 1.1: Stakeholders involved in this thesis and their benefits

Stakeholders Benefits
Ericsson Lund 1. Academic and Technical support from the team.
2. Access to designing and simulating tools, as well as to the UCIe Consortium.

UCIe Consortium 1. Providing the standard which we follow.


2. Engaging in relevant protocol discussion in the forum.

3
Chapter 2

Scientific Background

In this chapter, we aim to cover all the necessary topics and provide background knowledge
so that the reader can understand the depth of our work.
There will be an overview of computer networks, the previously presented die-to-die sys-
tems, an analysis of the specification this work is focused on, and a presentation of the princi-
ples on which the work is based such as the commonly known PCIe (Peripheral Interconnect
Express) protocol layer and the AXI (Advanced EXtensible Interconnect) protocol, as well as
a look into a publication of a solution to the problem we worked to solve.

2.1 The dawn of Computer Networks

In the mid-1970s advancements in computer network technology grew rapidly. To maintain


coherence and to establish connectivity between different systems a universal standard would
have to be established; mirroring the current state of Die-to-Die technology is in today.
The innovation that led to the eventual forming of modern computer networks was a shift
from circuit-switching to packet-switching technology. The advantage of this technology is
that it can promote the sharing of computer resources. In circuit switching, a channel has
to be used exclusively by one connection in a set time. Packet switching is the delivery of
data packets of varying sizes on the same channel but with different destinations. A packet
communication network includes a transportation mechanism for delivering data between com-
puters or between computers and terminals. Various attempts at building packet-switching
computer networks were made in the late 1960s and 1970s with two notable, and government-
backed, networks being ARPANET [3] and CYCLADES [4] funded by the USA and France
respectively. The ambitious idea to connect computer networks from across the nation is now
deemed possible due to the resource-sharing capabilities of this novel type of network.
Beyond governmental research, more institutions from across the globe such as IBM or
European telephone companies recognized the vast potential of packet switching and started
working on their networks. A global interest to introduce a form of standardization arose,
to make packet switching viable across different computer networks. With this motivation in
mind, in 1972 the International Network Working Group (INWG) was formed [5]. However the
computer and telecommunications industries were not yet ready to accept the connection-less

4
Figure 2.1: Connections in the CY- Figure 2.2: Connections in the
CLADES network (1973) ARPANET network (1970)

style of communication Vinton Cerf, the first chairman of INWG, was proposing due to their
heavy investment in connection-orientated communications. He later resigned and started
working in ARPA (Advanced Research Projects Agency) on their packet-switching network.
In 1977, members of IWGN who represented the British computer industry proposed the Open
Systems Interconnection (OSI) [6] under a need to develop network standards needed for
open working. This proposal was made by the International Standards Organization (ISO).

2.2 The OSI Reference Model

The OSI Reference model is described as a framework to coordinate the development of


OSI standards, an abstract description of inter-process communication. OSI is concerned with
how systems exchange information, with the interconnection aspects of cooperation between
systems. The model is composed of abstraction layers, commonly known as just layers, meant
to give a clear presentation of the interconnection structure without specifying implementation
details. Inside each layer, independent functions are executed which aid the communication
process. Each layer provides value to the transmitted information which is available to the layer
above it, and all layers reside within the system. The rules and conventions in the conversations
between layer N of one machine and layer N of another are defined as a protocol. A protocol
may define things such as the frequency of data, the format of the data, and what needs to
precede it or succeed it. The list of protocols used by all layers in the system is called a
protocol stack. Each pair of adjacent layers is intervened by an interface.
In reality, the OSI reference model is not a model for implementation. It was created with
hopes of interoperability and standardization of networks across the globe but faltered due to
the lack of a strict, solid architecture. The predecessor to the OSI model, the Internet Protocol
Suite (commonly known as TCP/IP) which was the inspiration and foundation for OSI is still
considered the backbone of many systems, primarily the Internet. While working groups of
OSI pushed industry to follow the structure, of the Internet’s top advocates, Einar Stefferud,
put it simply: “OSI is a beautiful dream, and TCP/IP is living it!” [7]. Nevertheless, the
seven-layer model has been adopted as a way to talk about open systems networking, which
is the reason we present it in this thesis. Many concepts and terminology introduced by OSI

5
are considered the basic vocabulary of network architecture. A depiction of the OSI reference
model is shown in Figure 2.3.

Figure 2.3: Seven layer reference model [8]

In the reference model, seven layers are defined. A brief explanation of each layer follows
so that the reader may get a sense of the context in which our work fits.
• The Application Layer
The Application Layer serves as the gateway for an application process of a system to
access the OSI environment. An application process may be represented by an applica-
tion entity. Services provided by the application, besides the transfer of data, include the
identification of intended communication partners, the determination of the acceptable
quality of service, synchronization, agreement on security aspects, and many others. The
Application Layer comes with its own set of protocols, some of which are in the public
domain. A popular example is HTTP (the HyperText Transfer Protocol [9]) which is
published and maintained by the Internet Engineering Task Force (IETF). The Web is
the Internet’s client-server application which allows the transfer of information by the
users. HTTP, which is the Web’s application-layer protocol, defines the format and
sequence of messages exchanged between an Internet browser (the interactive interface
of the Web application) and the user. So application-layer services and protocols are
often just one part of a larger application. The IETF has published SMTP or Simple
Mail Transfer Protocol [10] known as the basis of electronic mail. It is but one piece
of an e-mail application, such as Microsoft Outlook. Other protocols may be propriety
and not available to the public, such for example the application-layer protocols used by
Skype.
• The Presentation Layer The Presentation Layer is responsible for the syntax and
semantics of the information transmitted between two application entities. This prevents
any problems related to a mismatch in the representation of data between application
entities- it provides syntax independence. Examples of this are data conversion such as
character code translation from ASCII to EBCDIC or integer to floating point. Different
file formats accessed by the Application Layer such as JPEG (Joint Photographic Experts
Group) or PDF(Portable Document Format) may be translated to a standard format
fit for transmission and visa versa. An example of an implementation of this layer is

6
NDR (Network Data Representation [11]) which is used by a software system called the
Distributed Computing Environment. Modern protocols which cover presentation layer
functionalities are not strictly defined in the layer reference model from OSI.
• The Session Layer The Session layer allows users on different machines to establish
sessions between them. It handles the synchronization and organization of dialogue
between two entities. Sessions created are tied to session addresses. Session Layer
functions also execute token management and synchronization- which is the reset of a
session the re-synchronization of a session, or the continuation of the dialogue from an
agreed re-synchronization point, or defined state, with potential loss of data. Famous
examples of protocols in this layer include the AppleTalk Session Protocol [12] which
was produced by Apple in the mid-1980s as part of the AppleTalk protocol suite but
eventually phased out when TCP/IP (Transmission Control Protocol/Internet Protocol)
networking standards became preferred over OSI.
• The Transport Layer The Transport Layer plays a vital role not only in the OSI
reference model but in most network communications models in use today, having been
widely adopted by designers. The Transport layer is the last end-to-end layer in the
model, meaning, it is the last layer in which entities have logical communication be-
tween one another uninterrupted by other entities. The layers under the Transport
Layer may only communicate with their immediate neighbors, and not between the ulti-
mate source and destination machines, which may be separated by other machines such
as routers or switches. The Transport layer connects Session Layer entities to transport-
addresses providing the bridge to their corresponding entities on another machine. It
then maps transport-addresses to network-addresses for the network layer, multiplexing
the transport connections onto network-connections. It also controls these connections,
implements error detection and error recovery on the end-to-end data, supervisory func-
tions and flow control and packeteting data into correct formats. The Protocol Layer in
the UCIe standard takes on the role of the OSI Transport layer in multiple ways. Most
notable examples of a Transport Layer protocol include UDP (User Datagram Protocol
[13])- which is connection-less and TCP (Transmission Control Protocol [14]) which in-
vokes a connected-oriented, and more reliable service to the invoking application. Both
protocols are published by the Internet Task Force, and both are used by the Internet.
• The Network Layer The Network Layer has arguably the heaviest load of all layers
in a communications model- and most reference models have their version of it. The
Network layer provides the functional and procedural means for connection between
entities of the Transport Layer. The layer can establish, maintain and terminate network-
connections. The original OSI reference model categorizes the many functions of the
network into sublayers or subgroups, with the most characteristic of them being routing
and relaying. The network layer assigns each transport entity a network address and
manages the connections between all the entities in the system by employing a routing
algorithm. These connections may be static or dynamic, remote or local- this is handled
by the network layer and obscured to the transport layer. The network layer has the
responsibility to deal with resource allocation and scheduling to avoid congestion in the
network. In many complex cases, the nature of this responsibility is complex, but it may
be simple, such as in a broadcasting network.
The network layer is not concerned with the physical means of the connection, which is
handled by the lower layers, but with the network layout on a device level. This layer and
its functionality exist in endpoints of a network, but also in intermediate devices, such
as routers. An example of a network and devices which employ network layer functions

7
is shown in Figure 2.4.

Figure 2.4: Network Example with visible abstraction Layers

• The Data Link Layer As we traverse down the reference model into the lower levels and
come closer to a physical implementation of a network we realise that the aforementioned
connections between entities of host devices are segmented into smaller connections or
links. The management of these links is managed by the data link layer. Looking
back at Figure 2.4, all devices are either routers or of higher complexity, but in reality,
there might be more intermediate devices, such as a switch, which hold data link layer
functionality. The Data Link Layer presents a seamless connection to the Network Layer,
applying error connection and sometimes flow control on the data sent by the link which
connects two data link layer entities. The passing from the network layer to the data
link layer is also segmented into data frames, which are of a smaller size than the
packets given by the transport layer. Two nodes, or endpoints of a link, communicate
by exchanging data frames and acknowledgement frames. There are two different types
of link layer channels- broadcast channels such as wireless local-area networks (WLANs)
or hybrid fiber-coaxial cables (HFCs) and point-to-point communication links such as a
long-distance link between two routers, or the connection between a computer and an
Ethernet switch. When it comes to a broadcasting channel multiple hosts are connected
to the same communications channel so a medium access protocol is used to coordinate
frame transmission. Coordinating access to a simple link is easier- an example of a
protocol which serves this is the Point-to-Point protocol, used in serial cables and phone
lines as well as cellular connections. Ethernet is a common data link protocol used in
WLANs.
The Link Layer is often implemented in a network adapter, sometimes known as a
network interface card (NIC). The controller inside the NIC is usually a special-purpose
chip therefore the link layer functions are implemented in hardware. Later on, we will
see the parallel between the data link layer and the ’adapter’ component of the UCIe

8
stack.
• The Physical Layer The Physical Layer can be considered as the implementation of
the Network Layer. It provides the mechanical, electrical functional and procedural
means to activate, maintain and de-activate network layer connections. Physical layer
entities are connected through a physical medium, through which the bits of information
and transmitted. The type of medium as well as the digital and analog circuits that aid
in its transmission are all part of the physical layer. The medium, or media, can be
grouped into two categories: guided, such as copper wires or fibre optics, or unguided
such as wireless or satellite transmissions. Phone networks across the globe rely on
electromagnetic transmission utilizing radio frequencies of the electromagnetic spectrum.
From the full 104 Hz to 1016 Hz electromagnetic spectrum, the different uses of it for
different physical media are shown in Figure 2.5.

Figure 2.5: The electromagnetic spectrum and its uses for communication [15]

There are various for encoding data into frequencies that have been developed over the
years. The process of converting digital bits and symbols into analogue signals is called
digital modulation. A modem (modulator, demodulator) is a device that converts a
stream of digital bits to an analogue signal. Along with modulation, hardware in the
physical layer also performs multiplexing, which is the transmission of multiple signals
on the same channel. Multiplexing may be done using time or frequency intervals, known
as Code-division multiple access (CDMA).

2.3 Another world: TCP/IP Reference Model

The TCP/IP Model is a model which predates the OSI model by 10 years. This reference
model has influenced much of the OSI architecture, but in turn, developments in OSI influenced
the trajectory of TCP/IP as well. It has its origins in the protocol Vinton Cerf published with
Robert Kahn [16]. This was the first true description of TCP, described by Vinton and Robert
as a Transmission Control Program, which has now grown to mean Transmission Control
Protocol, refined and defined as a standard in the Internet community. Part of the acronym,
IP, refers to the Internet Protocol used in its lower layers to format data packets.

9
The focus of the TCP/IP model is the protocols rather than the layer. That is why in its
visual representation, seen in Figure 2.6, the parallels between the reference model and OSI are
not straightforward. The rigid structure is limiting- omitting protocols in OSI allows the user
to replace them as the technology changes, which is the main purpose of the layered structure
in the first place. However, the fact that OSI was created with no protocols in mind led to its
demise as there was no practical experimentation to support the choices in the architecture.
Focusing on functionality and having methods by defining several specific protocols for the
model is what made TCP/IP so successful and widespread in today’s internet. Protocols used
in TCP/IP may also be used in an OSI context, such as in the Transport Layer of OSI as
previously described.

Figure 2.6: Model differences between OSI and TCP/IP [15]

Our analysis of computer networks ends here. For more information on network communi-
cation methods, as well as some insights into the politics that surrounded OSI and TCP/IP,
the reader may refer to Andrew Tanenbaum and David Wetherall’s "Computer Networks" [15]
as well as other previously cited sources. The purpose of this section was to lay the foundation
for the layered architecture that UCIe provides and to give a brief insight into how the layered
architecture concepts came to be in the first place. In the 1970s and through standardization
the world was striving to connect computers and devices of any type across the globe. In the
2000s and this work, we strive to connect chiplets of any die across the package, an innovation
which will prove to be just as revolutionary.

2.4 Die to Die Systems

In 1965 Gordon Moore observed that due to the growing complexity of circuit designs, the
number of transistors in an integrated circuit would double every two years [17]. To counter
the increasing area in integrated circuits, the semiconductor industry worked towards the most
logical solution to the problem: to decrease the area of the transistor. This movement, whose
basic parameter is largely the minimum dimension that can be lithographically reproduced:
ie, the gate length of the transistor, is known as miniaturization. However, the technology of
CMOS digital circuits has been significantly decreasing over the last decade with every new

10
node being ramped into manufacturing roughly every two years as designers and foundries
strive to keep up with the accruing amount of transistors needed for their ever-growing design
complexity.
The shrinking process technology pushes a circuit to its physical limitations and comes
with its own sets of problems- short channel, tunnelling, and threshold voltage effects have
to be taken into consideration and design methods have to be altered. Power loss, reliability
issues and temperature instability are especially new challenges enhanced in the nanometer
scale [18].
The approaching physical limitation plateau is acknowledged in the decisions and predic-
tions of industry experts. The International Technology Roadmap for Semiconductors (ITRS),
a set of documents created and published by the Semiconductor Research Corporation with
the sole purpose of technological assessment and predictions of trends, was long guided by
Moore’s law since 1998. Since the 2005 IRTS, an initiative has already begun that looked to
drive innovation beyond semiconductor scaling. Two sides of this initiative were More Moore,
or looking into novel digital integrated circuit methods such as Beyond-CMOS technologies
in parallel to miniaturization, and More than Moore, or focusing on the performance of
analogue components such as sensors. The objective of the roadmap is shown in a schematic
representation in Figure 2.7.

Figure 2.7: ITRS 2011 predicted trends in the semiconductor industry [19]

The combination of these two technologies is equally critical. The integration of CMOS and
non-CMOS-based technologies within a single package, also known as System-in-Package
(SiP), is becoming increasingly important. System in Package is a subset of a larger tech-
nological approach called Heterogeneous Integration. This superset combines different
packaging technologies to integrate dissimilar chips, photonic devices, or components with
different materials and functions, and from different fabless design houses, foundries, wafer
sizes and companies into a system or subsystem [20]. Heterogeneous integration encompasses
advanced packaging technologies such as multi-chip modules (MCMs), SiPs, 2.5D, fan-outs
and 3D-ICs. The interconnect standard which we follow in this work supports 2D, 2.5D and
as of this year 3D packaging, as we will discuss later.

11
Similarly to how computer networks were pioneered by the Advanced Research Projects
Agency Network of the United States Department of Defense (DARPA) in their ARPANET
network so is heterogeneous integration for chiplets [21]. In their CHIPS (Common Het-
erogeneous Integration and IP Reuse Strategies) program Board Agency Announcement they
propose to establish a modular design and fabrication flow for electronics systems which are
bound to an interface standardization. This refers to systems that can be divided into func-
tional circuit blocks, or chiplets, which are re-usable IP (Intellectual Property) blocks that
are predesigned and already realized in physical form. In short, the aim is to devise a modular
design and manufacturing flow for chiplets. In the BAA they solicit bids from outside compa-
nies to develop a large catalogue of third-party chiplets for commercial and military apps. As
proposed, CHIPS flow is expected to lead to a 70% reduction in design cost and turn-around
times.
In a multi-die architecture, chiplets need to be connected on a substrate or interposer
to provide data transfer across the whole package. A definition for such an interface that
binds chiplets to the interposer is then necessary to define and standardize the operation of
chiplet-based architectures.
When looking at a list of interconnect standards, most are proprietary and developed by
major semiconductor vendors, used in their environment for their advanced packages. There
are multiple initiatives, such as work by Open Compute’s Open Domain-Specific Architecture
Initiative which have been developing open standards such as the Open High Bandwidth
Interface (OHBI) [22] or Bunch of Wires (BoW) [23]. Notably, both are open standards,
encouraging the growth of a chiplet marketplace. However, both are only specified for the
PHY or physical layer. In this category also falls the Advanced Interface Bus (AIB) [24], first
developed by Intel and then taken on by the CHIPS alliance. Later in this chapter, we will
introduce a previous work with a similar objective to ours which utilizes this standard.
A new open standard that could significantly alter this dynamic is the Universal Chiplet
Interconnect express (UCIe) standard [25], which aims for chiplet interoperability across
vendors. UCIe has higher capabilities than the interconnects before and is defined across
multiple abstraction layers. Already in the UCIe specification 1.0, the UCIe protocol stack
has been established; The protocol layer, a die-to-die adapter layer, and the physical layer. It
draws from pre-existing standards: PCIe (Peripheral Component Interconnect Express) and
CXL (Compute Express Link), which are both open serial bus communication, interconnect
and architecture standards, making for an easier transition. Companies such as Synopsis and
TSMC have already released their IPs on the Die-to-Die adapter. The previous characteristics
are all reasons which make UCIe attractive and constitute the reason we follow this particular
standard for our work.

2.5 Universal Chiplet Interconnect Express

The creation of UCIe was motivated by the need to enable an ecosystem that supports dis-
aggregated die architectures. Much like OSI, UCIe is a layered, multi-protocol open standard-
with a specification defined to ensure interoperability across a wide range of devices with dif-
ferent performance characteristics [26]. UCIe supports multiple protocols already in practice
throughout the industry while also allowing the designer to implement their own, taking ad-
vantage of the rest of its layered architecture, a mode called Streaming. The designer is also
allowed to send raw data to the adapter, which is what we chose to do in our work.

12
An example of a package composed of heterogeneous chiplets connected by UCIe is shown
in Figure 2.8.

Figure 2.8: A package composed of CPU Dies, an Accelerator Die. an I/O Tile Die connected
through UCIe

By allowing compatibility with PCIe and CXL, UCIe can leverage the unique characteristics
of these standards, such as I/O or cache coherency, specifically meant to benefit the different
components. Learning from the past, with OSI’s shortcomings, UCIe is legitimized by pre-
existing standards that have been in use in the industry for years.
In this section, we will explain some key components of the standard in the next few sub-
sections. Firstly, we will present the architecture, talking briefly about each component. We
will largely focus on the Protocol Layer and its connection to the layer beneath it. This
information is from the current UCIe specification version 1.0, revision 2.0, published on
August 6th, 2024.

2.5.1 UCIe Components


The inspiration that UCIe draws from pre-existing standards can be seen in its protocol stack.
PCIe, which is also the basis for CXL, has a layered architecture which has been developed
over the years and has proven itself as a solid packet-based communication protocol stack.
The PCIe layer model is composed of a Transaction Layer, a Data Link Layer, and a
Physical Layer which links two PCIe devices together.
To understand the relation between PCIe and UCIe we have to highlight the purpose
and context in which it exists: PCIe is meant for devices in a PCIe architecture network-
where every device has a predetermined role and type in the network. One device can be
connected to many other devices, depending on its type and function, thus packets also have
to be addressed and routed. Since UCIe is meant to connect two chiplets as seen in Figure
2.8 the structure changes respectively, and the functions become more limited, specific, and

13
optimized. However, the basic structure is still there. We will refer to the PCIe stack further
in this section to explain the transition and context in which UCIe exists.
The UCIe layer stack can be seen in Figure 2.9 where the partition of its architecture is very
clear: a topmost Protocol layer, the intermediate Die-to-Die Adapter layer, and the Physical
Layer. Between the Protocol Layer and the Adapter rests the Flit-aware Die-to-Die Interface
(FDI) which is properly defined in the specification as a set of signals and different versions
with different widths for the mainband. Similarly defined is the Raw Die-to-Die Interface
(RDI) which is situated between the Adapter and the Physical Layer.

Figure 2.9: UCIe layers and functionalities of one device [26]

The UCIe data path on the physical bumps is organized in groups of Lanes or Modules. A
module is the smallest functional unit that is recognized by the AFE, or Analog Front End.
The maximum number of Lanes allowed for a Module is defined by the package type. An
instance of a Protocol Layer or D2D Adapter can send data over multiple modules according
to the bandwidth needs of the application.
The physical UCIe Link is composed of two connections: sideband and mainband.
 Mainband is the main data path of UCIe. Components of this connection are a for-
warded clock, a data valid pin, a track pin and data Lanes that make a module.
 Sideband is the connection used for parameter exchanges, register accesses for debug/-
compliance and coordination purposes or essentially the communication between the
Die-to-Die Adapters of connected devices. It consists of a forwarded clock pin and a
data pin in each direction. The clock is fixed at 800 MHz regardless of mainband config-
uration. The sideband logic is on auxiliary power and "always on" domain as it is used
to train, retrain, reset, and bring the devices out of a low-power state.
UCIe supports three heterogeneous integration packaging methods: Standard Package
which is 2D, Advanced Package or 2.5D, and as of August 2024, UCIe-3D, a fully vertical
packaging method.
For the Standard Package option, the number of Lanes can be 16, (Standard Package
x16) or 8 (Standard Package x32). This integration technology is used for low cost and long

14
reach, 10mm up to 25mm from bump to bump. The interface for the Standard Package is
shown in Figure 2.10, and the physical characteristics are shown in Table 2.1.

Figure 2.10: Standard Package Interface [26]

Index Value
Supported speed per lane 4 GT/s, 8 GT/s, 12 GT/s, 16 GT/s, 24 GT/s, 32 GT/s
Bump Pitch1 100 μm to 130 μm
Short Channel Reach 10 mm
Long Channel Reach 25 mm
Raw Bit Error Rate (BER) 1e-27 for ≤ 8 GT/s, 1e-15 for ≥ 12 GT/s

Table 2.1: UCIe Standard Package Characteristics

For the Advanced Package option, the number of Lanes can be 64, (Advanced Package
x64) or 32 (Advanced Package x32), as well as 4 additional pins for Lane repair purposes.
This integration technology is used for performance-optimized applications, with a channel
reach of less than 2mm from bump to bump. The interface can be realized with several
different interposers or methods, three of which are shown in Figure 2.11, and the physical
characteristics are shown in Table 2.2.
1
Bump pitch is the centre-to-centre distance between adjacent solder bumps on the semiconductor die or
substrate.

15
Figure 2.12: An example of 3D Die Figure 2.13: Connection of 2 chiplets
stacking [26] in UCIe-3D [26]

Figure 2.11: Advanced Package Interface [26]

Index Value
Supported speed per lane 4 GT/s, 8 GT/s, 12 GT/s, 16 GT/s, 24 GT/s, 32 GT/s
Bump Pitch 25 μm to 55 μm
Channel Reach 2 mm
Raw Bit Error Rate (BER) 1e-27 for ≤ 12 GT/s, 1e-15 for ≥ 16 GT/s

Table 2.2: UCIe Advanced Package Characteristics

For UCIe-3D, the UCIe architecture is significantly more durable than the previous pack-
aging technologies. Since in this scenario, chiplets are stacked one on top of the other, the
circuit and logic must fit within their bump areas, which must be identical. An example of
Die-to-Die stacking is shown in Figures 2.12 and 2.13.

16
Due to the high density of connections, lower operating frequencies and a simplified circuit
is allowed. The Die-to-Die adapter was also eliminated- there is no need for Retry or CRC
Mechanisms. The SoC logic connects directly to the physical layer. The physical layer is also
minimal, such as a simple inverter/driver. The width of each cluster, the id and the number
of leans for each module are also significantly increased: up to 80. Physical characteristics of
UCIe-3D are shown in Table 2.3.
Index Value
Supported speed per lane up to 4 GT/s
Bump Pitch ≤ 10mm for optimized mode,
> 10mm-25mm for functional
mode
Channel Reach is vertical 3D

Table 2.3: UCIe-3D Characteristics

The BER is low due to the low frequency and almost zero channel distance, however, the
spec doesn’t provide metrics as this addition is quite new.

2.5.2 The Protocol Layer


The primary role of the UCIe protocol layer is to handle the encapsulation of data into protocol-
specific formats known as Flits. These Flits are structured packets that enable the smooth
transmission of data between chiplets. The protocol layer supports multiple Flit modes, such
as PCIe Flit Mode, PCIe Non-Flit Mode, and both 68B and 256B Flit Modes as defined in
the CXL specification. Each of these modes serves distinct purposes, accommodating various
communication needs and ensuring compatibility across different protocols.
For PCIe communications, the protocol layer manages both Flit and non-Flit modes. In
Flit mode, it adheres to the PCIe Base Specification, ensuring that data packets are properly
formatted and transmitted. For non-Flit mode, the UCIe protocol layer utilizes the CXL.io
68B Flit Format, thus integrating non-Flit PCIe packets into the UCIe framework. This dual-
mode support is essential for maintaining flexibility and ensuring that the system can handle
diverse data traffic efficiently.
The protocol layer also supports CXL communications, with provisions for both 68B and
256B Flit modes. Each of the CXL protocols—CXL.io, CXL.cache, and CXL.mem—can be
negotiated independently, providing fine-grained control over data transactions. This inde-
pendence is crucial for optimizing the performance and efficiency of data transfers, as each
protocol can be tailored to the specific needs of the communication task at hand.
In addition to these well-defined protocol features, the UCIe protocol layer also includes
a streaming protocol that allows user-defined protocols to be transmitted. This flexibility
ensures that the UCIe framework can adapt to various custom protocols, further enhancing
its versatility. Additionally, a management transport protocol is included to facilitate the
transport of manageability packets, ensuring that system management and control messages
are efficiently handled.
The protocol layer’s functionality extends to critical operational aspects such as transaction
and flow control. It ensures proper sequencing and integrity of transactions, manages data flow
to prevent congestion, and implements error detection and correction mechanisms to maintain
data integrity. Furthermore, it maintains synchronization of data packets across different
chiplets, ensuring coherent and reliable communication.

17
Overall, the UCIe protocol layer is designed to manage the complexities of protocol-specific
data exchanges, ensuring that data is accurately formatted, transmitted, and received accord-
ing to the standards set by UCIe, PCIe, and CXL specifications. By supporting multiple Flit
modes and incorporating robust control mechanisms, the UCIe protocol layer plays a pivotal
role in maintaining the reliability, efficiency, and interoperability of chiplet-based systems.
Our design focuses on the 68B Format under the Streaming protocol, where we defined
AXI-based Flits and maintained a credit system and internal buffers to send and receive these
respective 128B flits. The reason for the choice of the streaming protocol is the flexibility of the
mode and the development of the offload engine focuses on creating unique flits that are de-
signed for AXI protocol, and the framework of having two instances of Flit Construction blocks
that generate and create flits for both the AXI Master and Slave interfaces simultaneously.

2.5.3 The Die-to-Die Adapter


According to the UCIe specification, the Die-to-Die Adapter is responsible for QoS services
with Reliable data transfer, Arbitration between different protocol Layers that may be con-
nected to it, as well as Initialization and Management of the Link. We will briefly discuss
these responsibilities in this subsection, focusing mostly on the features which are relevant to
our current work or could be implemented in a future improvement of our work.

Quality of Service in the Die-to-Die Adapter


The Adapter has features such as CRC computation, a Retry mechanism, and Runtime Link
Testing using Parity bits which are applied if the protocol layer has requested for them. A
brief explanation of the QoS, quality of service, and processes are explained below.
• Cyclic Redundancy Check, or CRC, is an error-detection code that has been used
in digital networks for decades. A polynomial function is applied to the bits of the data
and an output is produced. That output is packaged in the flit alongside the data. The
field for it varies on the type of flit chosen and was shown previously. The polynomial is
given in the specification:

(x + 1) ∗ (x15 + x + 1) = x16 + x15 + x2 + 1

this code produces a 3-bit detection guarantee meaning it can detect up to a burst of
3 consecutive bits which may have changed during transmission. The CRC is always
computed over 128 bytes of the message. For smaller messages, the message is zero-
extended in the MSB. Any bytes which are part of the 128B CRC message but are not
transmitted over the Link are assigned to 0. UCIe provides Verilog code for the CRC
code generation alongside the specification to the designers which is to be used as a
"golden reference".
• Retry is a mechanism which is found in the standards which are supported by UCIe.
In PCIe, there is a buffer called TLP (Transaction Layer Packet) Replay buffer found in
the PCIe Data Link Layer. In this buffer, packets are stored and deleted once there has
been an acknowledgement from the receiving device’s Data Link Layer that the packet
has been received properly (transaction completed). This is known as an ACK/NACK
protocol and the completion is sent in a DLLP (Data Link Layer Packet). The Retry
Scheme in UCIe is a simplified version of the modern PCIe Flit Retry mechanism. The
foundation which was described above serves to explain this mechanism, only in this
case, we have Nak and Ack flits.

18
• Runtime Link Testing using Parity is a UCIe mechanism to gauge the reliability
of the link by periodically inserting parity bytes in the middle of the data stream. This
is an optional mechanism enabled by the software of each device, then the appropriate
messages are exchanged via the UCIe sideband.
We will not look into the details of the sideband in this work as we do not utilize it, but
there is a specified sideband for control messages with specified formats and their flow in
UCIe much like PCIe messages. The dedication of a sideband for control messages
contributes to UCIe’s high performance.

Arbitration and Multiplexing in the case of multiple Protocol Layers


When enabled, two protocol stacks are allowed to share the same physical link while consuming
half the bandwidth. Both protocol layers are also allowed to independently utilize 100% of the
link in one scenario- in this case, the adapter must support round-robin arbitration between
them. In general, different arbitration schemes are allowed as long as they arbitrate per-
flit. The Stack multiplexer in the adapter will contain different Link state machines for each
protocol stack. Sideband messages sent by the adapter on the RDI will have fields indicating
which stack’s state machine the message is associated with.

Link Statement Management

The beginning of two device’s communication is preceded by a Link Initialization process.


During Link Initialization, the adapter communicates through the RDI with the remote part-
ner’s Adapter, and with the device’s protocol layer. This process happens every time a link
is reset, due to various factors. Different parts of Link Initialization involve different layers of
both devices and can be seen in Figure 2.14.

19
Figure 2.14: Stages of UCIe Link Initialization [26]

Stages 1 and 2 are relevant to the Physical Layer so we will discuss them in subsection 2.5.4.
Stage 3 is the responsibility of the Adapter, which acts according to information advertised
by the Protocol Layer. This process is divided into three parts:
• Part 1: Determine local capabilities. When the Physical Layer has finished Link Train-
ing, parameters about the link characteristics are made available to the protocol layer.
Link speed and configuration are among these, so then the Adapter determines if Retry
must be enabled for the Link Operation. The ability to support Retry would then be
advertised to the remote link partner through parameter exchanges.
• Part 2: Parameter Exchange with Remote Link Partner The parameter exchange
is the process in which each device advertises its capabilities to the other. This is done
by transmissions on the sideband by specific sideband messages the format of which is
specified in the UCIe specification. Among these capabilities are:
– Whether the device is an upstream or downstream port, a capability is relevant if
the device is a UCIe Retimer. Retimers are beyond the scope of this work.
– If Enhanced Multi-Protocol has been enabled, in the case of multiple protocol stacks
connected to the same adapter.
– The bandwidth allowed for Stank 0 and Stack 1 if multi-protocol stacks are enabled.
– If Management Transport Protocol is supported by the Protocol and Adapter Layer.
This new protocol is specific to version 2.0 of UCIe and it enabled the transport of
management port messages over the mainband.
– The various flit formats are supported in multi-protocol mode. 68B Flit Format,
Standard 256B End Header Flit Format, Standard 256B Start Header Flit Format,

20
Latency-Optimized 256B without Optional Bytes Flit Format, Latency-Optimized
256B with Optional Bytes Flit Format.
Parameter exchanges for different scenarios depending on the protocols supported are
shown in the specification but omitted from this thesis. The Adapter must implement a
timeout of 8 ms (-0%/+50%) for successful Parameter Exchange completion, including
all of Part 1 and 2. The timer only increments while RDI is in Active state.
• Part 3: FDI bring up flow This part is the most relevant for this work as it requires
our engagement as the protocol layer, and is reflected in section of Chapter 3. Once
FDI is in an Active state, it concludes stage 3 of the initialization stage and Protocol flit
transfer on Mainband may begin. The data width on FDI is a function of the frequency
of operation of the UCIe stack as well as the total bandwidth being transferred across
the UCIe physical Link (which in turn depends on the number of Lanes and the speed
at which the Lanes are operating). The flit formats which are allowed are decided by
the adapter before being communicated to the Protocol Layer.
Each FDI of a corresponding protocol stack has its state machine, the Adapter has its own Link
State Machine (LSM), and the RDI connected to it also has its state machine. The hierarchy
based on which the Adapter makes decisions is defined in the specifications and depends on
the state transitions. The transitions of importance to us are the transitions on
FDI requested by or reflected to the Protocol Layer, so we omit the details of this part;
the full FDI state machine can be seen in Figure 2.15. The Adapter has the following
capabilities:
• Retrain: The RDI propagates retraining of the link to all Adapter LSMs that are in
Active state.
• LinkError: can be raised by the physical layer or requested by the protocol layer.
• LinkReset or Disabled is negotiated with the remote partner via the sideband and
propagated to the RDI. LinkReset enables the re-negotiation of parameters.
• Power Management States: L1 and L2. The main difference between these low-
power states is the way to exit out of them, either by resetting the link or just retraining
it. These states are meant to indicate to the Physical Layer to perform power manage-
ment optimizations. These states are also negotiated via the sideband, and the remote partner
may choose not to acknowledge the request.

Figure 2.15: State Machine of the FDI [26]

2.5.4 The Physical Layer

The physical layer comprises not only the tangible aspect of the die-to-die interconnect,
but also the functions that prepare the data sent by the adapter, as well as the
gateway for communication with the linked device. The physical layer can be divided into two
functional entities: the Logical Physical Layer, and the Electrical Physical Layer.
The Logical Layer includes instructions and protocols that dictate the flow of flits on the
link. These are implemented by digital circuits. Besides managing the state of the physical
link, these functions are mostly associated with data transfer.
• Link initialization, training, and power management. Parameters that are ex-
changed between the two remote UCIe partners before data transfer can start include in-
formation about the link. A Link Training State Machine is followed, during which the
sideband is initialized first (SBINIT), followed by the mainband (MBINIT),
which itself has many sub-states. The parameters exchanged on the sideband
that aid during the training of the Mainband include:
– Voltage Swing
– Maximum Data Rate of the UCIe Link
– Clock Mode
– Clock Phase
– Module ID, for multi-module configurations
– UCIe-A x32, or the capability of an x64 Advanced Package Module to operate with
an x32 Advanced Package Module.

• Byte to Lane mapping for data transmission over Lanes has been a prime func-
tion of the physical layer since older versions of PCIe. Depending on the package and
lane width, bits can be placed in different configurations. An example of mapping of a
256B Flit standard on a Standard Package x16 is shown in Figure 2.16.

Figure 2.16: Mapping of 256B in an x16 interface [26]

• Interconnect redundancy remapping (when required) is a method used in Ad-


vanced Package to recover from faulty Lanes. It means remapping the bytes into the
remaining available lanes and disabling the unused Lanes. Remapping is done in the
Repair state.
• Transmitting and receiving sideband messages. The physical layer is responsible
for framing and transporting sideband packets over the UCIe Link, while direct sideband
access can originate from the Adapter or the Physical Layer, including register access
requests, completions or messages.
• Scrambling and training pattern generation. Scrambling is a technique used in
physical layers which eliminates the generation of repetitive patterns on the transmitted
stream. This is done because repetitive patterns result in large amounts of energy con-
centrated in discrete frequencies which results in significant electromagnetic interference.
Scrambling is done using a polynomial known to the receiver, which can un-scramble the
incoming data stream and obtain the data.
• Lane reversal is the inversion of the data lane’s connections within a module so that
they align with the numbering of the remote partner’s module pins. This is done by
reversing the Lane IDs.
• Width degradation, or managing differing widths, so that for example an x32 Ad-
vanced Package UCIe configuration may connect to an x64 Advanced Package UCIe
configuration.
The Electrical Layer constitutes the Analog Front End of UCIe. It varies between Ad-
vanced, Standard and UCIe-3D. One of the main functions of the Electrical Layer is connecting
and syncing to the Reference Clock, REFCLK, which comes from a single source on the
package and is distributed to both the Transmitter and the Receiver. It can be sourced to
the device from that package pin or forwarded from another Die on the package. The trans-
mitter circuit is composed of analog components such as serializers, PLLs (Phase Locked
Loops), DLLs (Delay Locked Loops), and others. The Driver in the transmitter is optimized
for simplicity and low power consumption, with a voltage swing of 0.4V. The receiver circuit

is composed of FIFOs, de-skew circuits, etc. It uses a receiver clock to sample incoming data.
De-skew, when necessary, is performed during Training. UCIe specification also provides the
channel characteristics for both Standard and Advanced Package, as well as UCIe-3D.

2.6 AXI Protocol

The Advanced eXtensible Interface (AXI) protocol is a landmark of modern system-on-chip (SoC)
design, providing a robust framework for high-performance communication between
sub-systems within the chip. The protocol was developed by ARM in 2003 [27], as part of
the extended Advanced Microcontroller Bus Architecture (AMBA), which delivers an open-
standard, on-chip interconnect specification for the connection and management of functional
blocks in an SoC. AXI has evolved through multiple versions, improving its capabilities to
meet growing market demands.
The AXI3 version of the protocol was released along with AMBA, providing a high-
bandwidth, high-performance and low-latency communication framework for a system-on-chip
environment. In 2010, ARM released the AXI4 iteration of the protocol, which extended its
capabilities: AXI4 supports bursts of up to 256 beats, adds the QoS feature, and removes
write data interleaving, making the protocol simpler and faster than its predecessor. With its
latest iteration, AXI5, the protocol adds cache stashing, data protection and poisoning
signalling, and cache de-allocation transactions, providing a robust and flexible framework for
various vendors and better integration.
The AXI protocol is designed with an extensive feature set that has helped it gain pop-
ularity over many similar protocols for communication in system-on-chip (SoC) design, such as
AHB (Advanced High-performance Bus) and PCIe.
• High performance: AXI offers burst-based transactions, where each transaction can
transfer multiple beats of data after a single address phase, and a pipelined architecture that
supports parallel processing and concurrently outstanding transactions. This improves both the
throughput and the overall performance within the chip.
• Low Latency: AXI protocol has a mechanism where the data channels and the ad-
dress/control information channels operate independently of each other. This feature
helps to reduce the latency in data transfer. This feature becomes crucial when it comes
to high-speed operations.
• Flexibility: The AXI protocol supports a wide range of data widths and burst
lengths for data transfers, making it adaptable to a variety
of applications and use cases. Thus the AXI protocol is used in everything from low-power devices to
high-performance computing systems.
• Inclusion of Unaligned Data transfer: The AXI interface supports unaligned memory
address locations, with address determination handled on the slave interface side.
This helps the system achieve efficient memory access, as data
alignment cannot be guaranteed in all cases.
• Quality of Service (QoS): AXI introduces a robust method of prioritizing different types
of traffic. This helps to provide the necessary bandwidth for highly critical data transfer,
in turn maintaining integrity and improving the performance of the overall system.
• Parallel Processing: The AXI protocol, with its pipelined architecture, supports outstanding
transaction issue and out-of-order completion across its independent channels. This improves
performance for the complex, multi-core architecture chips of today.

This section provides an extensive overview of the AXI protocol, beginning with its fun-
damental principles and basic architecture. We then move into the specific components
and signals governing the AXI protocol. By examining the transaction processes
and the practical implementation of the protocol, we gain a better understanding of AXI.

2.6.1 AXI Architecture


The AXI protocol is a burst-based, Master-Slave model, where the Master interface initiates
transactions and the Slave interface responds. This architecture provides great flexibil-
ity and data integrity during high-speed communication within a System-on-Chip (SoC).
The AXI protocol implements five distinct channels to manage the address and data flow between
the interfaces:
• Write Address Channel (AW): This channel manages the address and control infor-
mation sent from the master interface to the slave interface indicating the address and
nature of the data to be transferred.
• Write Data Channel (W): The channel manages the data transfer of data from the
master interface to the slave interface.
• Write Response Channel (B): This channel informs the write operation’s response
back to the master interface.
• Read Address Channel (AR): This channel manages the address and control infor-
mation sent from the master interface to the slave interface indicating the address and
the nature of data to be received.
• Read Data Channel (R): This channel manages the data transfer from the slave
interface to the master interface along with response information sent through.

Figure 2.17: AXI4 Read Transaction [27]

Figure 2.18: AXI4 Write Transaction [27]

The separation of address/control and data channels allows AXI to implement a pipelined
architecture, where outstanding transactions can be supported, thereby improving the
throughput and latency of the interconnect.

2.6.2 AXI Components
Channel definitions
All five independent channels of the AXI protocol consist of a set of information signals
and use a two-way VALID and READY handshake mechanism. The source uses VALID
to indicate valid data or address/control information on the channel. The destination uses the
READY signal to indicate its readiness to accept data or address/control information.
Address Channels
Both the read and write transactions have their own address channel. The address channel
carries the address and all the required control information for the transaction. The protocol
supports a wide range of burst mechanisms: variable-length bursts of up to 16 data
transfers per burst, transfer sizes of 8-1024 bits, wrapping, incrementing and non-
incrementing bursts, and system-level caching control.
Read data Channel
The read data channel carries both the read data and the response information to the read
data from the slave interface to the master interface. The channel supports varied data bus
sizes up to 1024 bits and extensive read response showing the completion status of the read
transaction.
Write data Channel
The write data channel carries the write data from the master interface to the slave interface
to be written. The channel supports varied data bus sizes up to 1024 bits wide and includes
write strobes to indicate which bytes of the data bus are valid.
Write data channel information is always treated as buffered so that the master can perform
write transactions without slave acknowledgement of the previous write transactions.
Write response Channel
The write response channel provides a way for the slave interface to indicate the write
completion response to the master interface. Completion is signalled once per burst transaction,
not per individual beat of data.

AXI Interconnect
Bigger systems consist of several master and slave interfaces connected through the Inter-
connect component. The interconnect serves as the bridge in a multi-master, multi-slave AXI
implementation, and plays a crucial role in balancing interface complexity against system
requirements. Most systems use one of three interconnect approaches:
• shared address and data buses
• shared address buses and multiple data buses.
• multilayer, with multiple address and data buses

Figure 2.19: AXI Interconnect [27]

AXI Register Slices


The unidirectional channels of the AXI protocol, and the absence of a fixed timing relationship
between the AXI channels, allow register slices to be inserted on any channel at the cost of
one additional clock cycle of latency. Register slicing can therefore be used to trade off
maximum operating frequency against cycle latency. For example, a processor can communicate
directly and quickly with a high-performance memory, while register slices are used on the less
performance-critical paths.
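To make the idea concrete, a register slice on a single channel can be sketched in a few lines of SystemVerilog. This is a minimal, one-deep version with generic, assumed signal names, registering only the VALID/payload path; it is an illustration of the concept rather than a production slice, which would typically use a two-entry skid buffer to sustain full throughput.

// Sketch of a forward register slice on one AXI channel (illustrative names).
// It registers the VALID/payload path and adds one clock cycle of latency.
module reg_slice_sketch #(
  parameter int W = 64                  // payload width (assumed)
) (
  input  logic         clk,
  input  logic         rst_n,
  // upstream (source) side
  input  logic         s_valid,
  output logic         s_ready,
  input  logic [W-1:0] s_data,
  // downstream (destination) side
  output logic         m_valid,
  input  logic         m_ready,
  output logic [W-1:0] m_data
);
  // Accept a new beat when the buffer is empty or the stored beat leaves this cycle.
  assign s_ready = !m_valid || m_ready;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      m_valid <= 1'b0;
    end else if (s_ready) begin
      m_valid <= s_valid;
      m_data  <= s_data;
    end
  end
endmodule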

2.6.3 AXI Basic Transaction


This section demonstrates the basic workings of the AXI protocol transactions. We will discuss
the VALID and READY handshaking mechanism. The golden rule of the AXI protocol is that
a transfer of either address information or data occurs only when both the VALID and READY
signals are HIGH.
The two-way flow control mechanism enables both the master interface and the slave in-
terface to control the rate at which data and control information move on the AXI
channels. The source unit always generates the VALID signal to indicate that valid
data or control information is available on the channel. The destination unit generates
the READY signal to indicate that it can accept the data or control information.
An AXI transfer, a single exchange of information, is therefore completed with one VALID and
READY handshake, as illustrated below and in Figure 2.20.
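As an illustration of the rule, the write address channel handshake can be sketched in SystemVerilog as below. This is a minimal sketch and not taken from our implementation; new_addr_i and addr_i are assumed inputs from the surrounding design.

// Sketch of the VALID/READY rule on the write address channel.
// new_addr_i and addr_i are assumed inputs from the surrounding design.
module aw_handshake_sketch (
  input  logic        ACLK,
  input  logic        ARESETn,
  input  logic        new_addr_i,   // a new address is ready to be presented
  input  logic [31:0] addr_i,       // assumed address payload
  input  logic        AWREADY,
  output logic        AWVALID,
  output logic [31:0] AWADDR
);
  // A transfer completes on a clock edge where both VALID and READY are high.
  wire aw_transfer = AWVALID && AWREADY;

  always_ff @(posedge ACLK or negedge ARESETn) begin
    if (!ARESETn) begin
      AWVALID <= 1'b0;
    end else if (new_addr_i && !AWVALID) begin
      AWVALID <= 1'b1;     // assert VALID without waiting for READY
      AWADDR  <= addr_i;   // payload must stay stable while VALID is high
    end else if (aw_transfer) begin
      AWVALID <= 1'b0;     // drop VALID only after the handshake completes
    end
  end
endmodule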

Figure 2.20: AXI4 Handshaking

Write Transaction:
The master interface initiates an AXI write transaction by issuing a valid address on the
write address bus, AWADDR, together with all of the control information describing the
nature of the data transfer, and asserts AWVALID to indicate that the address channel carries
valid information. The AWVALID signal must remain asserted until the slave accepts the address
and control information by asserting the AWREADY signal.
Once the address channel has completed its handshake, the master interface can initiate the
data transfer using the data channel signals such as WDATA and WLAST, asserting the
WVALID signal throughout the data transfer across multiple beats. The WVALID signal
remains asserted until all the data has been accepted by the slave interface, which asserts the
WREADY signal.
The slave interface drives the BRESP signal, together with BVALID, once all the valid data
has been accepted, indicating to the master interface that the transaction is complete.
BVALID remains asserted until the master accepts the response by asserting the BREADY signal.
These signals can be visualised in Figure 2.21, which depicts the independent nature of the
channels and the coordination of the address and data flow.

Figure 2.21: AXI4 Write transaction [27]

Read Transaction
An AXI read transaction utilizes the read address and read data channels. The master interface
initiates the transaction by issuing a valid address on the AXI read bus, ARADDR,
together with all of the control information describing the nature of the data transfer, and asserts
ARVALID, indicating that the address channel is valid. ARVALID remains asserted until
the slave accepts the address and control information by asserting ARREADY.
In the AXI protocol, the read transaction differs from the write transaction in that the read
data and the read response are transferred on the same channel. Once the address is captured, the
slave interface sends the valid read data to the master interface via the RDATA and
RRESP signals while asserting RVALID. RVALID is asserted until RLAST
is asserted, indicating the completion of the read data.
The read transaction is depicted in Figure 2.22.

Figure 2.22: AXI4 read transaction [27]

2.7 Pre-Existing Work

Recent advancements in chiplet-based system design have highlighted the importance of


integrating various open-source Die-to-Die (D2D) interconnect PHY/MAC layer vendors into
system-on-chip (SoC) architectures. This integration is crucial for achieving optimal system
performance in terms of data throughput, latency, and consistency across chiplet ecosystems.
Early work in AXI4 integration focused on combining AXI4 with the Advanced Peripheral
Bus (APB) bridge, dating back to 2011 [28]. Subsequent research built upon these foun-
dational implementations, with developments incorporated into larger SoC ecosystems. More
recent studies have explored practical solutions for integrating AXI4 within D2D architectures.
Notably, the paper "Open-Source AXI4 Adapters for Chiplet Architectures" [29] presents a
pioneering approach by demonstrating how open-source AXI4 adapters can be effectively used
to address integration challenges in chiplet-based systems, enhancing communication and in-
teroperability across diverse chiplet designs.
The existing work on adapters is grounded in the AXI-Lite protocol, a streamlined subset
of the AXI (Advanced eXtensible Interface) protocol, designed to offer an efficient method
for accessing and managing peripherals and memory-mapped registers within system-on-chip
(SoC) architectures. The AXI-Lite protocol provides a more simplified interface compared
to the full AXI4 protocol, which helps reduce communication overhead. This design utilizes
the Advanced Interface Bus (AIB2.0) physical interconnect standard, a high-performance,
standardized Die-to-Die (D2D) interconnect bus, and includes a robust adapter that integrates
the AXI interface with AIB2.0 using a credit FIFO mechanism to manage credit exchange and
ensure smooth data transfer.
Our design follows the foundational principles and methodologies of this framework but
introduces several key deviations to enhance performance and adaptability. Firstly, we have
adopted the UCIe (Universal Chiplet Interconnect Express) D2D interconnect bus, which
is emerging as an industry-standard framework offering superior data throughput and lower
latency compared to AIB2.0. Secondly, we have incorporated AXI-based flits alongside a credit
flow mechanism, enabling seamless integration with any flit-based interconnect technologies.
These enhancements achieve a more advanced and versatile integration of the AXI4 protocol
within the D2D ecosystem, addressing specific requirements and optimizing performance in
contemporary chiplet-based systems.

Chapter 3

Design Implementation

In this chapter, we will first present the necessity of an offload engine to integrate the system-
on-chip (SoC) chiplet framework with a Die-to-Die interconnect, specifically UCIe (Universal
Chiplet Interconnect Express). We will then describe the detailed process of designing the of-
fload engine, elaborating on the approach towards developing the subsystems that constitute
the offload engine, such as the AXI interface, protocol engine, and flit construction/deconstruc-
tion. In addition, we will discuss the challenges encountered during the design of these blocks,
providing insights into the solutions and design considerations that were implemented to over-
come these obstacles. This comprehensive overview aims to offer a clear understanding of the
offload engine’s design and its critical role in enhancing the functionality and performance of
the SoC in the chiplet framework.

3.1 Offload Engine Overview

After introducing AXI and UCIe, the task that our Offload Engine has to complete is clear.
The AXI protocol has multiple channels through which transactions are sent and completed in
parallel. The UCIe Protocol Layer is the abstraction layer in which our Offload Engine resides,
and its role is to send chunks of data to the Die-to-Die Adapter in the appropriate
format, along with the proper control signals. Separating the engine into components
yields an AXI Engine, which drives the different AXI4 channels, receiving and
sending multiple streams of data simultaneously; a Protocol Engine, which drives the FDI and
receives data from it while also taking into account the state of the link through the Die-to-Die
Adapter's control signals; and finally the blocks between them, which we have functionally
named Flit Construction and Flit Deconstruction. Subsections below dive into
the details of these modules and the definitions we have created to serve the data translation
process. These definitions are none other than our own original Flit Architecture: a data
packet with distinct headers and bit fields optimized to service AXI. A complete, block-level
overview of our architecture, with each aforementioned component distinctly marked, can be
seen in the block diagram of Figure 3.1.

Figure 3.1: Offload Engine Complete Block Diagram
The key feature of this design, and our main challenge while developing it, was the serialization
of AXI channel data while maintaining low latency and reliability. This
is done by keeping track of the transactions sent on the FDI to the UCIe Link through our
Transaction Table mechanism. We have to accommodate UCIe’s link mechanisms, utilizing
its control signals to apply backpressure to the AXI so that we do not lose any data. A Proto-
col State Machine is needed to mirror FDI’s state machine. Each block applies back pressure
to the block before it with a handshake mechanism. We found this method effective for the
Offload Engine's operation. Our Engine also employs various buffers throughout the design
to minimize data loss and accommodate the different data rates. Further contributing to the
quality of service is an added Credit Flow Control, in the Credit Handling block.
The beauty of the design is its scalability. Key aspects of the engine are parameterized:
functional parameters such as the size and number of the buffers, the AXI data size and AXI ID size,
but also design characteristics such as the maximum number of outgoing transactions, the bit
fields in the header, and the supported AxSIZE values. This makes the design easy to change
when necessary and customizable for different application scenarios. The design we ultimately
implemented is the result of a series of calculations and trial-and-error, which we describe
in the following sections.

3.2 Design

This section provides the in-depth design methodologies incorporated during our development
of the offload engine. We take a deep dive into the major subsystems that
make up the offload engine: the AXI Engine, Flit Construct/Deconstruct, and the
Protocol Engine. We further provide the reasoning behind our design choices, our novel design
implementation in terms of AXI-based Flits, and the challenges faced in all of the sub-systems.

3.2.1 AXI Engine

The AXI Engine is a crucial foundational component in the development of our offload
engine, acting as the essential interface connecting directly with the AXI bus in the system-on-
chip (SoC). This engine ensures efficient and reliable data transfer between various subsystems
within the chip, fully leveraging the robust AXI protocol. Our AXI engine design consists of
parallel AXI Master and Slave Interfaces, each further divided into dedicated write and read
channels. This approach adheres to the AXI protocol’s principles of parallel processing for
both read and write channels, ensuring optimal performance and reliability.

AXI Engine challenges


Understanding the complexities of the AXI4 protocol presents significant challenges. One must
grasp numerous details to avoid common mistakes that can compromise system performance.
For example, correctly managing the VALID and READY signals is vital; improper handling
can lead to data loss or transfer delays. Ensuring that the handshaking process between these
signals operates independently of other conditions is another critical aspect. The design must
prevent scenarios where VALID signals are assigned or de-assigned incorrectly, which could
disrupt the seamless data flow. Addressing these nuances requires a deep understanding of
the protocol’s intricacies and rigorous testing to validate the implementation.

Another common mistake in AXI4 implementation is the improper handling of burst trans-
actions. The AXI protocol supports different burst types, such as fixed, incrementing, and
wrapping bursts, each with its specific use cases. Incorrectly configuring burst parameters
can lead to inefficient memory access patterns and degraded performance. To avoid this, our
design meticulously follows the protocol specifications for burst transactions, ensuring that the
appropriate burst type is used for each scenario and that burst lengths and sizes are correctly
configured.
Managing data alignment is also crucial in AXI4 interfaces. Misaligned data can cause
additional latency and require extra processing to realign the data correctly. Our design
ensures that data is always aligned according to the AXI protocol’s requirements, minimizing
the need for realignment and optimizing data transfer efficiency.
Handling outstanding transactions is another area where mistakes can easily occur. The
AXI protocol allows multiple outstanding transactions, but this flexibility can lead to issues
if not managed correctly. For example, exceeding the allowable number of outstanding trans-
actions can overwhelm the system, leading to stalls and decreased performance. Our design
incorporates throttling logic to restrict the number of simultaneous transactions, ensuring that
the module processes each transaction efficiently and within the system’s capacity.
The AXI master interface within our offload engine is designed to handle high throughput
data transactions by leveraging a robust framework that ensures parallelism and data integrity.
The interface module is structured with independent address and data always blocks, allowing
for parallel processing of address and data signals, thus enhancing overall performance and
reducing latency. Both the AXI Master and Slave Interfaces are designed with separate al-
ways blocks for handling address and data transactions. This separation allows for parallel
processing, enabling the module to issue address signals and transfer data simultaneously.
The address always block manages the generation and processing of address signals within
the offload engine for the AXI bus, handling Address Write (AW) and Address Read (AR)
channels independently. The data always block is responsible for managing the data payload
pushed into the AXI Data bus, handling the Write Data (W) and Read Data (R) channels
separately, ensuring efficient execution of data transfer operations.

AXI Throttling mechanism


To manage and control the flow of data transactions, the AXI Master/Slave Interface incorpo-
rates a throttling logic. This logic is designed to restrict the module from accepting more than
a preset number of outstanding transactions from the AXI bus. By limiting the number of
simultaneous transactions, the throttling logic prevents the module from being overwhelmed,
ensuring that each transaction is processed efficiently and within the capacity of the system.
This mechanism helps maintain a balanced and steady data flow, preventing bottlenecks and
ensuring smooth operation. The throttling logic operates by monitoring the status of ongoing
transactions and controlling the acceptance of new transactions based on predefined thresholds,
ensuring the AXI Master/Slave Interface can handle high traffic loads without performance
degradation.
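A minimal sketch of such a throttle is shown below. The counter, threshold, and port names are illustrative assumptions rather than the exact signals of our implementation: the counter increments when a transaction is accepted, decrements when one completes, and gates acceptance once the preset limit is reached.

// Sketch of an outstanding-transaction throttle (illustrative names).
module axi_throttle_sketch #(
  parameter int MAX_OUTSTANDING = 8
) (
  input  logic clk,
  input  logic rst_n,
  input  logic txn_accepted_i,   // a new AXI transaction handshake completed
  input  logic txn_completed_i,  // a previously issued transaction finished
  output logic accept_allowed_o  // gate for the channel's READY generation
);
  logic [$clog2(MAX_OUTSTANDING+1)-1:0] outstanding_q;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)
      outstanding_q <= '0;
    else begin
      // Track in-flight transactions; accept and complete may coincide.
      case ({txn_accepted_i, txn_completed_i})
        2'b10:   outstanding_q <= outstanding_q + 1'b1;
        2'b01:   outstanding_q <= outstanding_q - 1'b1;
        default: outstanding_q <= outstanding_q;
      endcase
    end
  end

  // Stop accepting new transactions once the preset threshold is reached.
  assign accept_allowed_o = (outstanding_q < MAX_OUTSTANDING);
endmodule

The accept_allowed_o output would then be combined into the READY generation of the corresponding AXI channel.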

AXI Handshaking mechanism


The AXI Master Interface employs a double handshaking mechanism in both the address phase
and data phase as the information is directed from the protocol engine. This ensures reliable
data transfer through the protocol engine where the Flit Deconstruction module initiates the
transaction depending on the availability of the AXI Bus. Transactions are first issued from

the Flit deconstruction module, part of the Protocol Engine, with its handshaking mechanism.
This ensures that data and address information are correctly synchronized and validated before
being processed further in the AXI bus interface. Once validated and issued from the protocol
side, transactions are passed through to the AXI bus using the AXI protocol’s handshaking
mechanism, involving standard AXI signals to ensure that data is transferred correctly and
acknowledged by the receiving end.
The AXI Slave Interface includes response logic that differs from the Master Interface,
requiring responses to be sent on different channels. After completing a write transaction,
the slave sends a response on the Write Response (B) channel, which is separate from the
Write Data (W) channel, ensuring that write operations can continue independently of the
response signals. For read operations, the slave calculates the response and sends it along with
the read data on the Read Data (R) channel, ensuring that data and response are delivered
together, simplifying the read transaction process. The addressing logic in the AXI Slave
Interface computes the next address based on control information received from the Address
Write (AW) and Address Read (AR) signals. This logic ensures an accurate determination of
the target address for each transaction, maintaining data integrity and synchronization.

AXI FIFO mechanism


The AXI Interface utilizes FIFO (First-In-First-Out) buffers to manage incoming transactions
effectively, ensuring data integrity before they are processed into AXI-based flits. FIFO buffers
provide a staging area where entire transactions can be stored, ensuring that all data associated
with a transaction is available and intact, preventing partial data transfers or corruption. This
complete storage allows for accurate and reliable conversion into AXI-based flits. The address
write (AW) and address read (AR) FIFOs store address information with a width matching
the entire data width, temporarily holding address signals before processing. The write data
(W) and read data (R) FIFOs are complex designs consisting of multiple FIFOs capable of
holding different outstanding transactions. They ensure that each data packet is stored and
retrieved efficiently, with a special in-issue FIFO tracking which data FIFOs are in use and
which ones are complete and ready for reading. The write response (B) FIFOs manage the
responses for write transactions, ensuring that the response is sent after the write operation
is completed, maintaining the integrity of the transaction flow.
In conclusion, the AXI Master/Slave Interface design within our offload engine is meticu-
lously structured to handle high throughput data transactions while ensuring data integrity,
efficient processing, and robust performance. The modular approach, independent always
blocks, throttling logic, double handshaking mechanism, and FIFO buffers collectively con-
tribute to a highly optimized and reliable system. By addressing and avoiding common pitfalls
in AXI4 implementation, such as incorrect VALID and READY signal management, improper
burst transaction handling, data misalignment, and mishandling of outstanding transactions,
our design ensures robust performance and reliability, making it a key component in the SoC
architecture.

Figure 3.2: AXI Interface in the Offload Engine framework

3.2.2 Flit Architecture

Early on in the design, it was clear that a Flit specific to our offload engine would have to
be designed. This flit would provide us with all the necessary bit fields to achieve the target
reliability. It is also necessary to properly decode the outgoing and incoming data. For this
purpose, we defined 4 distinct flit types which we will present in this section. We will also
present and discuss the modules responsible for encoding and decoding the information to and
from the flits. The total size of the Flit, stored as the parameter FLIT_CHUNK in our design,
stems from calculations we did based on AXI limitations and UCIe definitions. We will also
discuss these in the following section.

Flit Formats

We define 4 types of flits in our architecture:


• ARAW: Address Read/Address Write Flit
• R: Read Flit
• W: Write Flit
• B: Write Response Flit

Figure 3.3: Standard and AXI Header

The two headers for our flits can be seen in Figure 3.3. All types have the Standard Header
at the head of the flit. The AXI header is reserved only for the ARAW flit, and it is essentially
a collection of all the AXI address read/write signals, which have pre-defined widths by the
AXI specification [27]. The Standard Header however consists of our bitfields:
• TID: Transaction ID. Defines the type of Flit. Its encodings can be seen in Table 3.1.
• Tag: Unique Identifier for a transaction.
• ID: The associated AXI ID.
• Credit: The credit handling is explained in detail in Section 3.2.3.

TID Value Encoding


00b AR/AW
01b R
10b W
11b B

Table 3.1: Transaction ID Encodings

Except for the AXI signals found in the AXI header, every other bit field’s length is a
parameter which may be altered within the bounds of the total header size.
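As an illustration, the Standard Header and the TID encodings of Table 3.1 could be captured in a SystemVerilog package as sketched below. The field widths shown are placeholders, since in the actual design they are parameters; the type and member names are our own illustrative choices.

// Sketch of the Standard Header fields and the TID encodings of Table 3.1.
// Field widths are design parameters; the values below are placeholders.
package flit_hdr_sketch_pkg;

  localparam int TID_W    = 2;
  localparam int TAG_W    = 4;   // placeholder tag width
  localparam int ID_W     = 4;   // placeholder AXI ID width
  localparam int CREDIT_W = 10;  // placeholder credit field width

  // Transaction ID encodings (Table 3.1)
  typedef enum logic [TID_W-1:0] {
    TID_ARAW  = 2'b00,   // Address Read / Address Write flit
    TID_RD    = 2'b01,   // Read flit
    TID_WR    = 2'b10,   // Write flit
    TID_WRESP = 2'b11    // Write Response flit
  } tid_e;

  // Standard Header, present at the head of every flit type.
  typedef struct packed {
    tid_e                tid;     // flit type
    logic [TAG_W-1:0]    tag;     // unique transaction identifier
    logic [ID_W-1:0]     id;      // associated AXI ID
    logic [CREDIT_W-1:0] credit;  // credit return field (see Credit Handling)
  } std_header_t;

endpackage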
AR/AW Flit
The structure of the AR/AW Flit can be seen in Figure 3.4.

Figure 3.4: Address Read/Address Write Flit

From Most Significant Bit to Least Significant, first we use the Standard Header which was
previously defined, then the AXI Header. A single-bit field follows to indicate if the Flit is for
a Read (0) or a Write (1). The rest of the bits are not in use, and they must be 0.
Payload Calculations for Read and Write Flits
The AXI standard provides specifications for the format in which read or write data can be
transmitted. The granularity of transmitted data is defined by AxSIZE, defining the size of a
beat. The number of beats is defined by AxLEN. These two fields are 3 and 8 bits respectively.
But beyond the limits imposed by their bit widths, there is one more limitation: a burst must not
cross a 4 KB address boundary, so a single transaction carries at most 4 KB of data. Because of
this limitation, we know how many transfers are necessary for a transaction of the maximum size.
For a given Standard Header size, and with the FLIT_CHUNK parameter predetermined by the FDI width,
we used the Sequence field (the order of a flit within its transaction) to calculate how many flits
are needed for the combinations of AxLEN and AxSIZE that give the maximum transaction size.
These numbers are defining for our architecture in that they determine the size of buffers
throughout the entire design. In Table 3.2 we summarize these calculations for FLIT_CHUNK
= 64 Bytes.

         AxSIZE    Maximum   Total Flits     Maximum Beats   Total Flits      Maximum Beats
         (bytes)   AxLEN     Needed (Read)   in Read Flit    Needed (Write)   in Write Flit
CASE 1   1         256       6               47              5                52
CASE 2   2         256       10              26              10               26
CASE 3   4         256       19              14              20               13
CASE 4   8         256       37              7               43               6
CASE 5   16        256       86              3               86               3
CASE 6   32        128       128             1               128              1
CASE 7   64        64        128             0.5             128              0.5
CASE 8   128       32        128             0.25            128              0.25

Table 3.2: Flit Calculations for maximum size transactions
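The read-side columns of Table 3.2 can be captured in a small lookup function, with the flit count obtained as the ceiling of the beat count over the beats that fit in one flit. The sketch below hard-codes the table for FLIT_CHUNK = 64 bytes and Read flits only; it is illustrative and not the exact LUT used in the design.

// Sketch of a lookup for the Read columns of Table 3.2 (FLIT_CHUNK = 64 bytes).
package flit_calc_sketch_pkg;

  // Beats that fit in one Read flit, indexed by the AxSIZE encoding
  // (0 -> 1-byte beats, 1 -> 2 bytes, ..., 5 -> 32 bytes).
  function automatic int unsigned read_beats_per_flit(input logic [2:0] axsize);
    case (axsize)
      3'd0:    return 47;
      3'd1:    return 26;
      3'd2:    return 14;
      3'd3:    return 7;
      3'd4:    return 3;
      3'd5:    return 1;
      default: return 1;   // 64/128-byte beats would need the beat splitting of Figure 3.5
    endcase
  endfunction

  // Number of Read flits needed for a transaction of (AxLEN + 1) beats.
  function automatic int unsigned read_flits_needed(input logic [7:0] axlen,
                                                    input logic [2:0] axsize);
    int unsigned beats, per_flit;
    beats    = axlen + 1;                       // AxLEN encodes beats - 1
    per_flit = read_beats_per_flit(axsize);
    return (beats + per_flit - 1) / per_flit;   // ceiling division
  endfunction

endpackage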

For this fixed FLIT_CHUNK we note that in AxSIZE cases 7 and 8 a beat no longer fits within
a single flit, so the 8-bit Sequence field as defined cannot represent the flits of a maximum
data transaction properly. This is why we envisioned the split shown in Figure 3.5, splitting
a beat across flits into halves for case 7 and quarters for case 8. However, we did not have the
time to implement this mechanism in hardware and instead leave it as an extension of this work.

Figure 3.5: Utilizing the Sequence Field for AxSIZE above 32 bytes.

R Flit
The structure of the R Flit can be seen in Figure 3.6.

Figure 3.6: Read Flit

Besides the Standard Header, the payload is preceded by two more fields, the Sequence field
which holds the order number of the flit in the current transaction, as well as a Start/Stop
bit, the encodings of which are shown in Table 3.3 and remain the same for the W Flit. It
is necessary to have this for the receiving engine to know when a transaction has finished, in
other words, which flit is the last in the transaction.

Start/Stop value Encoding


b00 First Flit in transaction
b01 Intermediate Flit
b10 Reserved
b11 Last Flit in transaction

Table 3.3: Start/Stop Encodings

The reason that in Table 3.2 there are different calculations for Read transactions is that
each beat in a Read flit is accompanied by the 2-bit AXI RRESP, or Read Response, field.
Meanwhile, in a Write Flit, each beat is accompanied by the WSTRB, or write strobe, field,
which is of varying size according to the AWSIZE. This leads to different results when it
comes to the number of beats which can fit in a flit.
W Flit
The structure of the W Flit can be seen in Figure 3.7.

Figure 3.7: Write Flit

The fields are similar to those of the R Flit, which were explained above.
B Flit
The structure of the B Flit can be seen in Figure 3.8.
Besides its Standard Header, it carries the 2-bit BRESP write response. Most of the
space in the flit is unused. At the beginning of the flit design process, there were discussions
on combining different Write Responses in one flit, but ultimately the idea was abandoned
because it would lead to a large delay for individual B Flits as well as unnecessary design
complexity.

Figure 3.8: Write Response Flit

Flit Construction Blocks


In the system design, two primary modules are employed to manage and construct flits for
communication between components: the Flit Construct Slave and Flit Construct Master
modules. These modules are integral to the system’s functionality, handling distinct aspects
of data flow and protocol management.
Finite State Model (FSM) for Flit Creation:
In both the Flit Construct Master and Flit Construct Slave modules, the finite state machine
(FSM) logic orchestrates the process of flit generation and transmission with precision. Ini-
tially, both modules enter an idle state, where they remain until valid data is available in the
AXI buffers and the transmission buffer is empty, indicating readiness for data transport. Once
these conditions are met, the module transitions from the idle state to initiate flit generation.
This is achieved by sending control signals to the Transaction Table module, which provides
the necessary tag information for the next stage of the FSM.
In the subsequent flit tag stage, the module incorporates the tag information received from
the Transaction Table, along with additional details from the AXI interface, such as credit
information. This comprehensive data set is then used in the flit creation state, where the
final flit is constructed. The completed flit is subsequently transferred to the transmission
buffer, which interfaces with the D2D interconnect for further data routing. This FSM-based
approach ensures that each step in the flit construction process—ranging from data readiness
checks to the final transmission—is methodically executed, optimizing the overall efficiency
and reliability of the data transfer system.
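A condensed SystemVerilog sketch of this FSM is given below. The state and port names are illustrative assumptions; the real modules carry considerably more bookkeeping (header fields, credits, and the actual flit assembly).

// Sketch of the flit construction FSM: wait for data, fetch a tag, assemble
// the flit and hand it to the transmit buffer. Port names are illustrative.
module flit_construct_fsm_sketch (
  input  logic clk,
  input  logic rst_n,
  input  logic axi_data_valid_i,   // AXI buffers hold a complete transaction
  input  logic tx_buffer_empty_i,  // transmit buffer can take a new flit
  input  logic tag_valid_i,        // Transaction Table returned a tag
  input  logic flit_pushed_i,      // assembled flit accepted by the transmit buffer
  output logic tag_req_o           // request a tag from the Transaction Table
);
  typedef enum logic [1:0] {S_IDLE, S_TAG, S_CREATE} state_e;
  state_e state_q, state_d;

  always_comb begin
    state_d = state_q;
    case (state_q)
      S_IDLE:   if (axi_data_valid_i && tx_buffer_empty_i) state_d = S_TAG;
      S_TAG:    if (tag_valid_i)                           state_d = S_CREATE;
      S_CREATE: if (flit_pushed_i)                         state_d = S_IDLE;
      default:                                             state_d = S_IDLE;
    endcase
  end

  assign tag_req_o = (state_q == S_TAG);

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) state_q <= S_IDLE;
    else        state_q <= state_d;
  end
endmodule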
AXI Slave interface Flit Construct:
The Flit Construct Slave module is dedicated to managing and generating flits from signals
originating at the AXI slave interface, specifically addressing write and read transactions.
This module is essential for converting AXI protocol interactions into a format suitable for
interconnect communication.
For write transactions, the module generates AW (Address Write) and W (Write Data) flits.
The AW flit encapsulates address and control information related to the write operation, while
the W flit carries the actual data to be written. In the case of read transactions, the module
constructs AR (Address Read) flits, which contain information necessary for initiating and
managing read operations. In summary, the Flit Construct Slave module effectively translates
AXI slave signals into structured flits, including AW, W, and AR, thereby enhancing the
system’s data management and communication efficiency.
AXI Master interface Flit Construct:
The Flit Construct Master module is dedicated to managing and generating flits from signals
originating at the AXI master interface, specifically addressing write response and read data
transactions. This module is crucial for converting AXI protocol interactions into a format
suitable for interconnect communication.
For write transactions, the module generates B (Write Response) flits. The B flit en-
capsulates response status and related control information regarding the completion of write
operations. For read transactions, the module constructs R (Read Data) flits, which carry

the data read from memory along with necessary metadata. In summary, the Flit Construct
Master module effectively translates AXI master signals into structured flits, including B and
R, thereby enhancing the system’s data management and communication efficiency.
In conclusion, the Flit Construct Slave and Flit Construct Master modules are pivotal to
the effective management of data flow within the system. They handle the detailed aspects of
flit construction, including formatting, tagging, and synchronization, thereby contributing to
the overall efficiency and functionality of the system.

Flit De-Construction Blocks


The Flit De-construction logic is FSM-based similar to the Flit Construction Block. During
the Deconstruction process, we have the luxury of splitting the work into four separate, parallel
paths, mirroring the 4 separate groups of buffers in the Receiver Engine, as shown in section
3.2.3. Below is an explanation of the four Flit Deconstruction blocks:
• AR Flit Deconstruction is a simple 2-state FD block. In the first state, Idle, it
awaits a signal from the Receiver indicating that the AR buffer has a valid flit to send.
After it receives the data and stores it in the appropriate registers for the different AXI
Address Read signals, it goes into the second state, Send. In Send it waits for a
handshake with the AXI Master Read Block to be performed, after which the data has been sent
to that block successfully and the state can go back to Idle.
• R Flit Deconstruction has 4 states: Idle, one for receiving a new flit from one of the
Receiver's R buffers, one for sending the enclosed beats to the AXI Slave Read Block, and
one last state for sending the last beat of the whole transaction, during which RLAST
has to be raised. The logic behind this block is that once a flit is received from the
Protocol Engine/Receiver, the block also gets the ArSIZE and ArLEN from the corresponding
R buffer. With that information, it checks a LUT to see how many beats can be in a flit,
which we calculated and presented in Table 3.2. The flit data is shifted so that the beats are
read correctly from the buffer that stores the received flit. ArLEN is also decremented so
that we know when to stop, in case the last flit does not hold the maximum
number of beats. The FD block applies back pressure to the R buffer in the Receiver
Engine.
• WAW Flit Deconstruction works similarly to the R Flit Deconstruction, but with
the difference that we transmit two types of flits: AW, and W, to the same AXI Master
Write Block. The AW flit is the first flit we receive from the WAW buffer in the Protocol
Engine, so once it is received by the FD block it must be transmitted first before getting
the rest of the W flits. Afterwards, the AwLEN and AwSIZE data that we need for
transmission of W Flits is given by the AW flit and used in a similar way to shift the
data.
• B Flit Deconstruction works similarly to AR Flit Deconstruction. It only takes one
cycle to store the ID field of the B Flit’s standard header along with the 2-bit BRESP
Write response, and one cycle to transmit them with the appropriate handshake from
the AXI Slave Write Block.

3.2.3 Protocol Engine

The Protocol Engine component of the Offload Engine is the part which communicates
with the Die-to-Die Adapter. It has to partake in the Initialization flow and halt the trans-

mitting/receiving process when the Adapter requires it. It must also allow the AXI engine
to seamlessly transfer data to the AXI bus, delivering the data from FDI to it without inter-
ruptions and in the correct order. We will present the mechanisms which ensure these in the
following sections.
Protocol Engine components are grouped into three different categories: the Transmitter
group, the Receiver group, and a third category consisting of blocks used by both incoming
and outgoing flows. To this category belong the Credit Handling, Transaction Table, and
Protocol Control blocks. We will look at all of these separately.

Protocol Control Block

It is only fitting that we start with the Protocol Engine’s brain, the Protocol State Machine,
carefully enclosed in the Protocol Control Block. The Protocol Control block is connected with
all of the FDI’s signals, except the main data path and its handshake pins, which are connected
to the Transmitter and Receiver. However, not all signals are driven or used. For example,
the sideband is not in use and is outside of the scope of this thesis. Documentation of the FDI
signals which are relevant to the design will first be presented, along with a short explanation.
Then we shall discuss the Protocol State Machine. The signals are defined in the following
Table 3.5 and Table 3.4.
Now that all input and output signals have been defined, we can present the Protocol State
Machine. The PSM has been implemented to mirror and comply with the FDI’s state machine,
as was previously shown in Figure 2.15. Fully complying with the specification, all transitions
we implemented in the PSM involve the appropriate signals. In this work, we will not delve
into all state transitions, but rather only focus on the Initialization Flow. The Protocol State
Machine can be seen in Figure 3.9, including all the signals involved in the different transitions.
1
We do not implement clock gating for low power; however, we do implement the handshake, as recommended by
the spec: implement the handshake logic in the Protocol Layer even if clock gating is not performed, and when
receiving pl_clk_req, simply acknowledge it with lp_clk_ack after one clock cycle. This approach ensures
behaviour compliant with the Adapter and avoids potential errors.

Signal name Signal Description
pl_trdy Signifies that the Adapter is ready to accept data. Is used in a
handshake for the Transmitter.
pl_valid Signifies that the Adapter is sending valid data on pl_data. Is
connected to the Receiver of our engine.
pl_data [NBYTES- Adapter to Protocol Layer data. NBYTES determines the width
1:0][7:0] of the FDI interface. NBYTES is directly related to our design
parameter FLIT_CHUNK.
pl_flit_cancel Signifies that the flit should be dumped by the Protocol Layer.
This comes one cycle after the first transmission on the FDI, which
is why we keep pl_data in an intermediate buffer.
pl_stream[7:0] Adapter to Protocol Layer that indicates the stream ID to use with
data. It has 8 encodings currently in use by UCIe, and each stream
ID value maps to a different protocol for Stack 0 or Stack 1. We
are only concerned with Streaming Protocol, so depending on the
connected Stack the value should be 04h or 14h.
pl_state_sts[3:0] Shows the state of the interface state machine. The encodings are:
0000b: Reset
0001b: Active
0011b: Active.PMNAK
0100b: L1
1000b: L2
1001b: LinkReset
1010b: LinkError
1011b: Retrain
1100b: Disabled
pl_protocol[2:0] Adapter’s indication of the protocol that was negotiated during
training. The value we are expecting is 0111b: Streaming protocol
without Management Transport.
pl_protocol_flitfmt[3:0] Signifies the negotiated Flit Format. The value we are expecting
is 0001b: Raw Format.
pl_protocol_vld Indicates that pl_protocol and pl_protocol_flitfmt have valid in-
formation. When this is high, those signals must be stored. This
signal is also a part of our PSM’s initialization process.
pl_inband_pres Signifies that the Die-to-Die Link has finished parameter negotia-
tion. Part of the Initialization process.
pl_stallreq Adapter request to the Protocol to Flush all flits and not prepare
any new Flits. Accompanies a transition to a low-power state
triggered by the Adapter.
pl_wake_ack Signifies that clocks have been ungated in the Adapter. Is part of
the Initialization process.
pl_rx_active_req Signifies that the Protocol should open its Receiver Path to receive
new Flits. Is only valid within the Reset, Retrain, or Active state.
pl_clk_req Request from the Adapter to the Protocol Layer to remove clock
gating from its logic 1 .

Table 3.4: Signals from the Die-to-Die Adapter to our Protocol Engine

Signal name Signal Description
lp_valid Signifies that the Protocol Layer is sending valid data on lp_data.
Is generated from the Transmitter of our engine.
lp_irdy Signifies that the Protocol Layer potentially has data to send, is
asserted together along with lp_valid.
lp_data [NBYTES- Protocol Layer to Adapter data. NBYTES determines the width
1:0][7:0] of the FDI interface. NBYTES is directly related to our design
parameter FLIT_CHUNK.
lp_stream [7:0] Protocol Layer indicates to the Adapter the stream ID to use with
data. We have tied this signal to 04h: Stack 0, Streaming Protocol.
lp_state_req[3:0] Protocol Layer request to the Adapter to change state.
The encodings are as follows:
0000b: NOP
0001b: Active
0100b: L1
1000b: L2
1001b: LinkReset
1011b: Retrain
1100b: Disabled
lp_linkerror Protocol Layer to Adapter indication that an error has occurred
which requires the Link to go down, essentially requesting the state
to change to LinkError. We have implemented the logic for our
design but do not define a use case for it.
lp_rx_active_sts Response to pl_rx_active_req to show the Receiver is enabled.
lp_wake_req Request from the Protocol Layer to the Adapter for it to remove
clock gating from its logic.
lp_stallack Response to pl_stallreq. Is asserted during the transition to a
low-power state.
lp_clk_ack Response to pl_clk_req.

Table 3.5: Signals from our Protocol Engine to the Die-to-Die Adapter

Figure 3.9: Protocol State Machine

For the Initialization flow, we have defined 3 states which aid in the process; they are driven
by the handshakes and signal exchanges that have to be completed before FDI and our engine
can enter the Active state. During the Active state, the Transmitter and Receiver are sending
can enter the Active state. During the Active state, the Transmitter and Receiver are sending
and accepting flits respectively. The Initialization Flow is as follows:
1. Adapter notifies the Protocol Layer that parameter exchange has finished
by asserting the signals pl_protocol_vld and pl_inband_pres. This is the beginning of
Part 3 of Stage 3 of the entire flow we explained in 2.5.3. These assertions trigger the
PSM to enter the R_Active_Entry_Wake state in which we raise lp_wake_req and
request an Active state on lp_state_req.

2. Adapter has removed clock gating after our request. This is indicated by the
assertion of pl_wake_ack which triggers our PSM to enter R_Active_Entry, during
which it waits for the Adapter to reflect Active in pl_state_sts.
3. Adapter and PSM are both in Active state after pl_state_sts has switched to
Active. In this state, credit flow control, transmitting and receiving are enabled in the Protocol
Engine.
At the end of this flow, known as the FDI Bring-up Flow, our engine is in Active mode
and is sending and receiving flits to and from the Adapter.
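A reduced SystemVerilog sketch of this bring-up sequence, covering only the three steps above, is shown here. The FDI signal names follow Tables 3.4 and 3.5; the internal state names are ours, the encodings follow the tables, and the handling of lp_wake_req and lp_state_req is simplified relative to the full PSM of Figure 3.9.

// Sketch of the FDI bring-up portion of the Protocol State Machine.
// Only the initialization path is shown; all other FDI transitions are omitted.
module psm_bringup_sketch (
  input  logic       lclk,
  input  logic       rst_n,
  // from the Die-to-Die Adapter (Table 3.4)
  input  logic       pl_inband_pres,
  input  logic       pl_protocol_vld,
  input  logic       pl_wake_ack,
  input  logic [3:0] pl_state_sts,
  // to the Die-to-Die Adapter (Table 3.5)
  output logic       lp_wake_req,
  output logic [3:0] lp_state_req,
  output logic       engine_active_o   // enables credit flow, Transmitter and Receiver
);
  localparam logic [3:0] STS_ACTIVE = 4'b0001;
  localparam logic [3:0] REQ_NOP    = 4'b0000;
  localparam logic [3:0] REQ_ACTIVE = 4'b0001;

  typedef enum logic [1:0] {RESET, R_ACTIVE_ENTRY_WAKE, R_ACTIVE_ENTRY, ACTIVE} psm_e;
  psm_e psm_q;

  always_ff @(posedge lclk or negedge rst_n) begin
    if (!rst_n) psm_q <= RESET;
    else begin
      case (psm_q)
        // 1. Adapter signals that parameter exchange has finished.
        RESET:               if (pl_inband_pres && pl_protocol_vld) psm_q <= R_ACTIVE_ENTRY_WAKE;
        // 2. Adapter acknowledges the wake request (clocks ungated).
        R_ACTIVE_ENTRY_WAKE: if (pl_wake_ack)                       psm_q <= R_ACTIVE_ENTRY;
        // 3. Adapter reflects Active in pl_state_sts.
        R_ACTIVE_ENTRY:      if (pl_state_sts == STS_ACTIVE)        psm_q <= ACTIVE;
        ACTIVE:              ;   // flit transfer and credit flow control enabled
      endcase
    end
  end

  assign lp_wake_req     = (psm_q == R_ACTIVE_ENTRY_WAKE) || (psm_q == R_ACTIVE_ENTRY);
  assign lp_state_req    = (psm_q == RESET) ? REQ_NOP : REQ_ACTIVE;
  assign engine_active_o = (psm_q == ACTIVE);
endmodule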

The Transmitter

The existence of the transmitter buffers is inspired by PCIe's similar transmit structure.
These buffers hold flits until transmission is permitted, as a way to prevent data loss.
The signal that enables transmission is a combination of a transmit enable from the PSM, and
a transmit enable from the Credit Handling block presented in section 3.2.3. There are two
buffers in the Transmitter, which have their respective Credit Handling Blocks.
Transmit Buffer TX_S, connected to the AXI Slave interface Flit Construction Block.
This Buffer stores AW, W, and AR flits. The total size of this buffer is equivalent to the
total size of the AR, and WAW Buffers in the Receiver of the Offload Engine in a symmetric
connection: a symmetric connection is the scenario where our Offload Engine is connected to
an Offload Engine with the same buffer sizes and tag spaces. This is the only scenario which
we tested. However, in a theoretical asymmetric connection, the transmit buffers would have
to reflect the size of the remote Offload Engine’s Receiver buffers.
Similarly, Transmit Buffer TX_M is connected to the AXI Master Interface Flit Con-
struction Block. In this buffer, B and R flits are stored.
TX_M_SIZE = R_RECEIVER_BUFFERS*MAX_R_TRANSACTION_SIZE +
B_RECEIVER_BUFFER_SIZE

TX_S_SIZE = WAW_RECEIVER_BUFFERS*(MAX_W_TRANSACTION_SIZE+1) +
AR_RECEIVER_BUFFER_SIZE

For the cases we tested, the sizes that were used were 516 and 520 flits wide respectively.
This was to reflect a maximum transaction size of 128 and a tag space/transaction throttle of
8.
The Transmitter is connected to the FDI. In our design, we were working with an FDI width of
128 bytes to accommodate the raw transmission of two 64-byte flits simultaneously. The Transmitter
will signify to the FDI that a new flit is to be transmitted whenever either TX_S or TX_M
has flits to send and transmission for them is enabled. If one transmission buffer is to transmit
and the other isn’t, the empty half would be filled with a DEFAULT_DATA field which in
turn would be detected by the remote Offload Engine’s Receiver and discarded (similar to
UCIe’s NOP flit). This emphasizes the symmetrical connection between two Offload Engines.
The handshake between the Transmitter and the FDI is defined by UCIe specification and
can be seen in Figure 3.10.

Figure 3.10: Data Transfer between the Transmitter and the FDI, or the Protocol Layer and
the Adapter as shown in UCIe Specification [26].

In our implementation, lp_irdy is enabled when the Offload Engine is in an Active state
and lp_valid when the Transmitter has valid data to transfer. Once pl_trdy is asserted and
the Adapter has accepted the data, the respective Transmit buffers are updated and so is their
Credit Handling.
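In SystemVerilog terms, the transmit-side rule reduces to something like the sketch below. The lp_*/pl_* names follow the FDI definitions of Tables 3.4 and 3.5, while the remaining ports are assumed internal signals of the Transmitter.

// Sketch of the Transmitter's FDI handshake. A flit transfer to the Adapter
// completes in a cycle where lp_irdy, lp_valid and pl_trdy are all asserted.
module fdi_tx_handshake_sketch #(
  parameter int NBYTES = 128           // FDI width in bytes; fits two 64-byte flits
) (
  input  logic                   protocol_active_i, // PSM is in the Active state
  input  logic                   tx_flit_valid_i,   // a flit is staged for transfer
  input  logic [NBYTES-1:0][7:0] tx_flit_data_i,    // staged flit data
  input  logic                   pl_trdy,
  output logic                   lp_irdy,
  output logic                   lp_valid,
  output logic [NBYTES-1:0][7:0] lp_data,
  output logic                   tx_pop_o           // pop transmit buffer / update credits
);
  assign lp_irdy  = protocol_active_i;
  assign lp_valid = protocol_active_i && tx_flit_valid_i;
  assign lp_data  = tx_flit_data_i;

  // The Adapter has accepted the data when all three signals are high.
  assign tx_pop_o = lp_irdy && lp_valid && pl_trdy;
endmodule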

The Transaction Table

The need for a transaction recording mechanism arose from the AXI specification. In AXI,
two transactions with the same ID may be of a completely different type with a completely
different destination. The lack of restrictions on the AXI ID led us to the construction of the
Tagging System.
The Tagging System's purpose is to keep track of transactions until they are completed. It is
consistent with the limit on the number of outbound transactions the AXI engine can
manage. From AXI, the only identifier we have for a transaction besides its type is the ID,
which may be reused by different transactions. There might therefore be a conflict between the
transactions being sent from our engine, and also between the transactions being sent
and received. To solve this issue we introduce the notion of a tag, a unique identifier for each
transaction sent through the UCIe link. As we saw in Figure 3.3, every type of flit has the
tag bit field in its header.
We use the term Transaction Table as a storage for the information of our outbound and
inbound transactions.
In this table, we store:
• Tags of outbound and inbound transactions
• IDs so that transactions from inbound AXI can be identified and tagged
• ArSIZE, ArLEN so that R transactions can be properly decoded or encoded
When a transaction is received, it will be checked on the Transaction Table before it is
accepted to the Rx buffer. The number of Rx buffers is also equal to the number of distinct
tags an offload engine can support, which is equal to the number of outbound transactions an
offload engine can support. The tags are used for addressing the transaction table. This is
chosen so that searching in the table does not take multiple clock cycles.
The table is partitioned into four sub-units of storage, and an entry to the transaction table
goes into one of these units depending on if it is inbound or outbound, and Read or Write.

The sub-units are then defined as follows:
• Tx_R, for outbound Read transactions
• Tx_W, for outbound Write transactions
• Rx_R, for inbound Read transactions
• Rx_W, for inbound Write transactions
When a transaction is completed, either by AXI data arriving at the Flit Construction block or by a flit arriving at the Receiver, the entry is erased by setting its dirty bit to 0, so that the slot, and therefore that tag, can be freed and used again.
The architecture and form of the Transaction Table were conceived from the need to keep separate storage and tag spaces for Write and Read transactions so that there is no overlap between their tags. The tag space of one Offload Engine is mirrored in its remote link partner.
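To make the bookkeeping concrete, the following is a minimal software model of one sub-unit of the table, assuming a tag-indexed array with a dirty bit per entry; the field set is illustrative and does not match the RTL one-to-one.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Entry:
    dirty: bool = False    # entry (and therefore its tag) is in use
    axi_id: int = 0        # AXI ID, so inbound transactions can be matched back to the bus
    arsize: int = 0        # only meaningful for Read entries
    arlen: int = 0

class TransactionSubTable:
    # One of Tx_R / Tx_W / Rx_R / Rx_W: a tag-indexed array, so a lookup takes one cycle.
    def __init__(self, tag_space: int):
        self.entries: List[Entry] = [Entry() for _ in range(tag_space)]

    def allocate(self, axi_id: int, arsize: int = 0, arlen: int = 0) -> Optional[int]:
        for tag, entry in enumerate(self.entries):   # lowest free tag wins
            if not entry.dirty:
                self.entries[tag] = Entry(True, axi_id, arsize, arlen)
                return tag
        return None                                   # outbound transaction limit reached

    def complete(self, tag: int) -> None:
        self.entries[tag].dirty = False               # frees the slot and the tag for reuse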
An example of different transactions and their completions is shown in Figure 3.11.

Figure 3.11: 3-stage Tagging System Example
In this example, we show how the system handles multiple transactions with the same ID,
and how they are stored and completed in the Transaction Table.

Credit Handling

Credit Flow Control for our Offload Engine is implemented in two instantiations of the same block, one for each of the two distinct Transmit buffers. We follow methodologies used in PCIe [30], which we will explain below.
Every Credit Control block has a register which counts Credits Consumed. This is the number of credits consumed in the remote Engine's Receiver Buffers, or in other words an estimate of how much data we have sent to, and is currently stored in, that buffer. In a more complex system where packets are of varying sizes, they can consume different amounts of credits, but since our flits are all of the same size, a credit-to-flit ratio of 1 is sufficient.
Thus, during reset, the credits consumed counter is initialized to 0 and its maximum limit is
TX_M_SIZE or TX_S_SIZE as described in the Transmitter Section 3.2.3. Before a flit is
sent to the FDI, a calculation is made:

cumulative_credits_required = (credit_limit − (credits_consumed + 1)) mod 2^CREDIT_FIELD

Cumulative Credits are an updated estimation of the remote partner’s credits after trans-
mission. CREDIT_FIELD is the credit field’s size parameter. The next inequality check is
what enables transmission for the Transmit Buffers:

cumulative_credits_required ≤ 2^CREDIT_FIELD / 2
If it holds, transmission is enabled for the associated buffer. The right-hand side of the inequality is not 0 because, while the subtraction for the cumulative credits is implemented in 2's complement, the comparison is unsigned: a negative result wraps around to a large unsigned value, so a comparison against 0 would not catch the case where credits have run out, whereas a comparison against half the credit range does.
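A minimal sketch of this gate, assuming the one-credit-per-flit scheme and a CREDIT_FIELD of 12 as used later in the design; it is a software model of the check, not the RTL.

CREDIT_FIELD = 12
MOD = 1 << CREDIT_FIELD

def transmission_enabled(credit_limit: int, credits_consumed: int) -> bool:
    # One flit costs one credit, so the next transmission would consume credits_consumed + 1.
    cumulative_credits_required = (credit_limit - (credits_consumed + 1)) % MOD
    # A "negative" result wraps into the upper half of the range and disables transmission.
    return cumulative_credits_required <= MOD // 2

# Example: with 516 credits advertised, the 516th in-flight flit is still allowed,
# the 517th is not.
print(transmission_enabled(516, 515), transmission_enabled(516, 516))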
The Credit handling block holds another counter for Credits Allocated: this is a real
representation of the credit space in the Offload Engine’s Receiver buffers. It is initialized
to either TX_M_SIZE or TX_S_SIZE. Whenever a flit is received from the FDI of the
appropriate type (AW/W/AR or R/B), this amount is decremented. Whenever a flit exits
either the AW/W/AR or R/B buffers and is sent to Flit Deconstruction, the associated Credits
Allocated counter is incremented. This value is available to Flit Construction and is placed
in the Credit field found in the Standard Header of all flits. Outgoing R/B flits carry the transmitting Offload Engine's Credits Allocated value from Credit Control M, and AW/W/AR flits carry that of Credit Control S. When received at the remote Offload Engine, these values are used to update that Engine's Credits Consumed. This is how the estimate is updated and remains true to the actual state of the remote partner. In PCIe, credit updates are done through messages; for UCIe this could be done through the sideband, but we do not utilize it in this implementation.
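For completeness, the Credits Allocated side can be modelled in the same spirit; again a sketch under the one-credit-per-flit assumption, with method names of our own choosing.

class CreditsAllocated:
    # Tracks free space in the local Receiver buffers of one type (AW/W/AR or R/B).
    def __init__(self, initial: int):
        self.value = initial        # TX_M_SIZE or TX_S_SIZE at reset

    def flit_received_from_fdi(self) -> None:
        self.value -= 1             # a flit of the matching type now occupies a slot

    def flit_sent_to_deconstruction(self) -> None:
        self.value += 1             # the slot is free again

    def advertise(self) -> int:
        # Placed in the Credit field of the Standard Header of outgoing flits,
        # so the remote partner can refresh its Credits Consumed estimate.
        return self.value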

The Receiver

The Receiver is the largest block of the entire Protocol Engine. We will explain its function by going through the different stages that a flit passes through upon arrival from the FDI.
• Flit Arrives and is stored in the Intermediate Buffer. This allows the Adapter
to cancel a flit. Flit cancel happens when a piece of the flit does not pass the CRC check
in the Adapter, and the signal is asserted after the flit chunk has been sent on the FDI.
An example provided by the UCIe spec is shown in Figure 3.12 where the second half
of the latter two 64-byte flit chunks do not pass the CRC check. The flit cancel signal
is asserted again for the first half of the flit to avoid sending the first half twice. This
mechanism is the reason we store the FDI transmission in an Intermediate buffer.
• Flit is separated into M and S for Transaction Table Check. The flit is divided
between AW/W/AR and R/B flits as they were transmitted by the remote Offload
Engine and stored in two intermediate buffers. Their tags are sent to the Transaction
Table so a check can be made: if the received flit is a completion to previously transmitted
flits then the Transaction Table will be updated. Otherwise, if the incoming flit’s tag
is not in use, it can be recorded in the Transaction Table’s inbound transactions. If it
does not fit either criterion, the flit has to be discarded. In this stage, the data in the
intermediate buffers is also checked for validity by comparing it to DEFAULT_DATA
as we explained in 3.2.3.
• Transaction Table Check Done. Look-up in the transaction table takes one cycle.
If the Flit is valid in the tag space, it is allowed to pass through to the Receiver Buffers.
In the case of an incoming R Flit, the Transaction Table also returns the ArSIZE, and
ArLEN so that they may be stored alongside the data in the buffers and used in Flit
Deconstruction to decode the flits.
• Stored in Flit Buffers. We define 4 types of Receiver Buffers:
– AR Buffer. Stores incoming AR flits. It is implemented similarly to the Tx buffers,
as a FIFO (First-In-First-Out).
– WAW Buffers. Store Incoming AW and W flits. The number of WAW buffers
is defined by the tag space: each WAW buffer corresponds to one tag value. The
decision to store the AW Flits along with the corresponding W Flits was made to maintain reliability: once the transmission of W data towards the AXI engine starts, it may not be interrupted. To uphold this rule we implemented the following mechanism. A WAW Buffer is MAX_W_TRANSACTION_SIZE + 1 entries deep. At address 0, the AW flit with the specific tag is stored; the incoming W flits are then stored at addresses given by their Sequence field. Thus, the whole transaction is assembled in order, which counters any re-ordering that may happen on the Link and also ensures that the transaction is pushed out properly.
– R Buffers similarly store the R transaction, utilizing the Sequence field to keep
the order of transmitted data. Alongside the data, ArLEN and ArSIZE are stored.
– B Buffers store Write Response Flits, implemented as a FIFO.
The R and WAW Buffers operate with an FSM. In one state, they are filling up and
assembling the transaction. When the transaction is completed, they go into a pushout state, during which they must push out their data to the corresponding Flit Deconstruction Block, uninterrupted.

Figure 3.12: Flit Cancel example [26].

• Flit Deconstruction and Arbitration. For AR and B Flits, it is fairly simple to connect them to their corresponding FD blocks. For the R and WAW Buffers, of which there are multiple (one per tag), arbitration must take place. On every cycle in the Receiver Engine, the buffers are checked for completed transactions; if a buffer has one, it claims the appropriate FD. The check is done with numerical tag priority, starting from tag 0, because tags are assigned in the same order by the Transaction Table, so the priority is mirrored here (a short sketch of this arbitration follows at the end of this section).
This concludes the function of the Receiver Engine as flits exit the block and the Protocol
Engine altogether, making their way to Flit Deconstruction and ultimately to the AXI blocks.
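A minimal sketch of that tag-priority pick, assuming one completion flag per WAW/R buffer; the function name and flag representation are ours.

from typing import Optional, Sequence

def pick_completed_buffer(complete_flags: Sequence[bool]) -> Optional[int]:
    # Return the lowest tag whose WAW/R buffer holds a fully assembled transaction,
    # mirroring the order in which the Transaction Table hands out tags.
    for tag, complete in enumerate(complete_flags):
        if complete:
            return tag
    return None    # no buffer is ready to claim a Flit Deconstruction block this cycle

# Example: buffers for tags 1 and 3 are complete; tag 1 wins arbitration this cycle.
print(pick_completed_buffer([False, True, False, True]))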

3.3 Challenges

In this section, we will discuss the design challenges we encountered and the design choices
implemented by looking at early iterations of the major aspects of the design, the Offload
Engine architecture and the AXI-based flits.
The early iterations of the offload engine focused solely on bridging the functionality of the offload engine with the AXI protocol framework inside the System-on-Chip (SoC) and the Universal Chiplet Interconnect Express (UCIe).

3.3.1 Architecture Development

We will first focus on the AXI interface. The AXI blocks were initially developed crudely to cater to the AXI channel signals from the AXI bus. Both the AXI Master interface and the AXI Slave interface shared the same AXI FSM handshaking mechanism, with a Configuration Space block coordinating between the two interfaces. The challenge with this design was that it did not utilize the AXI4 parallel address/data channels for both the write and read transactions. Subsequent iterations of the AXI blocks therefore focused on developing separate AXI interfaces with independent per-channel AXI handshaking mechanisms, as the AXI4 protocol demands: write and read transactions have separate address, data and (in the case of writes) response channels that synchronize their signals but operate independently, increasing flexibility and scalability and optimizing latency. The AXI channel buffers were added to the early iteration of the design to maintain the parallel processing logic generated by the AXI interface; transactions are stored and held in these buffers until they are moved to the Flit Construct (FC) block, where the AXI-based flit is created and pushed out to the FDI (Flit-Aware D2D Interface). These changes to the AXI interface were crucial in moving the design closer to comprehensive AXI4 protocol support.
The Protocol Engine has undergone significant change in its development through its several
iterations from the initial one seen in Figure 3.13. The general structure of this protocol engine
was clear early on, borrowing from pre-existing architectures such as PCIe. One of the main
challenges was limiting area by reducing buffer sizes, only keeping that which was necessary
while maintaining a coherent structure, meaning, the credit flow control had to reflect this
size. Beyond this, we were called to solve problems unique to our design case. The AXI
protocol requires data to be delivered in a stream that may not be interrupted. To overcome
this challenge, we constructed two mechanisms: the Transaction Table and the architecture
of the Receiver Buffers. The Transaction Table ensures the completion of AXI transactions,
while the indexed nature of the Receiver buffers makes sure the stream of data towards the
AXI is uninterrupted. Separation of Receiver Buffers according to the type of flit content also
aided in this solution. The reader may refer back to Figure 3.1 and compare it to the initial design to get a sense of the requirements that needed to be met.
The Flit Construct/Deconstruct blocks are the final blocks that make up the offload engine. Both were designed around a single input from the AXI interface side and from the protocol side. The development of the AXI interface and the Protocol Engine described above created challenges in funnelling the parallel processing logic of the surrounding blocks into the Flit Construct/Flit Deconstruct blocks. This was resolved by creating multiple instances of the blocks and widening the data path towards the D2D interconnect on the FDI bus.

Figure 3.13: Initial Offload Engine Design

3.3.2 Flit Development

Figure 3.14: Initial Flit Architecture

The AXI-based flits were initially developed with the types of AXI transactions in mind, as well as the credit system for the receiver and transmission buffers, which is where the virtual channel (VC ID) field was added to the flit structure. The challenge was managing the overhead-to-payload ratio as well as the lack of structure, which we had to innovate upon. The first generation of flits lacked a unified Standard or AXI field, which aids the decoding of a flit. We overcame these challenges by adding fixed Standard and AXI headers to the flits. Furthermore, the addition of the Transaction Table brought more dynamics to the flit, such as the Start/Stop bit and a tag field.
It was also realized that the sequence number of a flit had no explicit correlation to AxLEN. At this point, we made the necessary calculations, given in Table 3.2. After establishing a strict Standard Header size, we made the appropriate trade-offs within the header to finalize the flits shown in Section 3.2.2.

Chapter 4

Results

In this chapter, we will first discuss the objective that we are trying to achieve through our
comprehensive verification process. We will then go through the detailed process of verification
from the block-level to system-level verification methods. Additionally, we will present the
supportive scripts which aided in the verification process. This overview aims to explain to
the reader the methods which were used to examine and complete the design.

4.1 Verification

The Offload Engine serves as the critical integration unit between the System on Chip (SoC) and the die-to-die package, playing a pivotal role in the chiplet architecture. In the
larger context of the chiplet architecture, the Offload Engine is responsible for replicating
signals from the AXI bus on one side to the other side within a different chip, ensuring
seamless communication and data transfer across the chiplets.
To validate the functionality and performance of the Offload Engine, we adopted a multi-
tiered verification approach, starting with block-level verification and culminating in system-
level verification.

4.1.1 Block Verification

The two distinct domains of the Offload Engine led to diverse verification methods through-
out the whole design. Below we present the different methods used for verification in these
two domains, AXI and Protocol, as well as the bridging FC and FD modules.

AXI Engine Domain


The AXI Engine is structured in a nested format where the AXI Master interface and AXI Slave interface contain their respective internal blocks. The read and write modules of the respective AXI Master/Slave interfaces are integrated with their AXI channel buffers, which accumulate data from the AXI side before it is moved towards the Flit Construct (FC) block for flit construction. The blocks inside the interface are therefore highly interconnected, which required step-by-step block verification to ensure that the module is robust and that data integrity is maintained across the various blocks.

In the beginning, AXI block-specific test benches were developed to test the AXI core blocks. The Advanced eXtensible Interface (AXI) is a complex communication framework within the SoC which communicates between multiple subsystems. To test the AXI protocol extensively, a set of distinct test cases must be implemented to cover all the corner cases. The following cases were verified for the distinct write and read transactions:
• Single Address and single beat data transfer
• Single address and multiple beat data transfer (Burst mode)
• Multiple addresses and multiple beat data transfer
• Exhaustive transaction transfer to verify the outstanding transaction logic
These test benches exhaustively verify the different transactions that can take place on the write and read channels, as well as the timing accuracy of the AXI handshaking mechanism.
The AXI channel buffers are the next unique block in the AXI Engine; they are verified to correctly take in and hold the AXI address/data information in the respective channel FIFO buffers. Data integrity checks are especially important for the read-data and write-data buffers, as these are arrays of FIFO buffers tracked by an external InUse buffer. The testbench therefore exercises multiple write and multiple read operations both back to back and simultaneously. To avoid losing parts of a burst, the logic may only access the stored data once the last beat of the burst has been written to the buffer, ensuring no data is lost during read-out.
The next-stage test benches encapsulate the AXI Read/Write blocks and AXI buffers of the AXI Master/Slave Interfaces to verify the AXI flow from the AXI bus interface to the Flit Construct block. These test benches were important for correctly driving the large number of AXI signals through the blocks, and for confirming that the design can maintain and store information in the respective channel buffers with no external stimuli in the AXI logic. The final AXI Engine stage testbench encapsulates the AXI Read/Write blocks, AXI channel buffers and the Flit Construct (FC) blocks together to exercise the end-to-end path from AXI signal to AXI flit. This helps establish the integrity of the data transfer through the FDI for D2D interconnects such as UCIe.
Testing the various core AXI blocks and building the testbench up to the AXI interface
significantly improved our understanding of how all the blocks function together, leading to
the creation of a robust AXI interface module.

Protocol Engine Domain


A large number of blocks in the Receiver engine have a Flit, or FLIT_CHUNK sized data,
as their input. So from the very beginning, it was evident that supporting scripts would be
needed to generate this type of binary data for us to feed it into the individual blocks. We
shall begin by explaining the relevant script after which we will focus on blocks in the same
way they are presented in section 3.2.3, in the order that we verified them.
Flit Stimulus Generator
An elaborate Python script was necessary for us to generate flit data for verification. In flit_stimulus_generator.py we define classes for the Standard Header and the AXI Header, and with these we ultimately construct a Flit class. All the fields seen in our flits in subsection 3.2.2 are class attributes which can be assigned. The class has a function named binarify() which translates an object's attributes into binary code based on their AXI encoding. The data of an R or W flit is stored as a list of beat values, which are also translated into binary code in binarify() based on the flit's AxSIZE. Utilizing the above, we can generate multiple transactions of any type. Figure 4.1 shows how the script data is generated and then used in one of our test benches.

Figure 4.1: How Flit stimulus is generated and used in block level verification

The script can output FDI-length (128-byte) data for verifying the Receiver Engine, with DEFAULT_DATA padding if necessary. It can also scramble the order of the flits to simulate re-ordering of data on the UCIe Link. In addition, it can output FLIT_CHUNK-sized (64-byte) data for verifying the Transaction buffer or the Flit Deconstruction blocks. Throughout the verification process, the flit stimulus generation script became highly versatile.
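The listing below is not the actual flit_stimulus_generator.py but a much simplified sketch of the same idea: header fields held as class attributes and a binarify() method that emits a bit string. The field names and widths shown are illustrative assumptions.

from dataclasses import dataclass
from typing import List

def to_bits(value: int, width: int) -> str:
    return format(value & ((1 << width) - 1), f"0{width}b")

@dataclass
class StandardHeader:
    flit_type: int = 0   # AW/W/AR/R/B encoding (assumed width)
    tag: int = 0
    sequence: int = 0
    credit: int = 0

    def binarify(self) -> str:
        return (to_bits(self.flit_type, 3) + to_bits(self.tag, 4) +
                to_bits(self.sequence, 8) + to_bits(self.credit, 12))

@dataclass
class Flit:
    header: StandardHeader
    payload_beats: List[int]

    def binarify(self, axsize: int) -> str:
        beat_width = 8 * (1 << axsize)   # AxSIZE encodes 2^AxSIZE bytes per beat
        payload = "".join(to_bits(b, beat_width) for b in self.payload_beats)
        return self.header.binarify() + payload

# Example: a 3-beat R-style flit with an incrementing payload, as used in the Receiver tests.
print(Flit(StandardHeader(flit_type=4, tag=1), [0, 1, 2]).binarify(axsize=2))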
Verifying the Receiver
Verifying the receiver meant verifying all the steps explained in section 3.2.3. Initially, when
the Transaction Table had not been verified yet, it was simulated via a task on the Receiver’s
testbench. Using the flit stimulus generator script, the following cases were verified:
• Two consecutive Write Data Transactions of a varying number of Flits
• Two consecutive Read Transactions of a varying number of Flits
• Multiple B and AR transactions
• Multiple Read, Write, B and AR Transactions
• Multiple Read, Write, B and AR Transactions Scrambled
The payload of the Read and Write transactions was simply an incrementing integer, starting from 0 and ending at AxLEN; this helped us recognize the proper function of the Receiver Engine. Gradually building up to more complex cases helped in verifying the individual blocks of the Receiver, namely the Receiver buffers. Though not verified to an industrial level, we can now confidently say that the Receiver is equipped to handle multiple transactions of any type.
Verifying the Transmitter
During the process of verifying the transmitter it was important to see two things:

• Proper Function of the Transmission Buffers, which were implemented using a common
block called Flit_Buffer.sv as a FIFO. This block is also used in the Receiver Engine for
AR and B Flits.
• Proper assertions of control signals to the FDI and Credit Handling
Once these were ensured, we could move on to the next block, and later to the verification of the whole system. For this purpose, we proceeded straight to the multiple-type flit case used for the Receiver, which proved the correct functionality of the block.
Verifying the PSM
For the Protocol Control Block, we focused on the Bring Up Flow, thus, it was necessary
to drive the PSM properly and according to the UCIe Specification as an Adapter would.
Complying with the correct signal transitions as described by UCIe we were able to see the
PSM transition from Reset to Active state and assert the appropriate signals.
Verifying Credit Handling
For this Block, it was important to see:
• credits being exhausted, and transmission being disabled.
• Proper updating of the Credits Allocated from a remote partner
A simple task-based testbench was needed to toggle the control signals. During this process,
we also evaluated the equations brought from the PCIe specification and understood their
operational details. The relation between credits allocated, credits consumed and the size of
Transmission Buffers was also evaluated. In an older flit design the CREDIT_FIELD parameter was 8, which was not sufficient for the size parameters we were working with; we therefore changed it to 12, reducing the tag field. After all the changes, it was confirmed that the Credit Handling blocks functioned properly.
Whole Protocol Engine Verification
Once all the above blocks were verified, the entire Protocol Engine was put together with the FD to ensure proper connections for the particular domain. From FDI to FD, output was verified. This made for a smoother transition to whole System verification.

Cross-Domain
Verifying the Transaction Table
While the transaction table was still being developed, a simulation was needed to truly
understand the different scenarios and the structure of the tagging system. We developed a
transaction emulator to aid us in the development of the Tagging System, which can be seen
in Figure 4.2.

Figure 4.2: Transaction Emulator Script

The Tagging System mechanism we arrived at, along with the Transaction Table architecture, is the one shown in Section 3.2.3. To verify this block we developed a testbench in which different transactions were implemented as tasks, asserting the appropriate control signals that are input to the
Transaction Table. What we focused on seeing was:
• Different transactions are being recorded on the Tx tables, and the Transaction Table
outputting the tag for the FC
• Different transactions being recorded on the Rx table, and the Transaction Table out-
putting the correct information for the Receiver
• The completion of transactions from both ends, either by inputting an ID from the FC
or by giving a tag from the Receiver
The one-cycle operation of the Transaction Table was also crucial, so we took care during the design to implement the lookup hardware in parallel so that this property holds.
Verifying The Flit Deconstruction Blocks
For each Flit Deconstruction Block, a testbench was created to verify the different opera-
tions of the block. For the AR and B FD blocks it is fairly simple to test the transmission of
multiple AR and B flits back to back to the AXI block. Again, the flit stimulus is used. For
the WAW and R blocks, we have to ensure the proper state transitions. Flit stimulus is again
used, taking advantage of previously generated cases for the Receiver engine. Once the proper
function of FC is ensured, we can move on to system integration of all domains.
Verifying The Flit Construction Blocks
Flit Construction blocks are divided into Master and Slave FC blocks for the respective AXI interfaces. Each Flit Construct block operates as a Finite State Machine with the states IDLE, Flit-Initiation, Flit-Tag and Flit-Creation; the functionality of these states is explained in detail in the method chapter. The testbench for the respective Flit Construction block generates stimulus for all five AXI channel buffers and provides the correct tag stimulus from the Transaction Table during the Flit-Tag phase, so as to functionally verify the Flit Construct blocks comprehensively.

4.1.2 System Verification

After all sub-systems were verified using the methods previously described, it was time
to integrate the system and build the whole simulation environment. We decided that since
the function of the Offload Engine was symmetrical, it would be sensible to connect two
instantiations of the Offload Engine in this environment. To simulate the Adapter through
the FDI, a Die-to-Die Adapter module is also necessary to drive the UCIe control signals. The
whole structure can be seen in Figure 4.3.

Figure 4.3: The testbench used for whole system verification.

An explanation of each module follows.

AXI Driver

Different AXI channels can be simulated using tasks in the encompassing test bench. Each
Transaction is a toggling of the appropriate signals that are connected to the AXI Domain of
the Offload Engine, and in this way, we can drive them from Device 1 and see them at Device
2’s output. A simulated response can then be driven from Device 2’s AXI Interface.

Bring-Up Flow Driver and FDI Data Driver

For the Bring-Up Flow Driver, a technique from the PSM block verification is used.
The bring-up flow sequence is:
• Clock cycle 1: pl_inband_pres and pl_protocol_vld are asserted.
• Clock cycle 2: pl_wake_ack and pl_rx_active_req are asserted.
• Clock cycle 3: pl_state_sts is set to 0000b, the value for Active.
This follows the rules for UCIe FDI bring-up flow, defined in the specification [26]. This
Driver is connected to both devices and initiates the flow for them simultaneously.
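A sketch of how such a driver can be expressed as a per-cycle list of signal assertions; the FDI signal names come from the sequence above, while the surrounding harness (tasks, clocking) is left open.

# Per-cycle stimulus of the Bring-Up Flow Driver, applied to both devices.
BRING_UP_SEQUENCE = [
    {"pl_inband_pres": 1, "pl_protocol_vld": 1},   # clock cycle 1
    {"pl_wake_ack": 1, "pl_rx_active_req": 1},     # clock cycle 2
    {"pl_state_sts": 0b0000},                      # clock cycle 3: Active
]

for cycle, assertions in enumerate(BRING_UP_SEQUENCE, start=1):
    print(f"cycle {cycle}: drive {assertions}")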

In the FDI Data Driver component of the Adapter Simulation Module, we need to drive
the data handshake signals for the Transmitter and Receiver of both devices. To simulate the
link, we include two FIFO buffers: one which stores flits from Device 1 to 2, and the other
which stores flits from Device 2 to 1. Once a Flit enters either FIFO, the module sends the
Flit to the receiving device by mounting it on the corresponding pl_data bus and raising the
associated pl_valid. The module will keep accepting flits from either engine so long as the
respective FIFO buffer is not full.
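A small model of those two link FIFOs is sketched below; the depth and method names are our own choices for illustration.

from collections import deque
from typing import Optional

class SimulatedLink:
    # Two FIFOs, one per direction, standing in for the Adapter-to-Adapter link.
    def __init__(self, depth: int = 8):
        self._fifos = {(1, 2): deque(), (2, 1): deque()}
        self._depth = depth

    def accept(self, src: int, dst: int, flit: bytes) -> bool:
        fifo = self._fifos[(src, dst)]
        if len(fifo) >= self._depth:
            return False              # FIFO full: stop accepting flits from this engine
        fifo.append(flit)
        return True

    def deliver(self, src: int, dst: int) -> Optional[bytes]:
        # Next flit to mount on the receiving device's pl_data bus (pl_valid high), if any.
        fifo = self._fifos[(src, dst)]
        return fifo.popleft() if fifo else None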

4.1.3 Results

In this work, we focused on the functionality of the design and tested the elaborate mech-
anisms we developed for the architecture. For this purpose, a simulation of the design is
sufficient. Power and Area metrics may be obtained from the synthesis of the design, but this
is a long and arduous process which goes beyond the scope of our work. However, we can
present the latency and throughput of the Offload Engine.

Latency and Throughput

The latency of the Offload Engine may be measured by the simplest type of completion
which is an AR-R completion, or the transmission of an AR transaction from Device 1 to
Device 2 and the R data response from Device 2 to Device 1. The results of this simulation
can be seen in Figure 4.4.

Figure 4.4: AR-R System Simulation Results

FDI Bring-Up takes 3 cycles. From then on, an AR transaction is mounted on the Slave
AXI AR channel of device one.
• Within 10 cycles it passes through Device 1 and is given to its Adapter.
• After arriving at PL_DATA of Device 2, it is shown on the AXI output after 12 cycles.
Device 2 then receives the R DATA on its AXI Master Read. It consists of 5 beats.
• In 16 cycles it passes through Device 2 and is sent on the FDI. If we do not account for
the 4 cycles it takes for the multiple beats of the transaction, the latency of this stage is
12 cycles.

• Upon arrival from the Adapter, it takes 16 cycles to finally arrive at the output of Device 1, AXI Slave Read.
In total for this scenario we have a latency of 50 cycles. A breakdown of the latency
for the transmission and reception of an AR transaction is shown in Table 4.1.

Path                         Added latency (cycles)
AXI input to FC              2
FC                           3
FC to FDI (Transmitter)      1
FDI to FD (Receiver)         4
FD                           1
FD to AXI output             1

Table 4.1: Latency of the Offload Engine for an AR transaction
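As a quick cross-check of the 50-cycle total quoted above, the four hops of the AR-R scenario can be summed directly; the stage labels are our own summary of the bullet list.

ar_r_stages = {
    "Device 1: AXI AR input to FDI": 10,
    "Device 2: FDI to AXI output": 12,
    "Device 2: AXI R input to FDI (excluding 4 beat cycles)": 12,
    "Device 1: FDI to AXI output": 16,
}
print(sum(ar_r_stages.values()))   # 50 cycles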

It’s important to note that latency is dependant on the size and type of the transaction, as
well as the state that the buffers are in. The Receiver typically takes the highest number of
cycles due to the reliability mechanisms, the flit cancel from Adapter as well as the tag check.
During burst mode, the module maintains data throughput from the AXI side as there are
no interruptions in the data.
Going through the data on the latency and throughput from the Table 4.1, we can observe
and analysis that the latency and throughput are maintained through the design of the offload
engine. The latency of the design is dependent on the design buffers and transaction types
and these can be improved through the constant development on the AXI flits and optimal
utilization of the buffers. We can see that the data throughput is maintained int he sense that
the the AXI data is passed through the offload engine with no loss and the data throughput
can be further improved the introduction for pipeline architecture int he Flit construction
blocks helping multiple transaction to be simultaneously created.

Synthesis Results

To get a sense of the size of our design, one instantiation of the Offload Engine was synthesized using Vivado for the Zynq UltraScale+ MPSoC ZCU104, which houses an FPGA with enough blocks to accommodate our design. We took care to prevent the tool from optimizing away our logic by doing out-of-context synthesis, since we did not map the design's outputs to the pins of the FPGA. From the results of this synthesis, we note the utilization of one Offload Engine, shown in Figure 4.5.

Figure 4.5: Utilization of one Offload Engine

It is sensible for the Protocol Engine to host the most logic blocks, due to the size of the arbitration logic as well as the Receiver buffers. This is also reflected in the latency of the Receiver.
From this raw synthesis of the offload engine we get a sense of how large it is and which parts consume the most area. Area utilization is highest in the Protocol Engine because of the extensive use of the Receiver buffers. In terms of improving area utilization, buffer structures such as the Write response buffers are not used to their full extent; redesigning the Receiver buffers to accommodate multiple types of transactions in a single buffer structure could reduce wasted space and, in our design, reduce costs by up to 20 percent, since Write responses are one of the five transaction types.
From the timing analysis, we also know the critical path of our design, which lies in payload
creation in the Flit Construct block.

Chapter 5

Conclusion

In this chapter, we will discuss the limitations of our design of the Offload Engine, followed by possible improvements.
By discussing the limitations, possible improvements, and other important considerations for the Offload Engine, this chapter aims to provide a comprehensive understanding of the current challenges and the opportunities to further integrate the offload engine into a bigger architecture.

5.1 Limitations
The Offload Engine was developed as a starting point for the idea of AXI-based flit data transfer in a D2D framework. It is crucial to acknowledge its limitations, as this is essential for setting realistic expectations and identifying potential improvements to the system. The constraints of our design are the following:
• Flit Construct Bottleneck: The design contains two instances of the Flit Construct block, one per AXI Master/Slave interface, so that flits can be generated simultaneously. However, each AXI interface contains five channels that must wait their turn to create flits, as there are 5 AXI channels but only 2 Flit Construct (FC) instances.
• Limited AxSIZE support: The current implementation of the Offload Engine only provides support for AxSIZE values of 1, 2 and 4.
• Symmetric-Only Connection: Two Offload Engines that are connected are mandated
to have the same tag spaces, the same outstanding transaction limit, and the same
allocated credits. Though theoretically possible, different tag spaces are not supported.

5.2 Possible Improvements


We will briefly discuss the possible improvements to our offload engine design:
• Utilizing UCIe's D2D Adapter capabilities: In our work we assemble flits in a 64-byte format. This is not arbitrary but a result of the allowed sizes of the FDI, which connects our engine, in the Protocol Layer, to the Adapter Layer. This mode is described as 'Raw Mode' and passes through the UCIe stack.

• Flit Construct Pipelining: The Flit Construct blocks run on a finite-state model where, for a flit to be generated, the module needs to step through all the states. This can hamper data throughput when multiple large bursts need to be turned into flits. The bottleneck can be eliminated if the state machine is broken down into a pipeline, thus maintaining parallel processing throughout the AXI domain.
• Low power mode for both the AXI Engine and Protocol Engine: with the implementation of clock gating, the Adapter can request entry into a low-power state, during which AXI transactions may be halted.
• Making use of UCIe control signals and different states: We can request a
LinkError state from the Adapter. This may be useful if, for instance, an AXI module
disconnects.
• Larger AxSIZE support: For AxSIZE under 32, not many things have to change. A
small addition to the FD and FC blocks is sufficient. For AxSIZE above 32, the whole
beat does not fit inside the flit. There has been a small discussion on how to handle this
in section 3.2.2. We would have to assemble each beat in FD and divide it in FC.

5.3 Conclusion

In essence, the objective of this thesis was to design an offload engine that integrates the AXI protocol within the SoC with a D2D interconnect, in this case UCIe. The background research included a detailed understanding of the workings of the Advanced eXtensible Interface (AXI) protocol across SoC subsystems, the concept of Die-to-Die systems, and the development of the growing Universal Chiplet Interconnect Express (UCIe) as the new D2D interconnect standard. The work in designing the offload engine provides a novel method to transfer AXI signals through the D2D interconnect, where the AXI-based flits integrate seamlessly with both the AXI and the UCIe specifications. Our design facilitates bridging the gap between chiplets and replicating the AXI framework across multiple chiplets. Throughout the design process we strived to balance:
• Complexity, by setting a limit on outgoing transactions.
• Latency, by allowing two flits to be constructed and sent simultaneously.
• Size, by restricting our buffer sizes to a sensible, optimized amount; the buffers are the main constraint in terms of area utilization.
Through the development of the offload engine, the design showed a heavy dependency on address/data buffers to hold valid data before transfer in order to maintain integrity throughout the system. With pipelining optimizations, data throughput will be maintained and the latency will ultimately depend only on the physical-layer transfer. The journey of designing and extensively testing the design concludes with data throughput being maintained for burst-mode transactions across the dies, and with an initial latency in the range of 20-30 clock cycles being observed.
As a final note, a direction for future development of the offload engine is to focus on further utilizing the UCIe capabilities. Integrating the UCIe D2D Adapter more tightly with the offload engine can add more control over the data transfer, bringing in Link State Management, Parameter Negotiation and the CRC/Retry mechanisms to better optimize the offload engine.

Bibliography

[1] J. Kim, G. Murali, H. Park, E. Qin, H. Kwon, V. C. K. Chekuri, N. Dasari, A. Singh,


M. Lee, H. M. Torun, M. Swaminathan, M. Swaminathan, S. Mukhopadhyay, T. Krishna,
and S. K. Lim, “Architecture, chip, and package co-design flow for 2.5d ic design enabling
heterogeneous ip reuse,” in 2019 56th ACM/IEEE Design Automation Conference (DAC),
2019, pp. 1–6.
[2] B. Jiao, H. Zhu, J. Zhang, S. Wang, X. Kang, L. Zhang, M. Wang, and C. Chen,
“Computing utilization enhancement for chiplet-based homogeneous processing-in-
memory deep learning processors,” in Proceedings of the 2021 Great Lakes Symposium on
VLSI, ser. GLSVLSI ’21. New York, NY, USA: Association for Computing Machinery,
2021, p. 241–246. [Online]. Available: https://doi.org/10.1145/3453688.3461499
[3] L. Roberts and B. Wessler, “Computer network development to achieve resource sharing,”
AFIPS Proceedings, vol. 36, pp. 543–549, 01 1970.
[4] L. Pouzin, “Presentation and major design aspects of the cyclades computer
network,” in Proceedings of the Third ACM Symposium on Data Communications
and Data Networks: Analysis and Design, ser. DATACOMM ’73. New York, NY,
USA: Association for Computing Machinery, 1973, p. 80–87. [Online]. Available:
https://doi.org/10.1145/800280.811034
[5] A. McKenzie, “INWG and the Conception of the Internet: An Eyewitness Account,”
IEEE Annals of the History of Computing, vol. 33, no. 1, pp. 66–71, 2011.
[6] J. Day and H. Zimmermann, “The OSI reference model,” Proceedings of the IEEE, vol. 71,
no. 12, pp. 1334–1340, 1983.
[7] Andrew L. Russell, “OSI: The Internet That Wasn’t,” IEEE Spectrum, vol. 50, no. 8, July
2013.
[8] I. S. Organization, "ISO/IEC 7498-1:1994(E): Information technology – Open Systems
Interconnection – Basic Reference Model: The Basic Model,” International Organization
for Standardization, Geneva, CH, Standard, 1994.
[9] R. T. Fielding and J. Reschke, “Hypertext Transfer Protocol (HTTP/1.1):
Semantics and Content,” RFC 7231, Jun. 2014. [Online]. Available: https:
//www.rfc-editor.org/info/rfc7231
[10] P. Resnick, “Internet Message Format,” RFC 5322, Oct. 2008. [Online]. Available:
https://www.rfc-editor.org/info/rfc5322
[11] “Transfer Syntax NDR,” 1997. [Online]. Available: https://pubs.opengroup.org/
onlinepubs/9629399/chap14.htm

[12] Apple Computer Inc., “AppleTalk Session Protocol,” 1985. [Online]. Available: https:
//developer.apple.com/library/archive/documentation/mac/pdf/Networking/ASP.pdf
[13] “User Datagram Protocol,” RFC 768, Aug. 1980. [Online]. Available: https:
//www.rfc-editor.org/info/rfc768
[14] W. Eddy, “Transmission Control Protocol (TCP),” RFC 9293, Aug. 2022. [Online].
Available: https://www.rfc-editor.org/info/rfc9293
[15] Tanenbaum, Andrew S. and Wetherall, David, Computer Networks, 5th ed. Boston:
Prentice Hall, 2011. [Online]. Available: https://www.safaribooksonline.com/library/
view/computer-networks-fifth/9780133485936/
[16] V. Cerf and R. Kahn, “A Protocol for Packet Network Intercommunication,” IEEE Trans-
actions on Communications, vol. 22, no. 5, pp. 637–648, 1974.
[17] G. E. Moore, “Cramming more components onto integrated circuits, reprinted from elec-
tronics, volume 38, number 8, april 19, 1965, pp.114 ff.” IEEE Solid-State Circuits Society
Newsletter, vol. 11, no. 3, pp. 33–35, 2006.
[18] R. Ratnesh, A. Goel, G. Kaushik, H. Garg, Chandan, M. Singh, and B. Prasad,
“Advancement and challenges in mosfet scaling,” Materials Science in Semiconductor
Processing, vol. 134, p. 106002, 2021. [Online]. Available: https://www.sciencedirect.
com/science/article/pii/S1369800121003498
[19] International Technology Roadmap for Semiconductors, “2011 Executive Summary,”
2011. [Online]. Available: https://www.semiconductors.org/wp-content/uploads/2018/
08/2011ExecSum.pdf
[20] J. Lau, Heterogeneous Integrations. Springer, 01 2019.
[21] Microsystems Technology Office of the Defense Advanced Research Projects Agency,
“Broad Agency Announcement: Common Heterogeneous Integration and IP Reuse Strate-
gies (CHIPS),” https://viterbi.usc.edu/links/webuploads/DARPA%20BAA%2016-62.
pdf, 9 2016.
[22] Open Domain-Specific Architecture (ODSA) Consortium, “Open High Bandwidth
Interface (OHBI) Specification version 1.0,” 2021. [Online]. Available: https:
//www.opencompute.org/documents/odsa-openhbi-v1-0-spec-rc-final-1-pdf
[23] ——, “Bunch of Wires (BoW) PHY Specification Version 1.9,” 2023. [Online]. Available:
https://opencomputeproject.github.io/ODSA-BoW/bow_specification.html
[24] “Advanced Interface Bus Specfication.” [Online]. Available: https://github.com/
chipsalliance/AIB-specification
[25] “Universal Chiplet Interconnect Express homepage.” [Online]. Available: https:
//www.uciexpress.org/
[26] “Universal Chiplet Interconnect Express Specfication 2.0.”
[27] ARM Limited, "AXI protocol," 2003. [Online]. Available: https://developer.arm.com/
documentation/ihi0022/latest/9
[28] C. Ma, Z. Liu, and X. Ma, “Design and implementation of apb bridge based on amba
4.0,” in 2011 International Conference on Consumer Electronics, Communications and
Networks (CECNet), 2011, pp. 193–196.

[29] N. Dorairaj, D. Kehlet, F. Sheikh, J. Zhang, Y. Huang, and S. Wang, “Open-source axi4
adapters for chiplet architectures,” in 2023 IEEE Custom Integrated Circuits Conference
(CICC), April 2023, pp. 1–5.
[30] R. Budruk, D. Anderson, and E. Solari, PCI Express System Architecture. Pearson
Education, 2003.

Printed by Tryckeriet i E-huset, Lund 2024

Series of Master’s theses


Department of Electrical and Information Technology
LU/LTH-EIT 2024-1017
http://www.eit.lth.se
