Abstract
   Novel circuit design and manufacturing methodologies are emerging in the market. Heterogeneous Integration allows chiplets built on different process nodes to be combined in one large package. Seamless communication between them requires the definition of standards, as well as the construction of interconnects that follow them. In this thesis, we present an offload engine which bridges the popular System-on-Chip (SoC) communication standard, the Advanced eXtensible Interface (AXI), with the evolving Universal Chiplet Interconnect Express (UCIe). We dive into the details of the offload engine design process, the challenges encountered, and the novel methods used to connect the AXI protocol to the UCIe interface. The resulting architecture is meant to be mounted in the Protocol Layer of the UCIe model. Balancing the complexity, latency, and size of the design, we justify each decision taken along the way. We also present future improvements which can be made to the design. The result is an interface which may be used as part of a Die-to-Die interconnection, fully compatible with the UCIe standard. It may serve as the start of a commercial product or provide insights into interconnection technologies. This work is relevant for designers and researchers alike who wish to integrate their AXI-based architectures into a heterogeneous package.
Popular Science Summary
    In the past decade, novel design and manufacturing methods have been introduced to the field of integrated circuits. As society's need for design complexity grows, the architectures, methodologies, packaging and fabrication techniques that are widespread in the industry require constant upgrades. Miniaturization, the shrinking of circuit components, is pushing technology to its physical limits. Heterogeneous Integration combines circuits with different origins of fabrication and process technology for a more versatile production flow. Meanwhile, there is also a need to bridge pre-existing higher-level standards with these new techniques. In this study, we present our own AXI to Die-to-Die interface. First, we introduce the background knowledge necessary to understand the context of our work, as well as the standards we chose to follow and why we follow them. We dive into the details of the AXI protocol, which is widely used in SoC architectures, as well as UCIe, an emerging die-to-die communication standard. Our work aims to bridge these two popular standards in our own Offload Engine. We explain the details of our architecture, the challenges we faced and how we adapted our work to overcome them, as well as the verification methods we used to test our Engine. UCIe is an open, multi-institutional effort to solve the heterogeneous communication problem, in which multiple partners (Intel, Qualcomm, AMD, Arm, TSMC and more) are actively involved. As the standard is developed, tested and integrated into more and more designs, the relevance of our work comes to light.
Acknowledgements
   We wish to express our deepest gratitude to our supervisors, Professor Liang Liu from
Lund University and Faruk Sande from the Ericsson BE team, for the knowledge and technical
support they have provided us throughout the development of this thesis. Their insights and
assistance have been key in our design.
   We would also like to thank Ericsson Lund for supporting this thesis topic and providing us with the environment to explore our ideas and expand our technical background. Our supervisor Faruk Sande guided us throughout the whole process, aided us in exploring our ideas and provided invaluable feedback. We would also like to express our gratitude towards Lund University, Faculty of Engineering (LTH), for the enlightening academic experience and the skills required for this thesis topic.
   Beyond that, we would like to thank the open source community of the UCIe Consortium as well as the ARM community for providing us with the materials we needed and a space for discussion on the topic. Their support throughout the design phase has been substantial.
   To finish, we are grateful to our friends and family, who stood by us with unwavering support during this time and motivated us to do our best. In addition, we are thankful to each other for the natural collaboration and mutual drive to achieve our greatest work.
Abbreviations
Contents
1 Introduction
  1.1 Stakeholders
2 Scientific Background
  2.1 The dawn of Computer Networks
  2.2 The OSI Reference Model
  2.3 Another world: TCP/IP Reference Model
  2.4 Die to Die Systems
  2.5 Universal Chiplet Interconnect Express
      2.5.1 UCIe Components
      2.5.2 The Protocol Layer
      2.5.3 The Die-to-Die Adapter
      2.5.4 The Physical Layer
  2.6 AXI Protocol
      2.6.1 AXI Architecture
      2.6.2 AXI Components
      2.6.3 AXI Basic Transaction
  2.7 Pre-Existing Work
3 Design Implementation
  3.1 Offload Engine Overview
  3.2 Design
      3.2.1 AXI Engine
      3.2.2 Flit Architecture
      3.2.3 Protocol Engine
  3.3 Challenges
      3.3.1 Architecture Development
      3.3.2 Flit Development
4 Results
  4.1 Verification
      4.1.1 Block Verification
      4.1.2 System Verification
      4.1.3 Results
5 Conclusion
  5.1 Limitations
  5.2 Possible Improvements
  5.3 Conclusion
Chapter 1
Introduction
In the late 1960s, Frank Wanlass's work towards developing Complementary Metal-Oxide-Semiconductor (CMOS) technology laid the foundation for an exponential rise in the semiconductor industry. CMOS is the basic framework for most electronic devices, with notable advantages over its Transistor-Transistor Logic (TTL) counterparts in terms of low power consumption in the idle state, speed, and a wider voltage range. The demand for faster and more efficient chips has sparked immense developments.
    However, despite the extensive work on miniaturizing CMOS chips for smaller and faster processing, several challenges and limitations have been identified as transistors decrease in size and become more densely packed, notably in the 3 nm chips used in the new Apple iPhones. Power efficiency in particular becomes challenging, with increased power density and leakage currents leading to higher power consumption per chip. Shrinking feature sizes lead to yet another challenge, interconnect scaling: the resulting increase in resistance-capacitance (RC) delays limits signal propagation speeds, hampering overall performance. Moreover, semiconductors are becoming susceptible to process variation and reliability issues caused by electromigration and ageing effects, further degrading on-chip performance. Finally, increased complexity in both design and fabrication processes has driven up manufacturing costs for the major players in the semiconductor industry.
   Enter chiplets: modular semiconductor components that offer a novel solution to the challenges of traditional monolithic chip design. By breaking down complex integrated circuits (ICs) into smaller, specialized modules, chiplets enable greater design flexibility. One study found that chiplet-based design approaches can achieve up to a 20% reduction in design time compared to monolithic designs; chiplets additionally improve yield rates [1]. By isolating functionalities into separate chiplets, a defective chiplet can be discarded without impacting the entire chip, leading to yield improvements of up to 30% according to Zhang et al. [2]. Chiplets also enhance customization options, allowing designers to mix and match chiplets from various vendors to create application-specific SoCs.
   The emergence of chiplets marks a paradigm shift in semiconductor technology, presenting a viable path to surmount the limitations of conventional monolithic chips and unlock unprecedented levels of performance, scalability, and energy efficiency.
   Chiplets offer potential solutions to the limitations of traditional chip design, yet their adoption faces challenges. Integrating and coordinating communication between chiplets within larger systems pose engineering hurdles, and concerns about yield, reliability, and testing strategies remain areas of active research. Despite these challenges, chiplets enable incremental upgrades, reduce time-to-market, and promote the reuse of proven designs and intellectual property.
    The focus of this thesis lies in designing an offload engine that integrates a Die-to-Die communication protocol, UCIe (Universal Chiplet Interconnect Express), with an established internal SoC (System on Chip) communication protocol, AXI (Advanced eXtensible Interface). This involved acquiring a working understanding of similar physical communication interface IPs, such as PCIe (Peripheral Component Interconnect Express), in the simulation environment provided by Ericsson.
   This thesis is divided into four parts. We start with a scientific background which aims to provide a deeper understanding of the key concepts of AXI, UCIe, the protocol engine and how they all work together. Next, we proceed with the implementation of the offload engine through the design and scripting of the architecture. Following the implementation, the results are presented to evaluate the potential and limitations of the offload engine for Die-to-Die communication integration. Finally, we examine the limitations, possible improvements, and other important considerations for the offload engine in the discussion chapter.
1.1    Stakeholders
   This thesis project was proposed and hosted by the ASIC SoC Integration and BE team at Ericsson, Lund, Sweden. The company proposed the research topic, which focuses on evaluating Die-to-Die communication protocols and designing an offload engine to integrate Die-to-Die communication in the SoC. Ericsson also provided essential resources and support for the successful completion of this project.

 Stakeholders          Benefits
 Ericsson Lund         1. Academic and technical support from the team.
                       2. Access to design and simulation tools, as well as to the UCIe Consortium.
Chapter 2
Scientific Background
   In this chapter, we aim to cover all the necessary topics and provide the background knowledge needed for the reader to understand the depth of our work.
   There will be an overview of computer networks, the previously presented die-to-die systems, an analysis of the specification this work is focused on, and a presentation of the principles on which the work is based, such as the commonly known PCIe (Peripheral Component Interconnect Express) protocol layer and the AXI (Advanced eXtensible Interface) protocol, as well as a look at a previously published solution to the problem we set out to solve.
Figure 2.1: Connections in the CYCLADES network (1973)
Figure 2.2: Connections in the ARPANET network (1970)
style of communication Vinton Cerf, the first chairman of INWG, was proposing, due to their heavy investment in connection-oriented communications. He later resigned and started working at ARPA (Advanced Research Projects Agency) on their packet-switching network. In 1977, members of INWG who represented the British computer industry proposed the Open Systems Interconnection (OSI) [6], driven by the need for network standards for open working. The proposal was taken up by the International Organization for Standardization (ISO).
are considered the basic vocabulary of network architecture. A depiction of the OSI reference
model is shown in Figure 2.3.
   In the reference model, seven layers are defined. A brief explanation of each layer follows
so that the reader may get a sense of the context in which our work fits.
   • The Application Layer
     The Application Layer serves as the gateway for an application process of a system to access the OSI environment. An application process may be represented by an application entity. Services provided by this layer, besides the transfer of data, include the identification of intended communication partners, the determination of the acceptable quality of service, synchronization, agreement on security aspects, and many others. The Application Layer comes with its own set of protocols, some of which are in the public domain. A popular example is HTTP (the HyperText Transfer Protocol [9]), which is published and maintained by the Internet Engineering Task Force (IETF). The Web is the Internet's client-server application which allows the transfer of information between users. HTTP, which is the Web's application-layer protocol, defines the format and sequence of the messages exchanged between an Internet browser (the interactive interface of the Web application) and a Web server. Application-layer services and protocols are thus often just one part of a larger application. The IETF has also published SMTP, the Simple Mail Transfer Protocol [10], known as the basis of electronic mail; it is but one piece of an e-mail application such as Microsoft Outlook. Other protocols may be proprietary and not available to the public, such as the application-layer protocols used by Skype.
   • The Presentation Layer
     The Presentation Layer is responsible for the syntax and semantics of the information transmitted between two application entities. This prevents any problems related to a mismatch in the representation of data between application entities: it provides syntax independence. Examples of this are data conversions such as character code translation from ASCII to EBCDIC or integer to floating point. Different file formats accessed by the Application Layer, such as JPEG (Joint Photographic Experts Group) or PDF (Portable Document Format), may be translated to a standard format fit for transmission and vice versa. An example of an implementation of this layer is NDR (Network Data Representation [11]), which is used by a software system called the Distributed Computing Environment. Modern protocols which cover presentation-layer functionality are not strictly defined in this layer of the OSI reference model.
   • The Session Layer
     The Session Layer allows users on different machines to establish sessions between them. It handles the synchronization and organization of the dialogue between two entities. Sessions created are tied to session addresses. Session Layer functions also perform token management and synchronization, which covers the reset of a session, the re-synchronization of a session, or the continuation of the dialogue from an agreed re-synchronization point, or defined state, with potential loss of data. Famous examples of protocols in this layer include the AppleTalk Session Protocol [12], which was produced by Apple in the mid-1980s as part of the AppleTalk protocol suite but eventually phased out when TCP/IP (Transmission Control Protocol/Internet Protocol) networking standards became preferred over OSI.
   • The Transport Layer
     The Transport Layer plays a vital role not only in the OSI reference model but in most network communication models in use today, having been widely adopted by designers. The Transport Layer is the last end-to-end layer in the model, meaning it is the last layer in which entities have logical communication with one another uninterrupted by other entities. The layers under the Transport Layer may only communicate with their immediate neighbours, and not between the ultimate source and destination machines, which may be separated by other machines such as routers or switches. The Transport Layer connects Session Layer entities to transport-addresses, providing the bridge to their corresponding entities on another machine. It then maps transport-addresses to network-addresses for the network layer, multiplexing the transport connections onto network-connections. It also controls these connections, implements end-to-end error detection and recovery, performs supervisory functions and flow control, and packages data into the correct formats. The Protocol Layer in the UCIe standard takes on the role of the OSI Transport Layer in multiple ways. The most notable examples of Transport Layer protocols include UDP (User Datagram Protocol [13]), which is connectionless, and TCP (Transmission Control Protocol [14]), which provides a connection-oriented and more reliable service to the invoking application. Both protocols are published by the IETF, and both are used on the Internet.
   • The Network Layer
     The Network Layer has arguably the heaviest load of all layers in a communications model, and most reference models have their own version of it. The Network Layer provides the functional and procedural means for connection between entities of the Transport Layer. The layer can establish, maintain and terminate network-connections. The original OSI reference model categorizes the many functions of the network into sublayers or subgroups, the most characteristic of them being routing and relaying. The network layer assigns each transport entity a network address and manages the connections between all the entities in the system by employing a routing algorithm. These connections may be static or dynamic, remote or local; this is handled by the network layer and hidden from the transport layer. The network layer also has the responsibility of dealing with resource allocation and scheduling to avoid congestion in the network. In many cases this responsibility is complex, but it may also be simple, as in a broadcasting network.
     The network layer is not concerned with the physical means of the connection, which is handled by the lower layers, but with the network layout at a device level. This layer and its functionality exist in the endpoints of a network, but also in intermediate devices, such as routers. An example of a network and devices which employ network layer functions is shown in Figure 2.4.
   • The Data Link Layer
     As we traverse down the reference model into the lower levels and come closer to a physical implementation of a network, we realise that the aforementioned connections between entities of host devices are segmented into smaller connections, or links. These links are managed by the Data Link Layer. Looking back at Figure 2.4, all devices are either routers or of higher complexity, but in reality there may be more intermediate devices, such as switches, which hold data link layer functionality. The Data Link Layer presents a seamless connection to the Network Layer, applying error correction and sometimes flow control to the data sent over the link which connects two data link layer entities. The data passed from the network layer to the data link layer is also segmented into data frames, which are smaller than the packets handed down from the layers above. Two nodes, or endpoints of a link, communicate by exchanging data frames and acknowledgement frames. There are two different types of link layer channels: broadcast channels, such as wireless local-area networks (WLANs) or hybrid fibre-coaxial cables (HFCs), and point-to-point communication links, such as a long-distance link between two routers or the connection between a computer and an Ethernet switch. On a broadcast channel, multiple hosts are connected to the same communications channel, so a medium access protocol is used to coordinate frame transmission. Coordinating access to a simple link is easier; an example of a protocol which serves this purpose is the Point-to-Point Protocol, used over serial cables and phone lines as well as cellular connections. Ethernet is a common data link protocol used in wired LANs.
     The Link Layer is often implemented in a network adapter, sometimes known as a network interface card (NIC). The controller inside the NIC is usually a special-purpose chip, so the link layer functions are largely implemented in hardware. Later on, we will see the parallel between the data link layer and the 'adapter' component of the UCIe stack.
   • The Physical Layer
     The Physical Layer can be considered the physical realization of the network. It provides the mechanical, electrical, functional and procedural means to activate, maintain and de-activate the connections used by the layers above. Physical layer entities are connected through a physical medium, through which the bits of information are transmitted. The type of medium, as well as the digital and analogue circuits that aid in its transmission, are all part of the physical layer. The medium, or media, can be grouped into two categories: guided, such as copper wires or fibre optics, and unguided, such as wireless or satellite transmissions. Phone networks across the globe rely on electromagnetic transmission utilizing radio frequencies of the electromagnetic spectrum. The uses of the full 10^4 Hz to 10^16 Hz electromagnetic spectrum for different physical media are shown in Figure 2.5.
Figure 2.5: The electromagnetic spectrum and its uses for communication [15]
       Various methods for encoding data onto frequencies have been developed over the years. The process of converting digital bits and symbols into analogue signals is called digital modulation. A modem (modulator-demodulator) is a device that converts a stream of digital bits to an analogue signal and back. Along with modulation, hardware in the physical layer also performs multiplexing, which is the transmission of multiple signals on the same channel. Multiplexing may be done using time or frequency intervals (time-division and frequency-division multiplexing), or by assigning distinct codes to each signal, known as Code-Division Multiple Access (CDMA).
    The TCP/IP Model is a reference model which predates the OSI model by roughly ten years. It has influenced much of the OSI architecture, but in turn, developments in OSI influenced the trajectory of TCP/IP as well. It has its origins in the protocol Vinton Cerf published with Robert Kahn [16]. This was the first true description of TCP, described by Cerf and Kahn as a Transmission Control Program, a name which has since come to mean Transmission Control Protocol, refined and defined as a standard in the Internet community. The other part of the acronym, IP, refers to the Internet Protocol, used in its lower layers to format data packets.
   The focus of the TCP/IP model is the protocols rather than the layers. That is why, in its visual representation, seen in Figure 2.6, the parallels with the OSI reference model are not straightforward. A rigid structure is limiting: omitting the protocols, as OSI does, allows the user to replace them as the technology changes, which is the main purpose of a layered structure in the first place. However, the fact that OSI was created with no protocols in mind led to its demise, as there was no practical experimentation to support the choices in the architecture. Focusing on functionality and backing the model with several specific, working protocols is what made TCP/IP so successful and widespread in today's Internet. Protocols used in TCP/IP may also be used in an OSI context, such as in the Transport Layer of OSI as previously described.
   Our analysis of computer networks ends here. For more information on network communication methods, as well as some insights into the politics that surrounded OSI and TCP/IP, the reader may refer to Andrew Tanenbaum and David Wetherall's "Computer Networks" [15] as well as other previously cited sources. The purpose of this section was to lay the foundation for the layered architecture that UCIe provides and to give a brief insight into how layered architecture concepts came to be in the first place. In the 1970s, through standardization, the world was striving to connect computers and devices of any type across the globe. Today, in this work, we strive to connect chiplets of any die across the package, an innovation which may prove to be just as revolutionary.
    In 1965, Gordon Moore observed that, with the growing complexity of circuit designs, the number of transistors in an integrated circuit would double every two years [17]. To counter the increasing area of integrated circuits, the semiconductor industry worked towards the most logical solution to the problem: to decrease the area of the transistor. This movement, whose basic parameter is largely the minimum dimension that can be lithographically reproduced, i.e. the gate length of the transistor, is known as miniaturization. The feature size of CMOS digital circuits has accordingly been decreasing significantly over the last decades, with every new node being ramped into manufacturing roughly every two years as designers and foundries strive to keep up with the growing number of transistors needed for their ever-growing design complexity.
   The shrinking process technology pushes a circuit to its physical limitations and comes with its own set of problems: short-channel, tunnelling, and threshold-voltage effects have to be taken into consideration and design methods have to be adapted. Power loss, reliability issues and temperature instability are challenges that become especially pronounced at the nanometre scale [18].
   The approaching physical limitation plateau is acknowledged in the decisions and predictions of industry experts. The International Technology Roadmap for Semiconductors (ITRS), a set of documents created and published by the Semiconductor Research Corporation with the sole purpose of technological assessment and prediction of trends, was guided by Moore's law from 1998 onward. Starting with the 2005 ITRS, an initiative began that looked to drive innovation beyond semiconductor scaling. Two sides of this initiative were More Moore, looking into novel digital integrated circuit methods such as Beyond-CMOS technologies in parallel to miniaturization, and More than Moore, focusing on the performance of analogue components such as sensors. The objective of the roadmap is shown in a schematic representation in Figure 2.7.
Figure 2.7: ITRS 2011 predicted trends in the semiconductor industry [19]
   The combination of these two technologies is equally critical. The integration of CMOS and
non-CMOS-based technologies within a single package, also known as System-in-Package
(SiP), is becoming increasingly important. System in Package is a subset of a larger tech-
nological approach called Heterogeneous Integration. This superset combines different
packaging technologies to integrate dissimilar chips, photonic devices, or components with
different materials and functions, and from different fabless design houses, foundries, wafer
sizes and companies into a system or subsystem [20]. Heterogeneous integration encompasses
advanced packaging technologies such as multi-chip modules (MCMs), SiPs, 2.5D, fan-outs
and 3D-ICs. The interconnect standard which we follow in this work supports 2D, 2.5D and, as of 2024, 3D packaging, as we will discuss later.
   Just as computer networks were pioneered by the United States Defense Advanced Research Projects Agency (DARPA) with its ARPANET, so heterogeneous integration of chiplets is being pioneered by the same agency [21]. In the Broad Agency Announcement for its CHIPS (Common Heterogeneous Integration and IP Reuse Strategies) program, DARPA proposes to establish a modular design and fabrication flow for electronic systems bound to an interface standard. This refers to systems that can be divided into functional circuit blocks, or chiplets, which are re-usable IP (Intellectual Property) blocks that are predesigned and already realized in physical form. In short, the aim is to devise a modular design and manufacturing flow for chiplets. In the BAA, DARPA solicits bids from outside companies to develop a large catalogue of third-party chiplets for commercial and military applications. As proposed, the CHIPS flow is expected to lead to a 70% reduction in design cost and turn-around time.
   In a multi-die architecture, chiplets need to be connected on a substrate or interposer to provide data transfer across the whole package. A definition of such an interface, binding chiplets to the interposer, is then necessary to standardize the operation of chiplet-based architectures.
   Most existing interconnect standards are proprietary, developed by major semiconductor vendors and used within their own environments for their advanced packages. There are, however, multiple initiatives, such as the work of Open Compute's Open Domain-Specific Architecture Initiative, which have been developing open standards such as the Open High Bandwidth Interface (OHBI) [22] or Bunch of Wires (BoW) [23]. Notably, both are open standards, encouraging the growth of a chiplet marketplace. However, both are only specified for the PHY, or physical layer. Into this category also falls the Advanced Interface Bus (AIB) [24], first developed by Intel and then taken on by the CHIPS Alliance. Later in this chapter, we will introduce a previous work with a similar objective to ours which utilizes this standard.
   A new open standard that could significantly alter this dynamic is the Universal Chiplet Interconnect Express (UCIe) standard [25], which aims for chiplet interoperability across vendors. UCIe is more capable than the interconnects that came before it and is defined across multiple abstraction layers. Already in UCIe specification 1.0, the UCIe protocol stack was established: the Protocol Layer, a Die-to-Die Adapter layer, and the Physical Layer. It draws from pre-existing standards, PCIe (Peripheral Component Interconnect Express) and CXL (Compute Express Link), which are both open serial communication, interconnect and architecture standards, making for an easier transition. Companies such as Synopsys and TSMC have already released their IPs for the Die-to-Die Adapter. These characteristics all make UCIe attractive and constitute the reason we follow this particular standard in our work.
    The creation of UCIe was motivated by the need to enable an ecosystem that supports disaggregated die architectures. Much like OSI, UCIe is a layered, multi-protocol open standard, with a specification defined to ensure interoperability across a wide range of devices with different performance characteristics [26]. UCIe supports multiple protocols already in practice throughout the industry while also allowing designers to implement their own, taking advantage of the rest of its layered architecture, in a mode called Streaming. The designer is also allowed to send raw data to the Adapter, which is what we chose to do in our work.
   An example of a package composed of heterogeneous chiplets connected by UCIe is shown
in Figure 2.8.
Figure 2.8: A package composed of CPU Dies, an Accelerator Die, and an I/O Tile Die connected through UCIe
   By allowing compatibility with PCIe and CXL, UCIe can leverage the unique characteristics of these standards, such as I/O or cache coherency, specifically meant to benefit different components. Learning from OSI's shortcomings, UCIe is legitimized by pre-existing standards that have been in use in the industry for years.
   In the next few subsections, we will explain some key components of the standard. First, we will present the architecture, talking briefly about each component. We will largely focus on the Protocol Layer and its connection to the layer beneath it. This information is taken from the current UCIe specification, version 1.0, revision 2.0, published on August 6th, 2024.
optimized. However, the basic structure is still there. We will refer to the PCIe stack further in this section to explain the transition and the context in which UCIe exists.
   The UCIe layer stack can be seen in Figure 2.9, where the partition of its architecture is very clear: a topmost Protocol Layer, the intermediate Die-to-Die Adapter layer, and the Physical Layer. Between the Protocol Layer and the Adapter rests the Flit-Aware Die-to-Die Interface (FDI), which is defined in the specification as a set of signals, with different versions offering different widths for the mainband. Similarly defined is the Raw Die-to-Die Interface (RDI), which is situated between the Adapter and the Physical Layer.
   The UCIe data path on the physical bumps is organized in groups of Lanes or Modules. A
module is the smallest functional unit that is recognized by the AFE, or Analog Front End.
The maximum number of Lanes allowed for a Module is defined by the package type. An
instance of a Protocol Layer or D2D Adapter can send data over multiple modules according
to the bandwidth needs of the application.
   The physical UCIe Link is composed of two connections: sideband and mainband.
   • Mainband is the main data path of UCIe. Components of this connection are a forwarded clock, a data valid pin, a track pin and the data Lanes that make up a module.
   • Sideband is the connection used for parameter exchanges, register accesses for debug/compliance, and coordination purposes: essentially the communication between the Die-to-Die Adapters of connected devices. It consists of a forwarded clock pin and a data pin in each direction. The clock is fixed at 800 MHz regardless of the mainband configuration. The sideband logic is on an auxiliary-power, "always on" domain, as it is used to train, retrain, reset, and bring the devices out of a low-power state.
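   As a rough illustration of the two connections listed above, the sketch below groups them into a single SystemVerilog interface. The signal names and the x16 Lane count are our own illustrative choices and not the exact names or parameters defined in the UCIe specification.

    // Illustrative grouping of one module's mainband and sideband pins.
    // Names are placeholders paraphrasing the description above.
    interface ucie_module_if #(parameter int NUM_LANES = 16);
      // Mainband: forwarded clock, valid, track and the data Lanes of the module
      logic                 mb_fwd_clk;
      logic                 mb_valid;
      logic                 mb_track;
      logic [NUM_LANES-1:0] mb_data;
      // Sideband: forwarded clock (fixed at 800 MHz, always-on domain)
      // and one data pin per direction
      logic                 sb_clk;
      logic                 sb_data_tx;
      logic                 sb_data_rx;
    endinterface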
   UCIe supports three heterogeneous integration packaging methods: Standard Package (2D), Advanced Package (2.5D), and, as of August 2024, UCIe-3D, a fully vertical packaging method.
   For the Standard Package option, the number of Lanes can be 16 (Standard Package x16) or 8 (Standard Package x8). This integration technology is used for low cost and long reach, 10 mm up to 25 mm from bump to bump. The interface for the Standard Package is shown in Figure 2.10, and the physical characteristics are shown in Table 2.1.
              Index                                         Value
     Supported speed per lane         4 GT/s, 8 GT/s, 12 GT/s, 16 GT/s, 24 GT/s, 32 GT/s
          Bump Pitch1                                  100 μm to 130 μm
       Short Channel Reach                                  10 mm
       Long Channel Reach                                   25 mm
    Raw Bit Error Rate (BER)                1e-27 for ≤ 8 GT/s, 1e-15 for ≥ 12 GT/s
   For the Advanced Package option, the number of Lanes can be 64 (Advanced Package x64) or 32 (Advanced Package x32), with 4 additional pins for Lane repair purposes. This integration technology is used for performance-optimized applications, with a channel reach of less than 2 mm from bump to bump. The interface can be realized with several different interposers or methods, three of which are shown in Figure 2.11, and the physical characteristics are shown in Table 2.2.
   ¹ Bump pitch is the centre-to-centre distance between adjacent solder bumps on the semiconductor die or substrate.
Figure 2.12: An example of 3D Die stacking [26]
Figure 2.13: Connection of 2 chiplets in UCIe-3D [26]
            Index                                      Value
    Supported speed per lane     4 GT/s, 8 GT/s, 12 GT/s, 16 GT/s, 24 GT/s, 32 GT/s
         Bump Pitch                                25 μm to 55 μm
        Channel Reach                                  2 mm
   Raw Bit Error Rate (BER)           1e-27 for ≤ 12 GT/s, 1e-15 for ≥ 16 GT/s
   For UCIe-3D, the UCIe architecture is pared down significantly compared to the previous packaging technologies. Since in this scenario chiplets are stacked one on top of the other, the circuit and logic must fit within their bump areas, which must be identical. An example of Die-to-Die stacking is shown in Figures 2.12 and 2.13.
    Due to the high density of connections, lower operating frequencies and a simplified circuit are possible. The Die-to-Die Adapter is also eliminated: there is no need for Retry or CRC mechanisms, and the SoC logic connects directly to the physical layer. The physical layer itself is also minimal, such as a simple inverter/driver. The width of each cluster, i.e. the number of Lanes for each module, is also significantly increased: up to 80. Physical characteristics of UCIe-3D are shown in Table 2.3.
                 Index                               Value
                 Supported speed per lane            up to 4 GT/s
                 Bump Pitch                          ≤ 10 μm for optimized mode,
                                                     10 μm to 25 μm for functional mode
                 Channel Reach                       vertical (3D stacking)
   The BER is low due to the low frequency and the almost-zero channel distance; however, the specification does not yet provide metrics, as this addition is quite new.
   Overall, the UCIe protocol layer is designed to manage the complexities of protocol-specific
data exchanges, ensuring that data is accurately formatted, transmitted, and received accord-
ing to the standards set by UCIe, PCIe, and CXL specifications. By supporting multiple Flit
modes and incorporating robust control mechanisms, the UCIe protocol layer plays a pivotal
role in maintaining the reliability, efficiency, and interoperability of chiplet-based systems.
   Our design focuses on the 68B Format under the Streaming protocol, for which we defined AXI-based Flits and maintained a credit system and internal buffers to send and receive the respective 128B flits. We chose the Streaming protocol for its flexibility: the development of the offload engine focuses on creating unique flits designed for the AXI protocol, within a framework of two Flit Construction block instances that generate flits for both the AXI Master and Slave interfaces simultaneously.
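   The credit bookkeeping mentioned above can be reduced to a counter per direction: it is decremented when a flit is handed to the Adapter and incremented when the remote side returns a credit. The sketch below is a minimal illustration of that idea only; the counter depth and port names are assumptions for the example and do not reproduce our actual implementation.

    // Minimal credit counter sketch (illustrative, not the thesis RTL).
    module tx_credits #(parameter int unsigned DEPTH = 8) (
      input  logic clk,
      input  logic rstn,
      input  logic flit_sent,        // one flit handed to the Adapter this cycle
      input  logic credit_returned,  // remote receiver freed one buffer slot
      output logic can_send          // high while at least one credit remains
    );
      logic [$clog2(DEPTH+1)-1:0] credits;
      always_ff @(posedge clk or negedge rstn)
        if (!rstn)
          credits <= DEPTH;          // start with a full set of credits
        else
          // can_send gates flit_sent upstream, so the counter cannot underflow
          credits <= credits + (credit_returned ? 1'b1 : 1'b0)
                             - (flit_sent       ? 1'b1 : 1'b0);
      assign can_send = (credits != 0);
    endmodule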
      this code provides a 3-bit detection guarantee, meaning it can detect up to a burst of 3 consecutive bits which may have changed during transmission. The CRC is always computed over 128 bytes of the message. For smaller messages, the message is zero-extended in the MSBs, and any bytes which are part of the 128B CRC message but are not transmitted over the Link are assigned to 0 (a minimal sketch of this zero-extension is given after this list). UCIe provides Verilog code for the CRC generation alongside the specification, which is to be used by designers as a "golden reference".
   • Retry is a mechanism found in the standards which are supported by UCIe. In PCIe, there is a buffer called the TLP (Transaction Layer Packet) Replay buffer in the PCIe Data Link Layer. In this buffer, packets are stored and then deleted once there has been an acknowledgement from the receiving device's Data Link Layer that the packet has been received properly (transaction completed). This is known as an ACK/NACK protocol, and the completion is sent in a DLLP (Data Link Layer Packet). The Retry scheme in UCIe is a simplified version of the modern PCIe Flit Retry mechanism. The foundation described above serves to explain this mechanism; only in this case, we have Nak and Ack flits.
   • Runtime Link Testing using Parity is a UCIe mechanism to gauge the reliability of the link by periodically inserting parity bytes in the middle of the data stream. This is an optional mechanism enabled by the software of each device, after which the appropriate messages are exchanged via the UCIe sideband.
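   To make the CRC framing above concrete, the sketch below only builds the 128 B message that the CRC is computed over, zero-extending a shorter payload in the MSBs. The payload width is an arbitrary example and the CRC computation itself is deliberately left out; the Verilog shipped with the UCIe specification remains the golden reference.

    // Builds the 128-byte CRC message from a shorter payload (sketch only).
    // Untransmitted bytes of the 128 B message are driven to zero, as described above.
    module crc_msg_builder #(
      parameter int PAYLOAD_BYTES = 68   // example width, not a UCIe requirement
    )(
      input  logic [8*PAYLOAD_BYTES-1:0] payload_i,
      output logic [1023:0]              crc_msg_o   // 128 bytes = 1024 bits
    );
      // Zero-extend in the MSBs so the CRC always sees a full 128-byte message.
      assign crc_msg_o = {{(1024 - 8*PAYLOAD_BYTES){1'b0}}, payload_i};
    endmodule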
   We will not look into the details of the sideband in this work, as we do not utilize it, but UCIe specifies a sideband for control messages, with defined formats and flows, much like PCIe messages. The dedication of a separate sideband to control messages contributes to UCIe's high performance.
                     Figure 2.14: Stages of UCIe Link Initialization [26]
   Stages 1 and 2 are relevant to the Physical Layer, so we will discuss them in subsection 2.5.4. Stage 3 is the responsibility of the Adapter, which acts according to information advertised by the Protocol Layer. This process is divided into three parts:
   • Part 1: Determine local capabilities. When the Physical Layer has finished Link Training, parameters about the link characteristics are made available to the protocol layer. Link speed and configuration are among these, from which the Adapter determines whether Retry must be enabled for Link operation. The ability to support Retry is then advertised to the remote link partner through the parameter exchanges.
   • Part 2: Parameter Exchange with Remote Link Partner. The parameter exchange is the process in which each device advertises its capabilities to the other. This is done via transmissions on the sideband, using specific sideband messages whose format is given in the UCIe specification. Among these capabilities are:
        – Whether the device is an upstream or downstream port, a capability which is relevant if the device is a UCIe Retimer. Retimers are beyond the scope of this work.
        – Whether Enhanced Multi-Protocol has been enabled, in the case of multiple protocol stacks connected to the same Adapter.
        – The bandwidth allowed for Stack 0 and Stack 1 if multi-protocol stacks are enabled.
        – Whether Management Transport Protocol is supported by the Protocol Layer and Adapter. This protocol is specific to version 2.0 of UCIe and enables the transport of management port messages over the mainband.
        – The flit formats supported in multi-protocol mode: 68B Flit Format, Standard 256B End Header Flit Format, Standard 256B Start Header Flit Format, Latency-Optimized 256B without Optional Bytes Flit Format, and Latency-Optimized 256B with Optional Bytes Flit Format.
      Parameter exchanges for different scenarios, depending on the protocols supported, are shown in the specification but omitted from this thesis. The Adapter must implement a timeout of 8 ms (-0%/+50%) for successful Parameter Exchange completion, including all of Parts 1 and 2. The timer only increments while the RDI is in the Active state.
   • Part 3: FDI bring-up flow. This part is the most relevant for this work, as it requires our engagement as the Protocol Layer, and it is reflected in Chapter 3. Once the FDI is in the Active state, stage 3 of initialization is concluded and Protocol flit transfer on the mainband may begin. The data width on the FDI is a function of the frequency of operation of the UCIe stack as well as the total bandwidth being transferred across the UCIe physical Link (which in turn depends on the number of Lanes and the speed at which the Lanes are operating). The flit formats which are allowed are decided by the Adapter before being communicated to the Protocol Layer.
Each FDI of a corresponding protocol stack has its own state machine, the Adapter has its own Link State Machine (LSM), and the RDI connected to it also has its own state machine. The hierarchy on which the Adapter bases its decisions is defined in the specification and depends on the state transitions. The transitions of importance to us are those on the FDI requested by or reflected to the Protocol Layer, so we will omit the details of the rest; the full FDI state machine can be seen in Figure 2.15. The Adapter has the following capabilities:
   • Retrain: The RDI propagates retraining of the link to all Adapter LSMs that are in
     Active state.
   • LinkError: can be raised by the physical layer or requested by the protocol layer.
   • LinkReset or Disabled is negotiated with the remote partner via the sideband and propagated to the RDI. LinkReset enables the re-negotiation of parameters.
   • Power Management States: L1 and L2. The main difference between these low-power states is the way to exit them, either by resetting the link or just retraining it. These states are meant to indicate to the Physical Layer that it may perform power management optimizations. They are also decided via the sideband, and the remote partner can choose not to acknowledge the request.
                         Figure 2.15: State Machine of the FDI [26]
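   From the Protocol Layer's point of view, the FDI bring-up of Part 3 reduces to requesting the Active state and waiting for the Adapter to report it before allowing flit transfer. The sketch below captures only that idea; the request/status signal names and state encodings are simplified stand-ins and not the actual FDI signal set.

    // Simplified FDI bring-up from the Protocol Layer side (illustrative only).
    typedef enum logic [1:0] {ST_RESET, ST_REQ_ACTIVE, ST_ACTIVE} fdi_state_e;

    module fdi_bringup (
      input  logic lclk,
      input  logic resetn,
      input  logic pl_status_active,   // Adapter reports the FDI as Active
      output logic lp_request_active,  // Protocol Layer requests the Active state
      output logic flit_tx_allowed     // mainband flit transfer may begin
    );
      fdi_state_e st;
      always_ff @(posedge lclk or negedge resetn)
        if (!resetn)                      st <= ST_RESET;
        else case (st)
          ST_RESET:      st <= ST_REQ_ACTIVE;                    // ask for Active
          ST_REQ_ACTIVE: if (pl_status_active) st <= ST_ACTIVE;  // wait for status
          default:       st <= st;                               // stay Active
        endcase
      assign lp_request_active = (st != ST_RESET);
      assign flit_tx_allowed   = (st == ST_ACTIVE);
    endmodule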
   The physical layer comprises not only the tangible aspect of the die-to-die interconnect, but also the functions performed to prepare the data sent by the Adapter, and it acts as the gateway for communication with the linked device. The physical layer can be divided into two functional entities: the Logical Physical Layer and the Electrical Physical Layer.
   The Logical Layer includes instructions and protocols that dictate the flow of flits on the link. These are implemented by digital circuits. Besides managing the state of the physical link, these functions are mostly associated with data transfer.
   • Link initialization, training, and power management. Parameters that are exchanged between the two remote UCIe partners before data transfer can start include information about the link. A Link Training State Machine is followed, during which first the sideband is initialized (SBINIT) and then the mainband (MBINIT), which in itself has many sub-states. The parameters that are exchanged on the sideband, and which aid during the training of the mainband, include:
        – Voltage Swing
        – Maximum Data Rate of the UCIe Link
        – Clock Mode
        – Clock Phase
        – Module ID, for multi-module configurations
        – UCIe-A x32, or the capability of an x64 Advanced Package Module to operate with
          an x32 Advanced Package Module.
   • Byte to Lane mapping for data transmission over the Lanes has been a prime function of the physical layer since older versions of PCIe. Depending on the package and Lane width, bits can be placed in different configurations. An example of the mapping of a Standard 256B Flit on a Standard Package x16 module is shown in Figure 2.16 (a small sketch of the striping idea is given at the end of this subsection).
is composed of FIFOs, de-skew circuits, etc. It uses a receiver clock to sample incoming data.
De-skew, when necessary, is performed during Training. The UCIe specification also provides the channel characteristics for both the Standard and Advanced Packages, as well as for UCIe-3D.
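   The byte-to-Lane mapping mentioned above amounts to striping consecutive flit bytes across the Lanes of a module. The sketch below illustrates that striping for an x16 module only; the exact ordering defined in the specification (and shown in Figure 2.16) may differ, so this is an assumption-laden illustration rather than the normative mapping.

    // Illustrative striping of a 256-byte flit across the 16 Lanes of a module.
    function automatic logic [7:0] lane_byte (
      input logic [255:0][7:0] flit,   // the flit viewed as 256 bytes
      input int unsigned       lane,   // Lane index, 0..15
      input int unsigned       beat    // mainband transfer cycle
    );
      // Byte sent on Lane 'lane' during 'beat': consecutive bytes go to consecutive Lanes.
      return flit[beat*16 + lane];
    endfunction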
2.6    AXI Protocol
   This section provides an extensive overview of the AXI protocol, beginning with its fundamental principles and basic architecture. We then move on to the specific components and the signals governing the AXI protocol. By examining the transaction process and the practical implementation of the protocol, we will gain a better understanding of AXI.
                        Figure 2.17: AXI4 Read Transaction [27]
   The separation of the address/control and data channels allows AXI to implement a pipelined architecture in which outstanding transactions can be supported, thereby improving the throughput and latency of the interconnect.
2.6.2    AXI Components
Channel definitions
All five independent channels of the AXI protocol consist of a set of information signals and use a two-way VALID/READY handshake mechanism. The source uses the VALID signal to indicate that valid data or address/control information is present on the channel. The destination uses the READY signal to indicate its readiness to accept that data or address/control information.
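   The handshake can be summed up in one rule: a transfer completes on a clock edge where both VALID and READY are high. The snippet below is a minimal sketch of a destination obeying that rule; the generic signal names (and the always-ready behaviour) are simplifications for illustration, since real AXI channels prefix the signals with AW, W, B, AR or R.

    // Minimal VALID/READY destination: data is captured only when both are high.
    module handshake_sink #(parameter int W = 32) (
      input  logic         aclk,
      input  logic         aresetn,
      input  logic         valid_i,  // driven by the source
      output logic         ready_o,  // driven by this destination
      input  logic [W-1:0] data_i,
      output logic [W-1:0] data_q
    );
      assign ready_o = 1'b1;                           // always ready in this sketch
      always_ff @(posedge aclk or negedge aresetn)
        if (!aresetn)                 data_q <= '0;
        else if (valid_i && ready_o)  data_q <= data_i;  // the transfer happens here
    endmodule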
   Address Channels
   Both the read and write transactions have their own address channel. The address channel carries the address and all the required control information for the transaction. The protocol supports a wide range of burst mechanisms: variable-length bursts of up to 16 data transfers per burst, transfer sizes of 8 to 1024 bits, wrapping, incrementing and non-incrementing bursts, and system-level caching control.
   Read data Channel
   The read data channel carries both the read data and the read response information from the slave interface to the master interface. The channel supports data bus widths of up to 1024 bits and an extensive read response showing the completion status of the read transaction.
   Write data Channel
   The write data channel carries the write data to be written from the master interface to the slave interface. The channel supports data bus widths of up to 1024 bits and includes strobe signals to indicate which byte lanes of the data bus are valid.
   Write data channel information is always treated as buffered, so that the master can perform write transactions without slave acknowledgement of the previous write transactions.
   Write response Channel
   The write response channel provides a way for the slave interface to indicate the write completion response to the master interface. The completion is signalled once per burst transaction, not for individual beats of data.
AXI Interconnect
Bigger systems consist of several master and slave interfaces connected through the Interconnect component. The interconnect serves as the bridge between the components of a multi-master, multi-slave AXI implementation and plays a crucial role in balancing interface complexity against system requirements. Most systems use one of three interconnect approaches (a small sketch of the address-decoding idea behind the first approach is given after Figure 2.19):
   • shared address and data buses
   • shared address buses and multiple data buses
   • multilayer, with multiple address and data buses
                             Figure 2.19: AXI Interconnect [27]
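   As a toy illustration of the shared-bus approach listed above, the sketch below routes one master's write-address handshake to one of two slaves through a simple address decode. The address ranges are made up for the example and do not correspond to any system discussed in this thesis.

    // Routes a single AW channel to one of two slaves by address range (sketch).
    module aw_decode (
      input  logic [31:0] awaddr,
      input  logic        awvalid,
      input  logic        awready_s0,   // readiness of each slave
      input  logic        awready_s1,
      output logic        awvalid_s0,   // slave 0: 0x0000_0000 - 0x0FFF_FFFF
      output logic        awvalid_s1,   // slave 1: 0x1000_0000 - 0x1FFF_FFFF
      output logic        awready       // readiness seen by the master
    );
      logic sel_s1;
      assign sel_s1     = (awaddr[31:28] == 4'h1);
      assign awvalid_s0 = awvalid && !sel_s1;
      assign awvalid_s1 = awvalid &&  sel_s1;
      assign awready    = sel_s1 ? awready_s1 : awready_s0;
    endmodule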
Write Transaction:
The master interface initiates the AXI write transaction by issuing a valid address signal on
AXI Write bus, AWADDR and issue all the control information respectively to indicate the
nature of the data transfer and assert the AWVALID to indicate that the address channel is
valid. AWVALID signal must remain asserted until the slave accepts the address and control
information by asserting the AWREADY signal.
   Once the address channel has completed its transaction, the master interface can initiate
the data transfer using the write data channel signals, such as WDATA and WLAST, asserting
WVALID throughout the transfer across multiple beats. WVALID remains asserted until all
the data has been accepted by the slave interface via WREADY.
   Once all the valid data has been accepted, the slave interface drives the BRESP signal
together with BVALID, indicating the completion of the transaction to the master interface.
BVALID remains asserted until the master accepts the response by asserting BREADY.
   These communication signals can be visualised in the corresponding figure, which depicts
the independent nature of the channels and the coordination of the address and data flow.
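   As a hedged illustration of this sequence, the hypothetical testbench task below drives one
complete write burst; it assumes clk and the AW/W/B signals are declared in the enclosing
testbench and is not part of the verification code of this thesis.

    // Sketch of an AXI write: address phase, data beats with WLAST on the last beat,
    // then the B response. Signals (awaddr, awlen, awvalid, awready, wdata, wlast,
    // wvalid, wready, bvalid, bready) are assumed to exist in the enclosing testbench.
    task automatic axi_write_burst(input logic [31:0] addr,
                                   input logic [31:0] beats[]);
      // Address phase: hold AWVALID until AWREADY is seen.
      awaddr  <= addr;
      awlen   <= beats.size() - 1;
      awvalid <= 1'b1;
      do @(posedge clk); while (!awready);
      awvalid <= 1'b0;
      // Data phase: one beat per handshake, WLAST on the final beat.
      foreach (beats[i]) begin
        wdata  <= beats[i];
        wlast  <= (i == beats.size() - 1);
        wvalid <= 1'b1;
        do @(posedge clk); while (!wready);
      end
      wvalid <= 1'b0;
      wlast  <= 1'b0;
      // Response phase: accept the write response on the B channel.
      bready <= 1'b1;
      do @(posedge clk); while (!bvalid);
      bready <= 1'b0;
    endtask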
Read Transaction
An AXI read transaction utilizes the read address and read data channels. The master interface
initiates the read transaction by issuing a valid address on the read address bus, ARADDR,
together with all the control information describing the nature of the data transfer, and
asserting ARVALID to indicate that the address channel is valid. ARVALID remains asserted
until the slave accepts the address and control information by asserting ARREADY.
    In the AXI protocol, the read transaction differs from the write transaction in that the
read data and read response are transferred on the same channel. Once the address is captured,
the slave interface sends valid read data to the master interface via the RDATA and RRESP
signals while asserting RVALID. RVALID is asserted until RLAST is asserted, indicating the
completion of the read data.
   The read transactions are depicted in the figure below.
2.7    Pre-Existing Work
Chapter 3
Design Implementation
   In this chapter, we will first present the necessity of an offload engine to integrate the system-
on-chip (SoC) chiplet framework with a Die-to-Die interconnect, specifically UCIe (Universal
Chiplet Interconnect Express). We will then describe the detailed process of designing the of-
fload engine, elaborating on the approach towards developing the subsystems that constitute
the offload engine, such as the AXI interface, protocol engine, and flit construction/deconstruc-
tion. In addition, we will discuss the challenges encountered during the design of these blocks,
providing insights into the solutions and design considerations that were implemented to over-
come these obstacles. This comprehensive overview aims to offer a clear understanding of the
offload engine’s design and its critical role in enhancing the functionality and performance of
the SoC in the chiplet framework.
   After introducing AXI and UCIe, the task that our Offload Engine has to complete is clear.
The AXI protocol has multiple channels through which transactions are sent and completed in
parallel. The UCIe Protocol Layer is the abstraction layer in which our Offload Engine resides,
and its role is to send chunks of data to the Die-to-Die Adapter in the appropriate format,
along with the proper control signals. Separating the engine into distinct components yields
an AXI Engine, which drives the different AXI4 channels, receiving and sending multiple
streams of data simultaneously; a Protocol Engine, which drives the FDI and receives data
from it while also taking into account the state of the link through the Die-to-Die Adapter's
control signals; and finally the blocks between them, which we have functionally named
Flit Construction and Flit Deconstruction. Subsections will dive into the details of these
modules and the definitions we have created to serve the data translation process. These
definitions are none other than our own original Flit Architecture: a data packet with distinct
headers and bit fields optimized to service AXI. A complete, block-level overview of our
architecture, with each aforementioned component distinctly marked, can be seen in the block
diagram of Figure 3.1.
     Figure 3.1: Offload Engine Complete Block Diagram
    The key feature of this design and our main challenge while developing it was the serializa-
tion of AXI channel data and maintaining its latency while also maintaining reliability. This
is done by keeping track of the transactions sent on the FDI to the UCIe Link through our
Transaction Table mechanism. We have to accommodate UCIe’s link mechanisms, utilizing
its control signals to apply backpressure to the AXI so that we do not lose any data. A Proto-
col State Machine is needed to mirror FDI’s state machine. Each block applies back pressure
to the block before it with a handshake mechanism. We found this method effective in the
Offload Engine’s functionality. Our Engine also employs various buffers throughout the design
to minimize data loss and accommodate the different data rates. Further contributing to the
quality of service is an added Credit Flow Control, in the Credit Handling block.
    Another strength of the design is its scalability. Key aspects of the engine are parameterized:
functional parameters such as the number and size of the buffers, the AXI data width and
the AXI ID width, but also design characteristics such as the maximum number of outgoing
transactions, the bit fields in the header, and the supported AxSIZE values. This made it easy
to change the design when necessary and makes it customizable for different application
scenarios. The design we ultimately implemented is the result of a series of calculations and
trial-and-error iterations, which we present in the following sections.
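   To make this concrete, a hypothetical parameter package of this kind might look as follows;
the names and values are illustrative only (FLIT_CHUNK = 64 and CREDIT_FIELD = 12 follow
values quoted later in this chapter, while the rest are assumptions).

    // Illustrative collection of design parameters; not the actual parameter list of the RTL.
    package oe_params_pkg;
      parameter int AXI_DATA_W   = 128; // AXI data bus width (assumed)
      parameter int AXI_ID_W     = 4;   // AXI ID width (assumed)
      parameter int FLIT_CHUNK   = 64;  // flit size in bytes
      parameter int MAX_OUTGOING = 8;   // outgoing-transaction throttle / tag space
      parameter int TAG_W        = $clog2(MAX_OUTGOING);
      parameter int CREDIT_FIELD = 12;  // credit field width in the Standard Header
    endpackage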
3.2 Design
   This section provides an in-depth view of the design methodologies used while developing
the offload engine. We will dive into the essential subsystems that make up the offload engine,
which are the AXI Engine, Flit Construct/Deconstruct, and the Protocol Engine. We will
further explain the reasoning behind our design choices, our novel design implementation in
terms of AXI-based flits, and the challenges faced in all the sub-systems.
   The AXI Engine is a crucial foundational component in the development of our offload
engine, acting as the essential interface connecting directly with the AXI bus in the system-on-
chip (SoC). This engine ensures efficient and reliable data transfer between various subsystems
within the chip, fully leveraging the robust AXI protocol. Our AXI engine design consists of
parallel AXI Master and Slave Interfaces, each further divided into dedicated write and read
channels. This approach adheres to the AXI protocol’s principles of parallel processing for
both read and write channels, ensuring optimal performance and reliability.
   Another common mistake in AXI4 implementation is the improper handling of burst trans-
actions. The AXI protocol supports different burst types, such as fixed, incrementing, and
wrapping bursts, each with its specific use cases. Incorrectly configuring burst parameters
can lead to inefficient memory access patterns and degraded performance. To avoid this, our
design meticulously follows the protocol specifications for burst transactions, ensuring that the
appropriate burst type is used for each scenario and that burst lengths and sizes are correctly
configured.
   Managing data alignment is also crucial in AXI4 interfaces. Misaligned data can cause
additional latency and require extra processing to realign the data correctly. Our design
ensures that data is always aligned according to the AXI protocol’s requirements, minimizing
the need for realignment and optimizing data transfer efficiency.
    Handling outstanding transactions is another area where mistakes can easily occur. The
AXI protocol allows multiple outstanding transactions, but this flexibility can lead to issues
if not managed correctly. For example, exceeding the allowable number of outstanding trans-
actions can overwhelm the system, leading to stalls and decreased performance. Our design
incorporates throttling logic to restrict the number of simultaneous transactions, ensuring that
the module processes each transaction efficiently and within the system’s capacity.
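   A minimal sketch of such throttling logic, assuming a simple in-flight counter, is shown
below; the module and signal names are our own and the actual implementation may differ.

    // Outstanding-transaction throttle: a counter of in-flight transactions gates
    // the issue of new addresses once MAX_OUTSTANDING is reached.
    module axi_throttle #(
      parameter int MAX_OUTSTANDING = 8
    ) (
      input  logic clk,
      input  logic rst_n,
      input  logic issue,       // an address handshake completed (new transaction in flight)
      input  logic retire,      // a response handshake completed (transaction finished)
      output logic allow_issue  // high while further transactions may be issued
    );
      logic [$clog2(MAX_OUTSTANDING+1)-1:0] in_flight;

      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)                in_flight <= '0;
        else if (issue && !retire) in_flight <= in_flight + 1'b1;
        else if (retire && !issue) in_flight <= in_flight - 1'b1;
      end

      assign allow_issue = (in_flight < MAX_OUTSTANDING);
    endmodule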
   The AXI master interface within our offload engine is designed to handle high throughput
data transactions by leveraging a robust framework that ensures parallelism and data integrity.
The interface module is structured with independent address and data always blocks, allowing
for parallel processing of address and data signals, thus enhancing overall performance and
reducing latency. Both the AXI Master and Slave Interfaces are designed with separate al-
ways blocks for handling address and data transactions. This separation allows for parallel
processing, enabling the module to issue address signals and transfer data simultaneously.
The address always block manages the generation and processing of address signals within
the offload engine for the AXI bus, handling Address Write (AW) and Address Read (AR)
channels independently. The data always block is responsible for managing the data payload
pushed into the AXI Data bus, handling the Write Data (W) and Read Data (R) channels
separately, ensuring efficient execution of data transfer operations.
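   The fragment below sketches this structure for the write direction only, with one always
block per path; the signal names are illustrative and the real RTL is more elaborate.

    // Structural sketch: independent always blocks for the address (AW) and data (W)
    // paths, each with its own handshake state, so both can progress in parallel.
    module axi_wr_master_sketch (
      input  logic clk,
      input  logic rst_n,
      input  logic aw_pending,  // an address is waiting in the internal buffers
      input  logic w_pending,   // write data is waiting in the internal buffers
      output logic awvalid,
      input  logic awready,
      output logic wvalid,
      input  logic wready
    );
      // Address always block: drives AW independently of the data path.
      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)                  awvalid <= 1'b0;
        else if (awvalid && awready) awvalid <= 1'b0;
        else if (aw_pending)         awvalid <= 1'b1;
      end

      // Data always block: drives W in parallel with the address path.
      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)                wvalid <= 1'b0;
        else if (wvalid && wready) wvalid <= 1'b0;
        else if (w_pending)        wvalid <= 1'b1;
      end
    endmodule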
Incoming transactions reach the AXI interfaces from the Flit Deconstruction module, part of
the Protocol Engine, through its handshaking mechanism.
This ensures that data and address information are correctly synchronized and validated before
being processed further in the AXI bus interface. Once validated and issued from the protocol
side, transactions are passed through to the AXI bus using the AXI protocol’s handshaking
mechanism, involving standard AXI signals to ensure that data is transferred correctly and
acknowledged by the receiving end.
   The AXI Slave Interface includes response logic that differs from the Master Interface,
requiring responses to be sent on different channels. After completing a write transaction,
the slave sends a response on the Write Response (B) channel, which is separate from the
Write Data (W) channel, ensuring that write operations can continue independently of the
response signals. For read operations, the slave calculates the response and sends it along with
the read data on the Read Data (R) channel, ensuring that data and response are delivered
together, simplifying the read transaction process. The addressing logic in the AXI Slave
Interface computes the next address based on control information received from the Address
Write (AW) and Address Read (AR) signals. This logic ensures an accurate determination of
the target address for each transaction, maintaining data integrity and synchronization.
                 Figure 3.2: AXI Interface in the Offload Engine framework
    Early on in the design, it was clear that a Flit specific to our offload engine would have to
be designed. This flit would provide us with all the necessary bit fields to achieve the target
reliability. The flit definition is also necessary for properly decoding the outgoing and incoming data. For this
purpose, we defined 4 distinct flit types which we will present in this section. We will also
present and discuss the modules responsible for encoding and decoding the information to and
from the flits. The total size of the Flit, stored as the parameter FLIT_CHUNK in our design,
stems from calculations we did based on AXI limitations and UCIe definitions. We will also
discuss these in the following section.
Flit Formats
   The two headers for our flits can be seen in Figure 3.3. All types have the Standard Header
at the head of the flit. The AXI header is reserved only for the ARAW flit, and it is essentially
a collection of all the AXI address read/write signals, which have pre-defined widths by the
AXI specification [27]. The Standard Header, however, consists of our own bit fields:
   • TID: Transaction ID. Defines the type of the flit. Its encodings can be seen in Table 3.1.
   • Tag: Unique identifier for a transaction.
   • ID: The associated AXI ID.
   • Credit: The credit handling is explained in detail in Section 3.2.3.
   Except for the AXI signals found in the AXI header, every other bit field’s length is a
parameter which may be altered within the bounds of the total header size.
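   As an illustration, the Standard Header could be described by a packed struct such as the
one below; the field widths are examples only, since in the design they are parameters.

    // Hypothetical packed-struct view of the Standard Header (widths are examples).
    typedef struct packed {
      logic [3:0]  tid;    // flit type, encodings per Table 3.1
      logic [3:0]  tag;    // unique transaction identifier
      logic [3:0]  id;     // associated AXI ID
      logic [11:0] credit; // Credits Allocated, for credit flow control
    } std_header_t;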
   AR/AW Flit
   The structure of the AR/AW Flit can be seen in Figure 3.4.
   From Most Significant Bit to Least Significant, first we use the Standard Header which was
previously defined, then the AXI Header. A single-bit field follows to indicate if the Flit is for
a Read (0) or a Write (1). The rest of the bits are not in use, and they must be 0.
   Payload Calculations for Read and Write Flits
   The AXI standard specifies the format in which read or write data can be transmitted.
The granularity of the transmitted data is defined by AxSIZE, which sets the size of a beat,
while the number of beats is defined by AxLEN. These two fields are 3 and 8 bits wide
respectively. Beyond the limits imposed by their bit widths, there is one more limitation:
the transmitted data cannot surpass 4 KB. Because of this limitation, we know how
many transfers are necessary for a transaction of the maximum size. For a given Standard
Header size, and with the FLIT_CHUNK parameter predetermined by the FDI width, we used
the Sequence field (the order of a flit within its transaction) to calculate how many flits are
needed for every combination of AxLEN and AxSIZE that reaches this maximum. These
numbers are defining for our architecture in that they determine the size of buffers
throughout the entire design. In Table 3.2 we summarize these calculations for FLIT_CHUNK
= 64 Bytes.
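   As a rough sketch of how such an entry can be derived (our own formulation with assumed
symbols, not the exact equations used for the table): if H is the number of header and control
bytes per flit, B = 2^AxSIZE the number of data bytes per beat, and O the per-beat overhead
(RRESP or WSTRB) in bytes, then

    beats_per_flit        = floor((FLIT_CHUNK - H) / (B + O))
    flits_per_transaction = ceil((AxLEN + 1) / beats_per_flit)

for a burst of AxLEN + 1 beats.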
   For this fixed FLIT_CHUNK we note that for AxSIZE cases 7 and 8 the 8-bit Sequence
field is not big enough to properly represent the flits of a maximum-size transaction. This is
why we envisioned the split shown in Figure 3.5, dividing the beat within the flits into
halves for case 7 and into quarters for case 8. However, we did not have time to implement
this mechanism in hardware and instead regard it as an extension of this work.
Figure 3.5: Utilizing the Sequence Field for AxSIZE above 32 bytes.
  R Flit
The structure of the R Flit can be seen in Figure 3.6.
    Besides the Standard Header, the payload is preceded by two more fields, the Sequence field
which holds the order number of the flit in the current transaction, as well as a Start/Stop
bit, the encodings of which are shown in Table 3.3 and remain the same for the W Flit. It
is necessary to have this for the receiving engine to know when a transaction has finished, in
other words, which flit is the last in the transaction.
   The reason that Table 3.2 shows different calculations for Read transactions is that
each beat in a Read flit is accompanied by the 2-bit AXI RRESP, or Read Response, field.
Meanwhile, in a Write Flit, each beat is accompanied by the WSTRB, or write strobe, field,
which is of varying size according to the AWSIZE. This leads to different results when it
comes to the number of beats which can fit in a flit.
  W Flit
The structure of the W Flit can be seen in Figure 3.7.
   The fields are similar to the R Flit, which are explained above.
  B Flit
The structure of the B Flit can be seen in Figure 3.8.
   Besides its Standard Header, it carries the 2-bit BRESP write response. Most of the
space in the flit is unused. At the beginning of the flit design process, there were discussions
on combining different Write Responses in one flit, but ultimately the idea was abandoned
because it would lead to a large delay for individual B Flits as well as unnecessary design
complexity.
                               Figure 3.8: Write Response Flit
the data read from memory along with necessary metadata. In summary, the Flit Construct
Master module effectively translates AXI master signals into structured flits, including B and
R, thereby enhancing the system’s data management and communication efficiency.
   In conclusion, the Flit Construct Slave and Flit Construct Master modules are pivotal to
the effective management of data flow within the system. They handle the detailed aspects of
flit construction, including formatting, tagging, and synchronization, thereby contributing to
the overall efficiency and functionality of the system.
   The Protocol Engine component of the Offload Engine is the part which communicates
with the Die-to-Die Adapter. It has to partake in the Initialization flow and halt the
transmitting/receiving process when the Adapter requires it. It must also allow the AXI engine
to seamlessly transfer data to the AXI bus, delivering the data from the FDI to it without
interruptions and in the correct order. We will present the mechanisms which ensure this in
the following sections.
   Protocol Engine components are grouped into three categories: the Transmitter group, the
Receiver group, and a third category consisting of blocks used by both the incoming and
outgoing flows. To this category belong the Credit Handling, the Transaction Table, and the
Protocol Control Block. We will look at each of these separately.
    It is only fitting that we start with the Protocol Engine’s brain, the Protocol State Machine,
carefully enclosed in the Protocol Control Block. The Protocol Control block is connected with
all of the FDI’s signals, except the main data path and its handshake pins, which are connected
to the Transmitter and Receiver. However, not all signals are driven or used. For example,
the sideband is not in use and is outside of the scope of this thesis. Documentation of the FDI
signals which are relevant to the design will first be presented, along with a short explanation.
Then we shall discuss the Protocol State Machine. The signals are defined in the following
Table 3.5 and Table 3.4.
   Now that all input and output signals have been defined, we can present the Protocol State
Machine. The PSM has been implemented to mirror and comply with the FDI’s state machine,
as was previously shown in Figure 2.15. Fully complying with the specification, all transitions
we implemented in the PSM involve the appropriate signals. In this work, we will not delve
into all state transitions, but rather only focus on the Initialization Flow. The Protocol State
Machine can be seen in Figure 3.9, including all the signals involved in the different transitions.
   1
      We do not implement clock gating for low power; however, we acknowledge it, since the following is
recommended by the spec: implement the handshake logic in your Protocol Layer even if you do not perform
clock gating, and when receiving pl_clk_req, simply acknowledge it with lp_clk_ack after one clock cycle.
This approach ensures compliant behaviour with the Adapter and avoids potential errors.
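   A minimal sketch of this recommended behaviour, assuming the FDI clock lclk, an active-low
reset and no actual clock gating, is:

    // Acknowledge the Adapter's clock request one cycle after it is received.
    always_ff @(posedge lclk or negedge rst_n) begin
      if (!rst_n) lp_clk_ack <= 1'b0;
      else        lp_clk_ack <= pl_clk_req;
    end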
Signal name                 Signal Description
pl_trdy                     Signifies that the Adapter is ready to accept data. Is used in a
                            handshake for the Transmitter.
pl_valid                    Signifies that the Adapter is sending valid data on pl_data. Is
                            connected to the Receiver of our engine.
pl_data        [NBYTES-     Adapter to Protocol Layer data. NBYTES determines the width
1:0][7:0]                   of the FDI interface. NBYTES is directly related to our design
                            parameter FLIT_CHUNK.
pl_flit_cancel               Signifies that the flit should be dumped by the Protocol Layer.
                            This comes one cycle after the first transmission on the FDI, which
                            is why we keep pl_data in an intermediate buffer.
pl_stream[7:0]              Adapter to Protocol Layer that indicates the stream ID to use with
                            data. It has 8 encodings currently in use by UCIe, and each stream
                            ID value maps to a different protocol for Stack 0 or Stack 1. We
                            are only concerned with Streaming Protocol, so depending on the
                            connected Stack the value should be 04h or 14h.
pl_state_sts[3:0]           Shows the state of the Interface state machine. The encodings are:
                            0000b: Reset
                            0001b: Active
                            0011b: Active.PMNAK
                            0100b: L1
                            1000b: L2
                            1001b: LinkReset
                            1010b: LinkError
                            1011b: Retrain
                            1100b: Disabled
pl_protocol[2:0]            Adapter’s indication of the protocol that was negotiated during
                            training. The value we are expecting is 0111b: Streaming protocol
                            without Management Transport.
pl_protocol_flitfmt[3:0]     Signifies the negotiated Flit Format. The value we are expecting
                            is 0001b: Raw Format.
pl_protocol_vld             Indicates that pl_protocol and pl_protocol_flitfmt have valid in-
                            formation. When this is high, those signals must be stored. This
                            signal is also a part of our PSM’s initialization process.
pl_inband_pres              Signifies that the Die-to-Die Link has finished parameter negotia-
                            tion. Part of the Initialization process.
pl_stallreq                 Adapter request to the Protocol to Flush all flits and not prepare
                            any new Flits. Accompanies a transition to a low-power state
                            triggered by the Adapter.
pl_wake_ack                 Signifies that clocks have been ungated in the Adapter. Is part of
                            the Initialization process.
pl_rx_active_req            Signifies that the Protocol should open its Receiver Path to receive
                            new Flits. Is only valid within the Reset, Retrain, or Active state.
pl_clk_req                  Request from the Adapter to the Protocol Layer to remove clock
                            gating from its logic 1 .
Table 3.4: Signals from the Die-to-Die Adapter to our Protocol Engine
Signal name                 Signal Description
lp_valid                    Signifies that the Protocol Layer is sending valid data on lp_data.
                            Is generated from the Transmitter of our engine.
lp_irdy                     Signifies that the Protocol Layer potentially has data to send, is
                            asserted together along with lp_valid.
lp_data        [NBYTES-     Protocol Layer to Adapter data. NBYTES determines the width
1:0][7:0]                   of the FDI interface. NBYTES is directly related to our design
                            parameter FLIT_CHUNK.
lp_stream [7:0]             Protocol Layer indicates to the Adapter the stream ID to use with
                            data. We have tied this signal to 04h: Stack 0, Streaming Protocol.
lp_state_req[3:0]           Protocol Layer request to Adapter to change state.
                            The encodings are as follows:
                            0000b: NOP
                            0001b: Active
                            0100b: L1
                            1000b: L2
                            1001b: LinkReset
                            1011b: Retrain
                            1100b: Disabled
lp_linkerror                Protocol Layer to Adapter indication that an error has occurred
                            which requires the Link to go down, essentially requesting the state
                            to change to LinkError. We have implemented the logic for our
                            design but do not define a use case for it.
lp_rx_active_sts            Response to pl_rx_active_req to show the Receiver is enabled.
lp_wake_req                 Request from the Protocol Layer to the Adapter for it to remove
                            clock gating from its logic.
lp_stallack                 Response to pl_stallreq. Is asserted during the transition to a
                            low-power state.
lp_clk_ack                  Response to pl_clk_req.
Table 3.5: Signals from our Protocol Engine to the Die-to-Die Adapter
                             Figure 3.9: Protocol State Machine
   For the Initialization flow, we have defined three states which aid in the process; they are
determined by the handshakes and signal exchanges that must be completed before the FDI
and our engine can enter the Active state. During the Active state, the Transmitter and
Receiver send and accept flits respectively. The Initialization Flow is as follows:
  1. Adapter notifies the Protocol Layer that parameter exchange has finished
     by asserting the signals pl_protocol_vld and pl_inband_pres. This is the beginning of
     Part 3 of Stage 3 of the entire flow we explained in 2.5.3. These assertions trigger the
     PSM to enter the R_Active_Entry_Wake state in which we raise lp_wake_req and
     request an Active state on lp_state_req.
  2. Adapter has removed clock gating after our request. This is indicated by the
     assertion of pl_wake_ack which triggers our PSM to enter R_Active_Entry, during
     which it waits for the Adapter to reflect Active in pl_state_sts.
  3. Adapter and PSM are both in the Active state after pl_state_sts has switched to
     Active. In this state, credit flow control, the Transmitter, and the Receiver are enabled
     in the Protocol Engine.
   At the end of this flow, known as the FDI Bring-Up Flow, our engine is in Active mode,
sending flits to and receiving flits from the Adapter.
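   A sketch of the initialization portion of such a PSM is given below; it is a fragment that
assumes the FDI signals are ports of the enclosing Protocol Control block, and it omits all
transitions other than the bring-up itself.

    // Bring-up portion of the Protocol State Machine (other transitions omitted).
    typedef enum logic [1:0] {RESET, R_ACTIVE_ENTRY_WAKE, R_ACTIVE_ENTRY, ACTIVE} psm_state_t;
    psm_state_t state;

    localparam logic [3:0] STS_ACTIVE = 4'b0001;  // pl_state_sts encoding for Active
    localparam logic [3:0] REQ_ACTIVE = 4'b0001;  // lp_state_req encoding for Active
    localparam logic [3:0] REQ_NOP    = 4'b0000;

    always_ff @(posedge lclk or negedge rst_n) begin
      if (!rst_n) begin
        state        <= RESET;
        lp_wake_req  <= 1'b0;
        lp_state_req <= REQ_NOP;
      end else begin
        unique case (state)
          RESET: if (pl_protocol_vld && pl_inband_pres) begin
            state        <= R_ACTIVE_ENTRY_WAKE;
            lp_wake_req  <= 1'b1;        // ask the Adapter to ungate its clocks
            lp_state_req <= REQ_ACTIVE;  // request the Active state
          end
          R_ACTIVE_ENTRY_WAKE: if (pl_wake_ack)
            state <= R_ACTIVE_ENTRY;     // wait for Active to be reflected in pl_state_sts
          R_ACTIVE_ENTRY: if (pl_state_sts == STS_ACTIVE)
            state <= ACTIVE;             // credit flow control, Tx and Rx become enabled
          ACTIVE: ;                      // normal operation
        endcase
      end
    end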
The Transmitter
   The existence of the transmitter buffers is inspired by PCIe's similar transmit structure.
These buffers hold flits until transmission is permitted, as a way to prevent data loss.
The signal that enables transmission is a combination of a transmit enable from the PSM, and
a transmit enable from the Credit Handling block presented in section 3.2.3. There are two
buffers in the Transmitter, which have their respective Credit Handling Blocks.
   Transmit Buffer TX_S, connected to the AXI Slave interface Flit Construction Block.
This Buffer stores AW, W, and AR flits. The total size of this buffer is equivalent to the
total size of the AR and WAW Buffers in the Receiver of the Offload Engine in a symmetric
connection: a symmetric connection is the scenario where our Offload Engine is connected to
an Offload Engine with the same buffer sizes and tag spaces. This is the only scenario which
we tested. However, in a theoretical asymmetric connection, the transmit buffers would have
to reflect the size of the remote Offload Engine’s Receiver buffers.
   Similarly, Transmit Buffer TX_M is connected to the AXI Master Interface Flit Con-
struction Block. In this buffer, B and R flits are stored.
        TX_M_SIZE = R_RECEIVER_BUFFERS × MAX_R_TRANSACTION_SIZE + B_RECEIVER_BUFFER_SIZE
        TX_S_SIZE = WAW_RECEIVER_BUFFERS × (MAX_W_TRANSACTION_SIZE + 1) + AR_RECEIVER_BUFFER_SIZE
   For the cases we tested, the buffer sizes used were 516 and 520 flits respectively, reflecting
a maximum transaction size of 128 flits and a tag space/transaction throttle of 8.
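   As a purely illustrative sketch, the localparams below reproduce these expressions; the
individual buffer counts and FIFO depths are assumptions of ours, chosen only so that the
totals match the 516- and 520-flit figures quoted above.

    // Assumed values, e.g. a tag space of 8 split evenly between reads and writes.
    localparam int R_RECEIVER_BUFFERS      = 4;
    localparam int WAW_RECEIVER_BUFFERS    = 4;
    localparam int MAX_R_TRANSACTION_SIZE  = 128;  // flits
    localparam int MAX_W_TRANSACTION_SIZE  = 128;  // flits
    localparam int B_RECEIVER_BUFFER_SIZE  = 4;
    localparam int AR_RECEIVER_BUFFER_SIZE = 4;

    localparam int TX_M_SIZE = R_RECEIVER_BUFFERS * MAX_R_TRANSACTION_SIZE
                             + B_RECEIVER_BUFFER_SIZE;                       // 516 here
    localparam int TX_S_SIZE = WAW_RECEIVER_BUFFERS * (MAX_W_TRANSACTION_SIZE + 1)
                             + AR_RECEIVER_BUFFER_SIZE;                      // 520 here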
   The Transmitter is connected to the FDI. In our design, we worked with an FDI width of
128 bytes to accommodate the raw transmission of two 64-byte flits simultaneously. The
Transmitter signals to the FDI that a new flit is to be transmitted whenever either TX_S or
TX_M has flits to send and transmission for them is enabled. If one transmit buffer is ready
to transmit and the other is not, the empty half is filled with a DEFAULT_DATA pattern,
which is detected by the remote Offload Engine's Receiver and discarded (similar to UCIe's
NOP flit). This again emphasizes the symmetrical connection between two Offload Engines.
   The handshake between the Transmitter and the FDI is defined by UCIe specification and
can be seen in Figure 3.10.
Figure 3.10: Data Transfer between the Transmitter and the FDI, or the Protocol Layer and
the Adapter as shown in UCIe Specification [26].
   In our implementation, lp_irdy is enabled when the Offload Engine is in an Active state
and lp_valid when the Transmitter has valid data to transfer. Once pl_trdy is asserted and
the Adapter has accepted the data, the respective Transmit buffers are updated and so is their
Credit Handling.
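   A sketch of this handshake on the transmit side, assuming the PSM state from the previous
section and an internal tx_flit_avail flag, is:

    // lp_irdy while Active, lp_valid while a flit is staged, buffer pop on acceptance.
    assign lp_irdy  = (state == ACTIVE);
    assign lp_valid = lp_irdy && tx_flit_avail;

    // A flit is consumed, and its credit accounting updated, only when the Adapter
    // accepts it, i.e. when lp_irdy, lp_valid and pl_trdy are all high.
    assign tx_pop = lp_irdy && lp_valid && pl_trdy;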
   The need for a transaction recording mechanism arose from the AXI specification. In AXI,
two transactions with the same ID may be of a completely different type with a completely
different destination. The lack of restrictions on the AXI ID led us to the construction of the
Tagging System.
   The Tagging System’s purpose is to keep track of completed transactions. It is compliant
with the fact that we have a limit to the number of outbound transactions the AXI engine can
manage. From AXI, the only identifier we have for a transaction besides its type is the ID -
which be used by different transactions. There might be a conflict between the transactions
that are being sent from our engine, and also between the transactions that are being sent
and received. To solve this issue we introduce the notion of a tag, a unique identifier for each
transaction sent through the UCIe link. As we saw in Figure XX, every type of flit has the
tag bit field in its header.
   We use the term Transaction Table as a storage for the information of our outbound and
inbound transactions.
   In this table, we store:
   • Tags of outbound and inbound transactions
   • IDs so that transactions from inbound AXI can be identified and tagged
   • ArSIZE, ArLEN so that R transactions can be properly decoded or encoded
   When a transaction is received, it is checked against the Transaction Table before it is
accepted into the Rx buffer. The number of Rx buffers is equal to the number of distinct
tags an offload engine can support, which in turn equals the number of outbound transactions
it can support. The tags are used to address the transaction table, so that searching the
table does not take multiple clock cycles.
   The table is partitioned into four sub-units of storage, and an entry to the transaction table
goes into one of these units depending on if it is inbound or outbound, and Read or Write.
The sub-units are then defined as follows:
   • Tx_R, for outbound Read transactions
   • Tx_W, for outbound Write transactions
   • Rx_R, for inbound Read transactions
   • Rx_W, for inbound Write transactions
   When a transaction is completed, either by AXI data arriving at the Flit Construction
block or by a flit arriving at the Receiver, the entry is erased by setting its dirty bit to 0, so
that the slot, and therefore the tag, can be freed and used again.
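   A hypothetical shape for one Transaction Table sub-unit, indexed directly by tag so that a
lookup takes a single cycle, is sketched below; the field widths and names are illustrative.

    // One sub-unit of the Transaction Table (e.g. Tx_R); Tx_W, Rx_R and Rx_W are analogous.
    localparam int NUM_TAGS = 8;  // assumed tag space

    typedef struct packed {
      logic       dirty;   // 1: entry in use, 0: slot (and tag) free
      logic [3:0] axi_id;  // AXI ID of the transaction
      logic [2:0] arsize;  // stored so that R flits can be decoded/encoded
      logic [7:0] arlen;
    } txn_entry_t;

    txn_entry_t tx_r_table [NUM_TAGS];

    // A lookup is a direct index (tx_r_table[tag]) and therefore takes a single cycle.
    // Completion clears the dirty bit, freeing the slot and hence the tag.
    always_ff @(posedge clk) begin
      if (complete_valid)
        tx_r_table[complete_tag].dirty <= 1'b0;
    end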
   The architecture and form of the Transaction Table arose from the need to keep separate
storage and tag spaces for Write and Read transactions, so that there is no overlap between
their tags. The tag space of one Offload Engine is mirrored in its remote link partner. An
example of different transactions and their completions is shown in Figure 3.11.
Figure 3.11: 3-stage Tagging System Example
  In this example, we show how the system handles multiple transactions with the same ID,
and how they are stored and completed in the Transaction Table.
Credit Handling
  In two instantiations of the same block for the two distinct Transmit buffers, we implement
Credit Flow Control for our Offload Engine. We follow methodologies used in PCIe [30],
which we will explain below.
    Every Credit Control block has a register which counts Credits Consumed. This is the
number of credits consumed in the remote Engine's Receiver Buffers, or in other words, an
estimate of how much data we have sent to that buffer and is still stored in it. In a more
complex system where packets are of varying sizes, different packets can consume different
amounts of credits; here, since our flits are all of the same size, a credit-to-flit ratio of 1 is
sufficient. During reset, the Credits Consumed counter is initialized to 0, and its maximum
limit is TX_M_SIZE or TX_S_SIZE, as described in the Transmitter section 3.2.3. Before a
flit is sent to the FDI, the Cumulative Credits are calculated:
  Cumulative Credits are an updated estimation of the remote partner’s credits after trans-
mission. CREDIT_FIELD is the credit field’s size parameter. The next inequality check is
what enables transmission for the Transmit Buffers:
                        cumulative_credits_required ≥ 2^CREDIT_FIELD / 2
   If it holds, transmission is enabled for the associated buffer. The right-hand side of the
inequality is not 0 because, although the subtraction that produces the cumulative credits is
implemented in 2's complement, the comparison itself is unsigned; a negative cumulative-credits
result would otherwise not be handled correctly by the inequality.
   The Credit handling block holds another counter for Credits Allocated: this is a real
representation of the credit space in the Offload Engine’s Receiver buffers. It is initialized
to either TX_M_SIZE or TX_S_SIZE. Whenever a flit is received from the FDI of the
appropriate type (AW/W/AR or R/B), this amount is decremented. Whenever a flit exits
either the AW/W/AR or R/B buffers and is sent to Flit Deconstruction, the associated Credits
Allocated counter is incremented. This value is available to Flit Construction and is placed
in the Credit field found in the Standard Header of all flits. Outgoing R/B flits carry the
transmitting Offload Engine's Credits Allocated value from Credit Control M, and AW/W/AR
flits carry that of Credit Control S. When received at the remote Offload Engine, these values
are used to update that Engine's Credits Consumed, keeping the estimate true to the actual
state of the remote partner. In PCIe, credit updates are done through
messages. For UCIe it could be done through the sideband, but we do not utilize it in this
implementation.
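   The fragment below sketches the two counters of one Credit Handling instance; it does not
reproduce the exact update equation of the design, and the signal names are assumptions.

    // BUF_SIZE stands for TX_M_SIZE or TX_S_SIZE depending on the instance.
    logic [CREDIT_FIELD-1:0] credits_consumed;   // estimate of credits used in the remote Rx buffers
    logic [CREDIT_FIELD-1:0] credits_allocated;  // real free space in the local Rx buffers

    always_ff @(posedge clk or negedge rst_n) begin
      if (!rst_n) begin
        credits_consumed  <= '0;
        credits_allocated <= CREDIT_FIELD'(BUF_SIZE);
      end else begin
        // Transmit side: each flit sent on the FDI consumes one credit at the remote receiver.
        if (flit_sent)
          credits_consumed <= credits_consumed + 1'b1;
        // Receive side: a stored flit takes space, a drained flit frees it again.
        case ({flit_stored, flit_drained})
          2'b10:   credits_allocated <= credits_allocated - 1'b1;
          2'b01:   credits_allocated <= credits_allocated + 1'b1;
          default: ;
        endcase
      end
    end
    // credits_allocated is exported in the Credit field of outgoing flits; the Credit field of
    // incoming flits refreshes the remote-side estimate behind credits_consumed.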
The Receiver
    The Receiver is the largest block of the entire Protocol Engine. We will explain its function
by going through the different stages that a flit passes through upon arrival from the FDI.
   • Flit Arrives and is stored in the Intermediate Buffer. This allows the Adapter
     to cancel a flit. Flit cancel happens when a piece of the flit does not pass the CRC check
     in the Adapter, and the signal is asserted after the flit chunk has been sent on the FDI.
     An example provided by the UCIe spec is shown in Figure 3.12 where the second half
     of the latter two 64-byte flit chunks do not pass the CRC check. The flit cancel signal
     is asserted again for the first half of the flit to avoid sending the first half twice. This
     mechanism is the reason we store the FDI transmission in an Intermediate buffer.
   • Flit is separated into M and S for Transaction Table Check. The flit is divided
     between AW/W/AR and R/B flits as they were transmitted by the remote Offload
     Engine and stored in two intermediate buffers. Their tags are sent to the Transaction
     Table so a check can be made: if the received flit is a completion to previously transmitted
     flits then the Transaction Table will be updated. Otherwise, if the incoming flit’s tag
     is not in use, it can be recorded in the Transaction Table’s inbound transactions. If it
     does not fit either criterion, the flit has to be discarded. In this stage, the data in the
     intermediate buffers is also checked for validity by comparing it to DEFAULT_DATA
     as we explained in 3.2.3.
   • Transaction Table Check Done. Look-up in the transaction table takes one cycle.
     If the Flit is valid in the tag space, it is allowed to pass through to the Receiver Buffers.
     In the case of an incoming R Flit, the Transaction Table also returns the ArSIZE, and
     ArLEN so that they may be stored alongside the data in the buffers and used in Flit
     Deconstruction to decode the flits.
   • Stored in Flit Buffers. We define 4 types of Receiver Buffers:
        – AR Buffer. Stores incoming AR flits. It is implemented similarly to the Tx buffers,
          as a FIFO (First-In-First-Out).
        – WAW Buffers. Store Incoming AW and W flits. The number of WAW buffers
          is defined by the tag space: each WAW buffer corresponds to one tag value. The
          decision to store the AW Flits along with the corresponding W Flits was to maintain
          reliability. Once the transmission of W data starts towards the AXI engine, it may
          not be interrupted. To maintain this rule we implemented the following mechanism:
          A WAW Buffer is of MAX_W_TRANSACTION + 1 size. In address 0, the AW
          with the specific tag is stored. Then, the incoming W Flits are stored in an address
          according to their Sequence field. Thus, the whole transaction is assembled in order.
          This is to combat any re-ordering that may happen in the Link, but also so that
          the transaction is pushed out properly.
        – R Buffers similarly store the R transaction, utilizing the Sequence field to keep
          the order of transmitted data. Alongside the data, ArLEN and ArSIZE are stored.
        – B Buffers store Write Response Flits, implemented as a FIFO.
      The R and WAW Buffers operate with an FSM. In one state, they are filling up and
      assembling the transaction. When the transaction is completed, they go into a pushout
      state, during which they must push out their data to the corresponding Flit Deconstruc-
      tion Block, uninterrupted.
                           Figure 3.12: Flit Cancel example [26].
    • Flit Deconstruction and Arbitration. For AR and B Flits, it is fairly simple to
      connect them to their corresponding FD blocks. For the R and WAW Buffers, of which
      there are multiple according to the tag field, arbitration must take place. During every
      cycle in the Receiver Engine, the buffers are checked for completed transactions. If they
      hold one, they claim the appropriate FD. The check is done with numerical tag priority,
      starting from tag 0, because the tags are assigned by the Transaction Table in the same
      order, so the priority is mirrored here.
  This concludes the function of the Receiver Engine as flits exit the block and the Protocol
Engine altogether, making their way to Flit Deconstruction and ultimately to the AXI blocks.
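   As an illustration of the tag-priority check described above, the arbitration could be
sketched as follows (NUM_TAGS and txn_complete are assumed names):

    // Among the buffers holding a completed transaction, the lowest tag claims the FD block.
    localparam int NUM_TAGS = 8;                 // assumed tag space
    logic [NUM_TAGS-1:0]         txn_complete;   // one bit per R (or WAW) buffer
    logic                        grant_vld;
    logic [$clog2(NUM_TAGS)-1:0] grant_tag;

    always_comb begin
      grant_vld = 1'b0;
      grant_tag = '0;
      for (int t = 0; t < NUM_TAGS; t++) begin
        if (txn_complete[t]) begin
          grant_vld = 1'b1;
          grant_tag = t[$bits(grant_tag)-1:0];
          break;                               // numerical tag priority, starting from tag 0
        end
      end
    end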
3.3     Challenges
  In this section, we will discuss the design challenges we encountered and the design choices
implemented by looking at early iterations of the major aspects of the design, the Offload
Engine architecture and the AXI-based flits.
   The early iterations of the offload engine focused solely on bridging the functionality of
the offload engine with the AXI protocol framework inside the System-on-Chip (SoC) and the
Universal Chiplet Interconnect Express (UCIe).
   We will first focus on the AXI interface. The early AXI blocks were developed crudely to
cater to the AXI channel signals from the AXI bus. Both the AXI Master interface and the
AXI Slave interface shared the same AXI FSM handshaking mechanism, with a Configuration
Space block coordinating between the two interfaces. The challenge with this design was
that it did not utilize the AXI4 parallel address/data channels for the write and read
transactions. Later iterations of the AXI blocks therefore focused on developing separate AXI
interfaces with independent per-channel AXI handshaking mechanisms, as the AXI4 protocol
demands: the protocol states that write and read transactions have separate address, data
and, in the case of writes, response channels that synchronize their signals but operate
independently, increasing flexibility and scalability and optimizing latency. The AXI Channel
buffers were a further addition to the early iteration of the design to maintain the parallel
processing logic generated by the AXI interface; transactions are stored and maintained in
these buffers until they are moved to the Flit Construct (FC) block, where the AXI-based flit
is created and pushed out to the FDI (Flit-Aware Die-to-Die Interface). These changes to the
AXI interface were crucial in moving the design closer to comprehensive AXI4 protocol support.
    The Protocol Engine has undergone significant change in its development through its several
iterations from the initial one seen in Figure 3.13. The general structure of this protocol engine
was clear early on, borrowing from pre-existing architectures such as PCIe. One of the main
challenges was limiting area by reducing buffer sizes, keeping only what was necessary
while maintaining a coherent structure, meaning that the credit flow control had to reflect
these sizes. Beyond this, we had to solve problems unique to our design case. The AXI
protocol requires data to be delivered in a stream that may not be interrupted. To overcome
this challenge, we constructed two mechanisms: the Transaction Table and the architecture
of the Receiver Buffers. The Transaction Table ensures the completion of AXI transactions,
while the indexed nature of the Receiver buffers makes sure the stream of data towards the
AXI is uninterrupted. Separation of Receiver Buffers according to the type of flit content also
aided in this solution. The reader may refer back to Figure 3.1 and compare it to the initial
design to get a sense of the requirements needed to be met.
   The Flit Construct/Deconstruct blocks are the final components that make up the offload
engine. Both blocks were originally designed around a single input from the AXI interface
side and the protocol side. The development of the AXI interface and the protocol engine, as
described above, created the challenge of not losing the parallel processing logic at the
Flit Construct/Flit Deconstruct stage. This was resolved by creating multiple instances of
the blocks and widening the data path towards the D2D interconnect on the FDI bus.
                         Figure 3.13: Initial Offload Engine Design
   The AXI-based flits were initially developed keeping in mind the types of AXI transactions
and the credit system for the receiver and transmission buffers, which is where the virtual
channel (VC ID) was added to the flit structure. The challenge was managing the overhead-
to-payload ratio as well as the lack of structure, which we had to innovate upon. The first
generation of flits lacked unified Standard and AXI fields, which aid the decoding of the flit.
We overcame these challenges by adding fixed Standard and AXI headers to the flits.
Furthermore, the introduction of the Transaction Table brought further additions to the flit,
such as the Start/Stop bit and the Tag field.
   It was also realized that the sequence number of a flit had no explicit correlation to the
AxLEN. At this time, we made the calculations necessary, given in Table 3.2. After establishing
a strict Standard Header size, we made the appropriate trade-offs within the header to finalize
the flits shown in section 3.2.2.
Chapter 4
Results
  In this chapter, we will first discuss the objective that we are trying to achieve through our
comprehensive verification process. We will then go through the detailed process of verification
from the block-level to system-level verification methods. Additionally, we will present the
supportive scripts which aided in the verification process. This overview aims to explain to
the reader the methods which were used to examine and complete the design.
4.1 Verification
   The Offload Engine serves as the critical integration unit between the System on Chip
(SoC) and the die-to-die package, playing a pivotal role in the chiplet architecture. In the
larger context of the chiplet architecture, the Offload Engine is responsible for replicating
signals from the AXI bus on one side to the other side within a different chip, ensuring
seamless communication and data transfer across the chiplets.
    To validate the functionality and performance of the Offload Engine, we adopted a multi-
tiered verification approach, starting with block-level verification and culminating in system-
level verification.
   The two distinct domains of the Offload Engine led to diverse verification methods through-
out the whole design. Below we present the different methods used for verification in these
two domains, AXI and Protocol, as well as the bridging FC and FD modules.
   In the beginning, AXI block-specific test benches were developed to test the AXI core
blocks. The Advanced eXtensible Interface (AXI) is a complex communication framework within
the SoC which connects multiple subsystems. To test the AXI protocol extensively, a set of
distinct test cases must be implemented to cover all the corner cases. The following cases
were verified for both write and read transactions:
   • Single Address and single beat data transfer
   • Single address and multiple beat data transfer (Burst mode)
   • Multiple addresses and multiple beat data transfer
   • Exhaustive transaction transfer to verify the outstanding transaction logic
  These test benches exhaustively verify the different transactions that can take place in write
and read channels and also the accuracy in timing for the AXI handshaking mechanism.
   The AXI channel buffers are the next unique block in the AXI Engine; they are verified
to correctly take in and maintain the AXI address/data information in the respective channel
FIFO buffers. Data integrity checks are especially important for the read-data and write-data
buffers, as these are arrays of FIFO buffers tracked by an external InUse buffer. The testbench
therefore exercises multiple write and multiple read operations both one after the other and
simultaneously. To avoid losing parts of the data in burst mode, the logic may only access a
transaction's data once its last beat has been stored in the buffer, ensuring no data loss
during the read-out.
   The next stage of test benches encapsulates the AXI Read/Write blocks and AXI buffers for
the AXI Master/Slave Interfaces, to verify the AXI flow from the AXI bus interface to the Flit
Construct block. These test benches were important for driving the large number of AXI
signals through the blocks and for confirming that the design maintains and stores the
information in the respective channel buffers without any external stimulus in the AXI logic.
   The final AXI Engine stage testbench encapsulates the AXI Read/Write blocks, AXI chan-
nel buffers and the Flit Construct (FC) blocks together, generating an end-to-end encapsulation
from AXI signals to AXI flits. This helps establish the reliability of the data transfer through
the FDI for D2D interconnects such as UCIe.
   Testing the various core AXI blocks and building the testbench up to the AXI interface
significantly improved our understanding of how all the blocks function together, leading to
the creation of a robust AXI interface module.
The data of an R or W flit is stored as a list of beat values, which are translated into binary
in binarify() based on the flit's AxSIZE. Utilizing the above, we can generate multiple
transactions of any type. Figure 4.1 shows how the script data is generated and then used in
one of our test benches.
Figure 4.1: How Flit stimulus is generated and used in block level verification
   The script can output FDI-width (128-byte) data for verifying the Receiver Engine, with
DEFAULT_DATA padding if necessary. It can also scramble the order of the flits to simulate
data scrambling on the UCIe Link. In addition, the script can output FLIT_CHUNK-sized
(64-byte) data for verifying the Transaction buffer or Flit Deconstruction blocks. Throughout
the verification process, the flit stimulus generation script became highly versatile.
   Verifying the Receiver
   Verifying the receiver meant verifying all the steps explained in section 3.2.3. Initially, when
the Transaction Table had not been verified yet, it was simulated via a task on the Receiver’s
testbench. Using the flit stimulus generator script, the following cases were verified:
   • Two consecutive Write Data Transactions of a varying number of Flits
   • Two consecutive Read Transactions of a varying number of Flits
   • Multiple B and AR transactions
   • Multiple Read, Write, B and AR Transactions
   • Multiple Read, Write, B and AR Transactions Scrambled
   The payload of the Read and Write transactions was simply an incrementing integer,
starting from 0 and ending at AxLEN; this helped us recognize the proper function of the
Receiver Engine. Gradually building up to more complex cases helped in verifying the individ-
ual blocks of the Receiver, namely the Receiver buffers. Though not verified on the industrial
level, we can now confidently say that the Receiver is equipped to handle multiple transactions
of any type.
   Verifying the Transmitter
   During the process of verifying the transmitter it was important to see two things:
   • Proper Function of the Transmission Buffers, which were implemented using a common
     block called Flit_Buffer.sv as a FIFO. This block is also used in the Receiver Engine for
     AR and B Flits.
   • Proper assertions of control signals to the FDI and Credit Handling
   Once these were ensured, we could move on to the next block, and later on to the verification
of the whole system. For this purpose, we proceeded straight to the multiple-type flit case we
used in the Receiver which proved the correct functionality for the block.
   Verifying the PSM
   For the Protocol Control Block, we focused on the Bring Up Flow, thus, it was necessary
to drive the PSM properly and according to the UCIe Specification as an Adapter would.
Complying with the correct signal transitions as described by UCIe we were able to see the
PSM transition from Reset to Active state and assert the appropriate signals.
   Verifying Credit Handling
   For this Block, it was important to see:
   • Credits being exhausted, and transmission being disabled
   • Proper updating of the Credits Allocated from a remote partner
A simple task-based testbench was needed to toggle the control signals. During this process,
we also evaluated the equations taken from the PCIe specification and understood their
operational details. The relation between credits allocated, credits consumed and the size of
the Transmission Buffers was also evaluated. In an older flit design, the CREDIT_FIELD
parameter was 8, which was not sufficient for the size parameters we were working with; we
therefore changed it to 12, reducing the tag field. After these changes, it was confirmed that
the Credit Handling blocks functioned properly.
   Whole Protocol Engine Verification
   Once all the above blocks were verified, the entire Protocol Engine was put together with
the FD blocks to ensure proper connections within this domain. The output from FDI to FD
was verified. This made for a smoother transition to whole-system verification.
Cross-Domain
Verifying the Transaction Table
   While the transaction table was still being developed, a simulation was needed to truly
understand the different scenarios and the structure of the tagging system. We developed a
transaction emulator to aid us in the development of the Tagging System, which can be seen
in Figure 4.2.
                           Figure 4.2: Transaction Emulator Script
    The Tagging System mechanism, along with the Transaction Table architecture, that we
arrived at is the one shown in Section 3.2.3. To verify this block we developed a testbench in
which different transactions were modelled as tasks, asserting the appropriate control signals
that are input to the Transaction Table. What we focused on seeing was:
   • Different transactions being recorded in the Tx tables, with the Transaction Table
     outputting the tag for the FC
   • Different transactions being recorded on the Rx table, and the Transaction Table out-
     putting the correct information for the Receiver
   • The completion of transactions from both ends, either by inputting an ID from the FC
     or by giving a tag from the Receiver
   The one-cycle operation of the Transaction Table was also crucial, so we took care during
the design to implement the hardware to execute in parallel, so that this property was preserved.
   Verifying The Flit Deconstruction Blocks
   For each Flit Deconstruction Block, a testbench was created to verify the different opera-
tions of the block. For the AR and B FD blocks it is fairly simple to test the transmission of
multiple AR and B flits back to back to the AXI block. Again, the flit stimulus is used. For
the WAW and R blocks, we have to ensure the proper state transitions. Flit stimulus is again
used, taking advantage of previously generated cases for the Receiver engine. Once the proper
function of the FD blocks is ensured, we can move on to system integration of all domains.
   Verifying The Flit Construction Blocks
    Flit Construction blocks are divided into Master/Slave FC blocks for the respective AXI
Interfaces. The Flit Construct block operates as a finite state machine, whose states are IDLE,
Flit-Initiation, Flit-Tag and Flit-Creation; the functionality of these states is explained in
detail in the method chapter. The testbench for the respective Flit Construction blocks
generates stimulus for all five AXI channel buffers and provides the correct Tag stimulus from
the Transaction Table during the Flit-Tag phase, to functionally verify the Flit Construct
blocks comprehensively.
   After all sub-systems were verified using the methods previously described, it was time
to integrate the system and build the whole simulation environment. We decided that since
the function of the Offload Engine was symmetrical, it would be sensible to connect two
instantiations of the Offload Engine in this environment. To simulate the Adapter through
the FDI, a Die-to-Die Adapter module is also necessary to drive the UCIe control signals. The
whole structure can be seen in Figure 4.3.
AXI Driver
   Different AXI channels can be simulated using tasks in the encompassing test bench. Each
Transaction is a toggling of the appropriate signals that are connected to the AXI Domain of
the Offload Engine, and in this way, we can drive them from Device 1 and see them at Device
2’s output. A simulated response can then be driven from Device 2’s AXI Interface.
  For the Bring-Up Flow Driver, a technique from the PSM block verification is used.
The bring-up flow sequence is:
   • Clock cycle 1: pl_inband_pres and pl_protocol_vld are asserted.
   • Clock cycle 2: pl_wake_ack and pl_rx_active_req are asserted.
   • Clock cycle 3: pl_state_sts is set to 0001b, the value for Active.
   This follows the rules for UCIe FDI bring-up flow, defined in the specification [26]. This
Driver is connected to both devices and initiates the flow for them simultaneously.
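   A hypothetical testbench fragment reproducing this three-cycle sequence (assuming the pl_*
signals are testbench variables driven identically into both devices) is:

    // Bring-up driver: three cycles from Reset to Active, per the sequence above.
    initial begin
      pl_inband_pres   = 1'b0;
      pl_protocol_vld  = 1'b0;
      pl_wake_ack      = 1'b0;
      pl_rx_active_req = 1'b0;
      pl_state_sts     = 4'b0000;                            // Reset
      @(posedge lclk);
      pl_inband_pres   <= 1'b1; pl_protocol_vld  <= 1'b1;    // cycle 1
      @(posedge lclk);
      pl_wake_ack      <= 1'b1; pl_rx_active_req <= 1'b1;    // cycle 2
      @(posedge lclk);
      pl_state_sts     <= 4'b0001;                           // cycle 3: Active
    end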
   In the FDI Data Driver component of the Adapter Simulation Module, we need to drive
the data handshake signals for the Transmitter and Receiver of both devices. To simulate the
link, we include two FIFO buffers: one which stores flits from Device 1 to 2, and the other
which stores flits from Device 2 to 1. Once a Flit enters either FIFO, the module sends the
Flit to the receiving device by mounting it on the corresponding pl_data bus and raising the
associated pl_valid. The module will keep accepting flits from either engine so long as the
respective FIFO buffer is not full.
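   The sketch below models one direction of this link (Device 1 to Device 2). The flit width, FIFO depth, and the use of a SystemVerilog queue are simplifying assumptions; the ready indication is registered, so it lags the FIFO state by one cycle, which is acceptable for a behavioural link model.

    // One direction of the link model: flits offered by the transmitting engine
    // are queued and replayed to the receiving engine on pl_data with pl_valid.
    module link_fifo_sketch #(
        parameter int FLIT_W = 256,
        parameter int DEPTH  = 8
    ) (
        input  logic              clk,
        input  logic              rst_n,
        input  logic              lp_valid,   // transmitting engine offers a flit
        input  logic [FLIT_W-1:0] lp_data,
        output logic              lp_irdy,    // asserted while the FIFO has room
        output logic              pl_valid,   // flit presented to the receiving engine
        output logic [FLIT_W-1:0] pl_data
    );
        logic [FLIT_W-1:0] fifo [$];

        always_ff @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                fifo.delete();
                lp_irdy  <= 1'b0;
                pl_valid <= 1'b0;
                pl_data  <= '0;
            end else begin
                // Accept a flit from the transmitter while there is room.
                if (lp_valid && lp_irdy)
                    fifo.push_back(lp_data);
                // Present the oldest stored flit to the receiver.
                if (fifo.size() > 0) begin
                    pl_data  <= fifo.pop_front();
                    pl_valid <= 1'b1;
                end else begin
                    pl_valid <= 1'b0;
                end
                lp_irdy <= (fifo.size() < DEPTH);
            end
        end
    endmodule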
4.1.3 Results
   In this work, we focused on the functionality of the design and tested the mechanisms we developed for the architecture; for this purpose, a simulation of the design is sufficient. A full power and area characterization would require a complete synthesis and implementation flow, which goes beyond the scope of our work, although a raw synthesis is presented later for a utilization estimate. We can, however, present the latency and throughput of the Offload Engine.
   The latency of the Offload Engine may be measured using the simplest type of completion, an AR-R completion: the transmission of an AR transaction from Device 1 to Device 2 and the R data response from Device 2 back to Device 1. The results of this simulation can be seen in Figure 4.4.
   FDI Bring-Up takes 3 cycles. From then on, an AR transaction is mounted on the Slave AXI AR channel of Device 1.
   • Within 10 cycles it passes through Device 1 and is handed to its Adapter.
   • After arriving at PL_DATA of Device 2, it appears on Device 2's AXI output after 12 cycles.
   Device 2 then receives the R DATA on its AXI Master Read interface; it consists of 5 beats.
   • In 16 cycles it passes through Device 2 and is sent on the FDI. If we do not account for
     the 4 cycles it takes for the multiple beats of the transaction, the latency of this stage is
     12 cycles.
   • Upon arrival from the Adapter, it takes 16 cycles to finally arrive at the output of Device 1, the AXI Slave Read channel.
   In total for this scenario we have a latency of 50 cycles. A breakdown of the latency
for the transmission and reception of an AR transaction is shown in Table 4.1.
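   Reading the per-stage figures above together, the reported total appears to exclude both the three bring-up cycles and the four extra cycles spent on the additional R beats:

       10 + 12 + (16 − 4) + 16 = 50 cycles.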
   It is important to note that latency depends on the size and type of the transaction, as well as on the state of the buffers. The Receiver typically takes the highest number of cycles due to its reliability mechanisms, the flit-cancel handling from the Adapter, and the tag check.
   During burst mode, the module maintains the data throughput seen on the AXI side, since the data stream is not interrupted.
   Examining the latency and throughput figures in Table 4.1, we can observe that both are maintained across the Offload Engine. The latency of the design depends on the buffer state and the transaction types; it can be improved through further development of the AXI flits and more optimal utilization of the buffers. The data throughput is maintained in the sense that AXI data passes through the Offload Engine with no loss, and it can be further improved by introducing a pipelined architecture in the Flit Construction blocks, allowing multiple transactions to be constructed simultaneously.
Synthesis Results
   To get a sense of the size of our design, one instantiation of the Offload Engine was synthesized using Vivado for the Zynq UltraScale+ MPSoC ZCU104, which houses an FPGA with enough resources to accommodate our design. Since we did not map the outputs to FPGA pins, we ran the synthesis in out-of-context mode to prevent the tool from optimizing away our logic. The resulting utilization for one Offload Engine is shown in Figure 4.5.
                        Figure 4.5: Utilization of one Offload Engine
   It is sensible that the Protocol Engine houses the most logic blocks, given the size of its arbitration logic and the Receiver buffers. This is also reflected in the latency of the Receiver.
   This raw synthesis gives a sense of how large the Offload Engine is and which parts consume the most area. Utilization is highest in the Protocol Engine because of its extensive Receiver buffers. In terms of improving area utilization, buffer structures such as the Write Response buffers are not used to their full extent; redesigning the Receiver buffers to accommodate multiple transaction types in a single buffer structure would reduce wasted space and, for our design, could reduce the buffer cost by up to 20 percent, since Write Responses account for one fifth of all transaction types.
   From the timing analysis, we also know the critical path of our design, which lies in payload
creation in the Flit Construct block.
Chapter 5
Conclusion
   In this chapter, we discuss the limitations of our Offload Engine design, followed by possible improvements. By covering the limitations, possible improvements, and other important considerations, this chapter aims to provide a comprehensive understanding of the current challenges and of the opportunities to integrate the Offload Engine into larger architectures.
5.1    Limitations
The Offload Engine is developed as a starting point for AXI-based flit data transfer over a D2D framework. It is crucial to acknowledge its limitations, as this is essential for setting realistic expectations and identifying potential improvements in the system. The following are the constraints of our design:
   • Flit Construct Bottleneck: The design contains two instances of the Flit Construct block, one per AXI Master/Slave interface, so that both interfaces can generate flits simultaneously. However, each AXI interface contains several channels that must take turns creating flits, since there are 5 AXI channels but only 2 Flit Construct (FC) instances.
   • Limited AxSIZE support: The current implementation of the Offload Engine only supports transfer sizes (AxSIZE) of 1, 2 and 4 bytes.
   • Symmetric-Only Connection: Two Offload Engines that are connected are mandated
     to have the same tag spaces, the same outstanding transaction limit, and the same
     allocated credits. Though theoretically possible, different tag spaces are not supported.
5.2    Possible Improvements
   • Flit Construct Pipelining: The Flit Construct blocks run on a finite-state model in which a flit is generated only after the module has passed through all the states. This can hamper the data throughput when multiple large bursts need to be turned into flits. The limitation can be removed by breaking the state machine into a pipeline, thus maintaining parallel processing throughout the AXI domain; a rough sketch of this idea is given after this list.
   • Low-power mode for both the AXI Engine and Protocol Engine: With the implementation of clock gating, the Adapter can request entry into a low-power state, during which AXI transactions may be halted.
   • Making use of UCIe control signals and different states: We can request a
     LinkError state from the Adapter. This may be useful if, for instance, an AXI module
     disconnects.
   • Larger AxSIZE support: For beat sizes under 32 bytes, not much has to change; a small addition to the FD and FC blocks is sufficient. For beat sizes above 32 bytes, a whole beat no longer fits inside a flit. A brief discussion of how to handle this is given in section 3.2.2: each beat would have to be divided in FC and reassembled in FD.
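   As a rough illustration of the Flit Construct pipelining idea mentioned above, the sketch below splits header preparation, tag attachment, and payload assembly into three registered stages so that a new transaction can enter every cycle. The stage split, widths, and names are assumptions made purely for illustration and do not correspond to implemented RTL.

    // Rough three-stage pipeline sketch for flit construction: header prep,
    // tag attach, payload assembly. All widths and names are illustrative only.
    module fc_pipeline_sketch #(
        parameter int HDR_W  = 64,
        parameter int TAG_W  = 8,
        parameter int DATA_W = 256
    ) (
        input  logic              clk,
        input  logic              rst_n,
        input  logic              req_valid,   // a buffered AXI transaction is ready
        input  logic [HDR_W-1:0]  req_header,
        input  logic [TAG_W-1:0]  tag_in,      // tag supplied by the Transaction Table
        input  logic [DATA_W-1:0] payload_in,  // data gathered from the channel buffers
        output logic              flit_valid,
        output logic [HDR_W+TAG_W+DATA_W-1:0] flit_out
    );
        // Stage 1 latches the header, stage 2 attaches the tag,
        // stage 3 appends the payload; three flits can be in flight at once.
        logic                   v1, v2;
        logic [HDR_W-1:0]       hdr1;
        logic [HDR_W+TAG_W-1:0] hdr_tag2;

        always_ff @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                v1         <= 1'b0;
                v2         <= 1'b0;
                flit_valid <= 1'b0;
                hdr1       <= '0;
                hdr_tag2   <= '0;
                flit_out   <= '0;
            end else begin
                v1   <= req_valid;
                hdr1 <= req_header;

                v2       <= v1;
                hdr_tag2 <= {hdr1, tag_in};

                flit_valid <= v2;
                flit_out   <= {hdr_tag2, payload_in};
            end
        end
    endmodule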
5.3 Conclusion
   In essence, the objective of this thesis was to build an Offload Engine that connects the AXI protocol used within the SoC to a D2D interconnect, in this case UCIe. The background research included a detailed understanding of the Advanced eXtensible Interface (AXI) protocol across SoC subsystems, the concept of Die-to-Die systems, and the development of the growing Universal Chiplet Interconnect Express (UCIe) as the new D2D interconnect standard.
   The work in designing the Offload Engine provides a novel method of transferring AXI signals through a D2D interconnect, where the AXI-based flits integrate seamlessly with both the AXI and the UCIe specifications. Our design helps bridge the gap between chiplets by replicating the AXI framework across multiple chiplets. Throughout the design process we strived to balance:
   • Complexity, by setting a limit on outstanding transactions.
   • Latency, by allowing two flits to be constructed and sent simultaneously.
   • Size, by restricting our buffer sizes to a sensible, optimized amount; the buffers are the main constraint on area utilization.
   Through the development of the Offload Engine, the design showed a heavy dependency on address and data buffers, which hold valid data before transfer to maintain integrity throughout the system. With pipelining optimizations the data throughput can be maintained, and the latency would ultimately depend only on the physical-layer transfer. The design journey and the extensive testing conclude with data throughput being maintained for burst-mode transactions across the dies, and with an initial latency in the range of 20-30 clock cycles.
   As a final note, a direction for future work on the Offload Engine is to further utilize the capabilities of UCIe. Integrating the UCIe D2D Adapter with the Offload Engine would add more control over the data transfer, implementing Link State Management, Parameter Negotiation, and the CRC/Retry mechanisms to further optimize the Offload Engine.
Bibliography
[12] Apple Computer Inc., “AppleTalk Session Protocol,” 1985. [Online]. Available: https:
     //developer.apple.com/library/archive/documentation/mac/pdf/Networking/ASP.pdf
[13] “User Datagram Protocol,” RFC 768,            Aug. 1980. [Online]. Available:       https:
     //www.rfc-editor.org/info/rfc768
[14] W. Eddy, “Transmission Control Protocol (TCP),” RFC 9293, Aug. 2022. [Online].
     Available: https://www.rfc-editor.org/info/rfc9293
[15] A. S. Tanenbaum and D. Wetherall, Computer Networks, 5th ed. Boston: Prentice
     Hall, 2011. [Online]. Available: https://www.safaribooksonline.com/library/
     view/computer-networks-fifth/9780133485936/
[16] V. Cerf and R. Kahn, “A Protocol for Packet Network Intercommunication,” IEEE Trans-
     actions on Communications, vol. 22, no. 5, pp. 637–648, 1974.
[17] G. E. Moore, “Cramming more components onto integrated circuits. Reprinted from
     Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff.,” IEEE Solid-State Circuits
     Society Newsletter, vol. 11, no. 3, pp. 33–35, 2006.
[18] R. Ratnesh, A. Goel, G. Kaushik, H. Garg, Chandan, M. Singh, and B. Prasad,
     “Advancement and challenges in MOSFET scaling,” Materials Science in Semiconductor
     Processing, vol. 134, p. 106002, 2021. [Online]. Available: https://www.sciencedirect.
     com/science/article/pii/S1369800121003498
[19] International Technology Roadmap for Semiconductors, “2011 Executive Summary,”
     2011. [Online]. Available: https://www.semiconductors.org/wp-content/uploads/2018/
     08/2011ExecSum.pdf
[20] J. Lau, Heterogeneous Integrations. Springer, 01 2019.
[21] Microsystems Technology Office of the Defense Advanced Research Projects Agency,
     “Broad Agency Announcement: Common Heterogeneous Integration and IP Reuse Strate-
     gies (CHIPS),” https://viterbi.usc.edu/links/webuploads/DARPA%20BAA%2016-62.
     pdf, 9 2016.
[22] Open Domain-Specific Architecture (ODSA) Consortium, “Open High Bandwidth
     Interface (OHBI) Specification version 1.0,” 2021. [Online]. Available: https:
     //www.opencompute.org/documents/odsa-openhbi-v1-0-spec-rc-final-1-pdf
[23] ——, “Bunch of Wires (BoW) PHY Specification Version 1.9,” 2023. [Online]. Available:
     https://opencomputeproject.github.io/ODSA-BoW/bow_specification.html
[24] “Advanced Interface Bus Specification.” [Online]. Available:           https://github.com/
     chipsalliance/AIB-specification
[25] “Universal Chiplet Interconnect Express homepage.” [Online]. Available:             https:
     //www.uciexpress.org/
[26] “Universal Chiplet Interconnect Express Specification 2.0.”
[27] ARM Limited, “AMBA AXI Protocol Specification,” 2003. [Online]. Available: https://developer.arm.com/
     documentation/ihi0022/latest/9
[28] C. Ma, Z. Liu, and X. Ma, “Design and implementation of APB bridge based on AMBA
     4.0,” in 2011 International Conference on Consumer Electronics, Communications and
     Networks (CECNet), 2011, pp. 193–196.
[29] N. Dorairaj, D. Kehlet, F. Sheikh, J. Zhang, Y. Huang, and S. Wang, “Open-source AXI4
     adapters for chiplet architectures,” in 2023 IEEE Custom Integrated Circuits Conference
     (CICC), April 2023, pp. 1–5.
[30] R. Budruk, D. Anderson, and E. Solari, PCI Express System Architecture.        Pearson
     Education, 2003.