Concepts¶
This page is the DAQIRI glossary. It defines the terms used across the API Guide, Configuration Reference, and tutorials: stream types and protocols, GPUDirect, packet / burst / segment, flow / queue, memory region, zero-copy ownership, and RX reorder.
Stream Types¶
DAQIRI exposes a single C++ API on top of several packet-I/O stacks. The choice is configured per-application in YAML by two keys:
stream_type— the I/O stack family.protocol— required whenstream_type: "socket"; selects the socket-level protocol.
The shipped Ethernet stream types use NICs as their hardware endpoint. The planned PCIe programmable-sensor path uses the same DAQIRI model for devices that sit directly on the PCIe bus, such as FPGAs, frame grabbers, or custom acquisition cards.
Raw Ethernet¶
YAML: stream_type: "raw".
Kernel-bypass raw Ethernet. The application talks directly to NIC ring buffers in user space, skipping the Linux network stack entirely. This is the highest-performance path and the only one with hardware flow steering (see Flows below). Currently implemented on top of DPDK; the DPDK dependency is an implementation detail, not a user-facing concept.
Requires an NVIDIA SmartNIC (ConnectX-6 Dx or later).
Socket¶
YAML: stream_type: "socket". The specific transport is chosen by
protocol:
protocol: "udp"/protocol: "tcp"— Linux kernel UDP and TCP sockets. No NIC privileges required, no special hardware. Useful as a comparison baseline against the kernel-bypass paths and as a way to get first results on a system without an NVIDIA NIC.protocol: "roce"— RDMA over Converged Ethernet, using the open-sourcerdma-corelibrary. A server/client connection model, NIC-level reliable transport (RC), and in-order delivery. Primarily intended for workloads where one endpoint is a third-party device (an FPGA, an instrument, or another customer-supplied black box) that already speaks RoCE. When both peers run DAQIRI, prefer an upper-layer library such as MPI, NCCL, or UCX rather than wiring RoCE directly.
PCIe (future)¶
YAML: stream_type: "pcie".
Coming-soon path for sensors that appear directly on the PCIe bus, such as FPGAs, frame grabbers, or custom acquisition cards. The goal is to move data into or out of CPU or NVIDIA GPU memory through the same DAQIRI C++/Python API while avoiding unnecessary copies. This stream type does not currently ship with a runnable benchmark or example YAML.
Choosing a stream type¶
The right choice depends on packet size, batch size, latency target, whether you need ordering or hardware reliability, and what the other end of the link looks like. DAQIRI's job is to make swapping among them a configuration change rather than a code change.
For a use-case-driven decision tree (baseline throughput, GPU reorder, header-data split, multi-queue flow steering, packet recording, RDMA, sockets), see Choosing an example config in the configuration walkthrough.
Support and testing
The DAQIRI library integration testing infrastructure is under active development. As such:
- Raw Ethernet (
stream_type: "raw") is supported, distributed with the DAQIRI library, and is the only stream type actively tested at this time. - Socket — UDP / TCP (
stream_type: "socket",protocol: "udp"/"tcp") is supported and distributed; integration testing is under development. - Socket — RoCE (
stream_type: "socket",protocol: "roce") is supported and distributed; integration testing is under development. - The PCIe programmable-sensor path is under development.
GPUDirect¶
GPUDirect allows the NIC to read and write data from/to a GPU without staging it through system memory first. That decreases CPU overhead and significantly reduces latency. An implementation of GPUDirect is supported by every DAQIRI stream type.
The two paths look like this:
flowchart LR
subgraph withGPUDirect [With GPUDirect]
nicA[NIC] -->|"PCIe peer-to-peer DMA"| gpuA[GPU memory]
end
subgraph withoutGPUDirect [Without GPUDirect]
nicB[NIC] -->|"DMA"| cpuB[CPU staging buffer] -->|"cudaMemcpy"| gpuB[GPU memory]
end
The GPUDirect path skips the CPU-side staging buffer and the
cudaMemcpy that goes with it.
Warning
GPUDirect is only supported on RTX GPUs and Data Center GPUs. It is not supported on GeForce cards.
How does that relate to peermem or dma-buf?
There are two kernel interfaces to enable GPUDirect:
- The
nvidia-peermemkernel module, distributed with the NVIDIA DKMS GPU drivers.- Supported on Ubuntu kernels 5.4+, deprecated starting with kernel 6.8.
- Supported on NVIDIA-optimized Linux kernels, including IGX OS and DGX OS.
- Supported by all MOFED drivers (requires rebuilding
nvidia-dkmsdrivers afterwards).
DMA Buf, supported on Linux kernels 5.12+ with NVIDIA open-source drivers 515+ and CUDA toolkit 11.7+.
The DPDK that ships in the DAQIRI container is patched with dma-buf
support, so the nvidia-peermem kernel module is not required
inside the container.
For step-by-step system setup, see the System Configuration tutorial.
Packets, Bursts, and Segments¶
DAQIRI is a batch processing library. Packets are received from DAQIRI and sent to DAQIRI in batches called bursts. Larger bursts can increase throughput at the expense of latency; smaller bursts decrease latency but cap total throughput because of the per-burst processing overhead. The terms below appear throughout the API, configuration, and code paths.
Packet¶
A packet is a single, contiguous block of memory representing
either received data or data to transmit. Packets can be far larger
than an Ethernet MTU in some cases (for example with protocol: "roce"
or protocol: "tcp"/"udp"); the underlying stack fragments and
reassembles them on the wire transparently.
Burst (BurstParams)¶
A burst is the metadata container DAQIRI uses to describe a batch
of packets being transmitted or received. The C++ type for a burst is
BurstParams. It is intentionally opaque — applications use helper
functions (get_packet_ptr, get_packet_length, get_num_packets,
...) to inspect or modify it rather than touching its fields directly.
Segment¶
A segment is one contiguous memory region inside a packet. A packet can have one segment or multiple segments:
- Single segment: the whole packet fills one contiguous region.
- Multiple segments: each segment is assigned to a different memory region. The memory regions can be of any kind (CPU or GPU) in any order. A common use case is header-data split (HDS) below.
Header-Data Split (HDS)¶
Header-data split is the canonical multi-segment configuration: headers go to CPU memory (segment 0), payload goes to GPU memory (segment 1). This keeps the GPU payload path zero-copy for downstream GPU workloads while still letting the CPU parse and steer on the headers.
Use HDS when the application needs to inspect headers (UDP source/destination ports, application-layer sequence numbers, etc.) but the bulk of the data is meant for the GPU.
Flows and Queues¶
These two terms describe how packets are routed from the wire into the right application buffer.
Queue¶
A queue is a NIC-side buffer that an application reads from (RX) or writes to (TX). Each queue is assigned a CPU core in the YAML to poll it, but queues and cores are not strictly one-to-one: one core can poll multiple queues, and a TX queue can be fed from multiple cores.
A queue points at one or more memory regions, which hold its packet buffers (CPU hugepages, GPU device memory, or pinned host memory).
Flow¶
A flow is a match pattern paired with an action. The common action is to steer matching packets into a specific queue. For example, all UDP-destination-port-4096 packets can be routed into a queue backed by GPU memory. Matching and the resulting action both run entirely in NIC hardware.
Flow rules are only available in Raw Ethernet (stream_type: "raw").
A flow's match can combine fields such as udp_src, udp_dst, and
ipv4_len; multiple flows can target the same queue, and the matching
flow's ID is available at runtime so the application can distinguish
them. Flows are configured under rx.flows in the YAML.
Flow Steering¶
Flow steering is the NIC-level mechanism that classifies an incoming packet against the configured flows and writes it into the matching queue's buffer, entirely in hardware. Multi-queue RX works by routing each flow to a separate queue for parallel processing.
For Raw Ethernet, flow steering is implemented on top of RTE Flow. The YAML options are documented in Configuration YAML Reference → Flows.
Memory Regions¶
A memory region is a named pool of buffers where packet data lives. Memory regions are declared at the top of the YAML and referenced by name from each queue.
The kind of a memory region determines whether packet data ends up on the CPU or the GPU:
huge: CPU hugepages (recommended for CPU buffers).device: GPU VRAM (discrete GPUs; requires GPUDirect via peermem or DMA-BUF).host_pinned: pinned CPU pages allocated viacudaHostAlloc. Recommended on integrated GPUs (NVIDIA GB10 / DGX Spark), where the NIC cannot peer-DMA into device memory.host: regular CPU memory (not recommended for hot paths).
The size of the memory region (buf_size) dictates the largest
contiguous chunk that can be stored in a single segment. For example,
with a 60-byte region the first 60 bytes of each packet land in that
segment before the remainder spills into the next region in the
queue's list. Region buffers can be much larger than a single Ethernet
frame for fragmented transports (for example, protocol: "roce").
Combining memory regions on a single queue is how header-data split
is expressed in the YAML: queue 0's first memory region is a huge CPU
pool (for headers, segment 0); its second region is a device GPU pool
(for payload, segment 1).
Zero-Copy Ownership¶
DAQIRI is designed around zero-copy packet delivery. When a receive API returns packet data, the application is reading the buffers the NIC DMA'd into; the API passes pointers and metadata, not copies.
That zero-copy model makes buffer release part of the API contract.
Applications must free RX bursts after processing and free or send TX
bursts after allocation. Holding bursts indefinitely drains DAQIRI's
buffer pools and can lead to NO_FREE_BURST_BUFFERS,
NO_FREE_PACKET_BUFFERS, queue drops, or stalled TX.
When to call each free_* function is documented in the
C++ API Usage page.
RX Packet Aggregation and Reorder¶
DAQIRI can perform GPU- or CPU-side packet aggregation and reordering
on RX through rx.reorder_configs:
- GPU reorder configs copy selected packet payloads into a configured output memory region and deliver the result as one reordered aggregate burst.
- CPU reorder configs provide the same aggregate-burst model for CPU-addressable packet and output memory.
This is the path to use when packets arrive out of order (e.g. across multiple NIC queues) and need to be reassembled into a single, contiguous GPU buffer before downstream processing.
Each reorder config currently operates on a single memory domain, either GPU-only or CPU-only. Reordering packets whose segments span two memory regions (for example, an HDS pair with CPU-side headers and GPU-side payloads) is not yet supported but is planned.
See Configuration YAML Reference → RX Reorder Configs for the configuration constraints and C++ API Usage → Reordered RX bursts for how to consume them from C++.
See also¶
- API Guide: the 6-step DAQIRI application lifecycle, with links into the language API.
- Configuration YAML Reference: every YAML key, its type, and its valid values.
- C++ API Usage: initialization, RX/TX, file writes, utilities, and the C++ function reference.
- System Configuration tutorial: the hardware and OS setup the concepts above depend on.