Mazu is a bare-metal RISC-V 64-bit hard RTOS that combines Linux kernel discipline with Plan 9 philosophy in a system small enough to read end to end. SMP correctness, hard real-time scheduling, and kernel-integrated networking are not bolted on after the fact -- they shape every data structure and code path from the start.
Unlike RTOSes that treat networking as an optional middleware layer and SMP as a bolt-on configuration flag, Mazu takes the opposite position: a connected embedded system needs bounded-latency scheduling, per-CPU execution paths, and a TCP/IP stack that respects both, all in the same address space, all under the same lock discipline. The kernel serves REST APIs and runs a web-based shell as ordinary preemptible tasks alongside deadline-scheduled control work.
Two design lineages run through the codebase:
- From Linux: subsystem modularity (initcall registration, irqchip vtables,
  IRQ descriptor tables, waitqueues, lockdep), synchronization primitives
  (priority-inheritance mutexes, futexes with PI and requeue, counting
  semaphores with direct handover), buddy allocator, per-CPU data via the
  gp register, and the convention that every subsystem is SMP-safe or
  explicitly documented otherwise.
- From Plan 9: the "everything is a file" control plane. Synthetic
  filesystems (/dev, /proc, /net) expose hardware, process state, and
  network tables as readable files -- no ioctl, no sysfs, no procfs
  special-case parsers. System observability comes from cat /net/tcp/stats,
  not a dedicated monitoring daemon.
What Mazu does not import from either lineage is equally deliberate: no
loadable modules, no virtual memory isolation between tasks, no VFS page
cache, no socket API. The kernel runs all tasks in a single shared page
table (identity-mapped kernel space, shared user mappings at fixed VAs) with
VMA-based access control, and the networking API is a direct function
interface rather than a Berkeley sockets layer. Disk-backed SFS has its own
block buffer cache (kernel/fs/bcache.c); the synthetic and RAM filesystems
are uncached because their data is either memory-resident or generated on
demand. These choices keep the system small and auditable.
PSE51 framing: Mazu implements a bounded PSE51-oriented userspace core with
deliberate filesystem and multi-process supersets. PSE51 itself is a
single-process, threaded, no-filesystem profile; Mazu ships a real
filesystem, SYS_SPAWN / SYS_WAIT, and multiple PIDs by design, so the
honest top-level framing of the user-visible environment is closer to
PSE52 (Realtime Controller System Profile). The kernel-level primitives
that back PSE51-facing syscalls (PI mutexes, condvars, semaphores,
futexes, barriers, rwlocks, message queues, POSIX timers) are already
in place. Per-syscall conformance status, including which entries use a
Mazu-specific ABI shape rather than the exact POSIX shape, is tracked in
docs/pse51-matrix.md.
Core profile:
- Hard-RT scheduling: mandatory kernel preemption, SMP per-CPU run queues, EEVDF fairness, EDF deadline scheduling with admission control, mixed-criticality domains, load balancing, and scheduling domains with budget enforcement
- SMP by design: per-hart state via the gp register, per-CPU run queues and merged deadline management, lockdep lock-ordering enforcement, cache-line-aligned per-CPU structures to eliminate false sharing
- Kernel-integrated networking: IPv4, TCP (Reno CC, SACK, RTT estimation, connection pooling, per-IP flood limits), optional UDP/DHCP/mDNS, outbound client connections, HTTP/1.1 server with REST endpoints, WebSocket, and SSE -- all running as preemptible scheduler tasks
- Plan 9-style VFS: synthetic /dev (null, zero, console, time), /proc (meminfo, uptime, cpuinfo), /net (arp, iface, tcp/stats) alongside a RAM filesystem with optional writable and virtio-blk paths
- Linux-grade synchronization: PI mutexes with direct handover, condition variables, counting semaphores, futexes (WAIT/WAKE/CMP_REQUEUE/LOCK_PI/UNLOCK_PI)
- Type-driven safety: length-prefixed fat strings (never null-terminated), macro-generated result types (errors never share the return-value space), read-only/read-write/appendable buffer types encoding mutability in the type system
- Memory: buddy allocator for pages, pool allocators for fixed-size objects (no external fragmentation), arena allocators for request-scoped temporaries (no per-object allocation headers), pluggable allocator vtable
- Kernel-user isolation: W^X, VMA-based user-pointer containment validation, per-process syscall allow-list, kernel-stack guard pages, stack-protector canaries
- Debug and verification: lockdep, scheduler invariant checks on every context switch, callout lateness histograms, self-test framework, UBSan trap mode, static analysis via clang
- QEMU virt machine: virtio-mmio devices, PLIC, OpenSBI, Sv39 paging (identity-mapped, 2 MiB superpages with on-demand shattering)
Primary development target is QEMU on Linux with standard tooling (make,
python3, RISC-V cross toolchain, QEMU).
A planned use case is running an embedded AI assistant (similar to MimiClaw) directly on the kernel. This requires outbound TCP client connections, TLS or a proxy relay strategy, JSON request/response parsing, streaming responses (SSE or WebSocket), and persistent state management.
Quick start (SLIRP networking, local host access):
make defconfig # generate .config from configs/defconfig
DEBUG=2 make run # build and launch in QEMU (SLIRP, http://localhost:8080)

TAP networking (guest at 192.168.100.2, Linux + iptables required):
export IF=eth0 # outward-facing host interface
./scripts/setup_vm_network.sh
TAP=1 make run

Runtime content model:
- rootfs/ is packed into the kernel image at build time by scripts/archive.py.
- rootfs/config.txt provides runtime network settings (used when DHCP is disabled/fails).
- rootfs/web/ is the HTTP document root (static assets and web UI).
- rootfs/hello.txt is printed during boot.
The web server (user/net/web.c) provides both static file serving and dynamic REST API
endpoints. Current API surface:
| Endpoint | Method | Description |
|---|---|---|
| /api/stats | GET | Kernel stats (tasks, IRQs, memory, scheduler, callout, security) as JSON |
| /api/tcp | GET | TCP connection table (including cwnd/ssthresh per connection) as JSON |
| /api/arp | GET | ARP table as JSON |
| /api/klog | GET | Kernel log ring buffer as JSON |
| /api/fs?path=X | GET | Directory listing as JSON |
| /api/fs/read?path=X | GET | File content as text/plain |
| /api/shell/in | GET/POST | Web terminal: create session / submit command |
| /api/shell/out | GET | Web terminal: read output (polling) |
| /api/sse/test | GET | SSE test endpoint (chunked transfer encoding) |
WebSocket upgrade is supported for real-time communication (e.g., terminal streaming).
Additional MIME types and API endpoints can be added in user/net/web.c.
Build/runtime knobs:
- DEBUG controls kernel log verbosity:
  - 0: warnings and errors
  - 1: info
  - 2: debug
  - 3: verbose
- RELEASE=1 enables optimized builds.
A practical default for development is DEBUG=2 with RELEASE unset.
Mazu uses a Kconfiglib-based
configuration system (the same Kconfig language used by the Linux kernel).
The configuration schema lives in configs/Kconfig.
make config # interactive menuconfig TUI
make defconfig # apply configs/defconfig (default config)
make defconfig DEFCONFIG=configs/rt_defconfig # apply a named defconfig
make defconfig DEFCONFIG=configs/defconfig CONFIG_FRAGMENTS=configs/fragments/up.config
make savedefconfig # save current .config back to configs/defconfig
make oldconfig # update .config for new/changed Kconfig symbols

Kconfiglib is auto-cloned into tools/kconfig/ on first use. The generated
.config file is included by Make, and build/config.h is generated for C
code. Disabled features contribute zero text and zero BSS to the kernel image.
Predefined configurations:
| Defconfig | Description |
|---|---|
| configs/defconfig | Default hard-RT profile: SMP, latency tracing, TCP/UDP, DHCP, mDNS, SACK, virtio-blk |
| configs/rt_defconfig | Leaner hard-RT validation profile with SMP, EEVDF, and network options enabled |
Reusable configuration fragments:
| Fragment | Description |
|---|---|
| configs/fragments/up.config | Force uniprocessor mode for QEMU 8.2-based CI or local repros |
| configs/fragments/ubsan.config | Enable trap-mode UBSan on top of an existing defconfig |
Key feature flags (all configurable via make config):
| Symbol | Default | Description |
|---|---|---|
| CONFIG_NET_TCP | y | TCP/IP stack with connection pool, Reno CC, RTT retransmission, sliding window |
| CONFIG_WEBSOCKET | y | WebSocket upgrade path (SHA-1, Base64, frame codec, PING/PONG/CLOSE) |
| CONFIG_SCHED_PREEMPTIVE | y | Mandatory timer-driven kernel preemption |
| CONFIG_SCHED_EEVDF | y | EEVDF fair scheduling within priority levels |
| CONFIG_SMP | y | Symmetric multiprocessing (per-CPU run queues, load balancing) |
| CONFIG_NET_UDP | y | UDP transport (required by DHCP and mDNS) |
| CONFIG_VIRTIO_BLK | y | VirtIO block device driver |
| CONFIG_RAMFS_WRITABLE | y | Writable RAM filesystem |
| CONFIG_SEMIHOSTING | y | RISC-V semihosting for host communication and self-tests |
| CONFIG_DHCP | n | DHCPv4 boot-time client |
| CONFIG_NET_MDNS | n | mDNS responder for mazu.local (RFC 6762) |
| CONFIG_DEBUG_ENDPOINT | n | /debug HTTP endpoint for runtime inspection |
A legacy configuration path via config-riscv64.mk is still supported for
backward compatibility when no .config file exists.
Validation shortcuts:
make check # HTTP integration tests (SLIRP networking)
make check-selftest # semihosting self-tests (requires CONFIG_SEMIHOSTING=y)
make check-smp # SMP-focused checks (requires CONFIG_SMP=y)
./scripts/check.sh # matrix-style checks across selected profiles
./scripts/check.sh --profile-matrix # build + selftest defconfigs and CI overlays

Mazu transmits all traffic, including web-terminal keystrokes and output, over plaintext HTTP with no transport security. It is not safe to expose the kernel directly on an untrusted LAN or the public internet.
The recommended approach is to place Mazu behind a TLS-terminating reverse
proxy on the same host or inside the same trusted network segment. Popular
options include Nginx (proxy_pass http://192.168.100.2) and Caddy. A
WireGuard tunnel is also appropriate when the host itself is remote.
Deployment constraints to enforce:
- TLS: Use a reverse proxy (Nginx, Caddy) or a WireGuard tunnel to encrypt traffic in transit. TLS alone does not authenticate users.
- Authentication: Add HTTP basic auth at the proxy layer (or equivalent network ACLs) to prevent unauthorized access to the web terminal. The kernel itself has no user authentication beyond session tokens for the terminal API.
- Network isolation: The kernel listens on every address it has an IP for. Restrict access at the firewall or via the TAP interface to limit exposure.
Note: Adding a TLS library (e.g., BearSSL, ~45 KB) directly to the kernel
is an option if standalone deployment is required, but conflicts with the
compact-kernel goal. A host-side TLS relay proxy (tools/tls_relay.py) is
provided for development: it accepts plaintext HTTP from the guest and
forwards to upstream HTTPS APIs.
The kernel enforces a syscall security policy at the kernel-user boundary.
Each syscall passes through a three-gate authorization check: valid syscall
number (ENOSYS), process context requirement (EPERM), and the per-process
syscall_allow bitmask allow-list (EACCES). Denied syscalls are logged to
the kernel log ring buffer. Security counters are exposed via /api/stats.
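The three gates can be pictured as a short chain of checks. The following is only a sketch under stated assumptions: `SYS_MAX`, `struct proc_sketch`, `syscall_authorize`, and the `needs_proc` flag are hypothetical names, and the bitmask is assumed to be 64 bits wide. Only the `syscall_allow` field and the three error codes come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SYS_MAX 64u /* hypothetical: number of syscall slots, one per bitmask bit */

/* Hypothetical per-process state; only syscall_allow is named in the text. */
struct proc_sketch {
    uint64_t syscall_allow; /* bit n set => syscall n is permitted */
};

/* Returns 0 when the call may proceed, otherwise the errno code of the
 * first gate that rejected it. */
static int syscall_authorize(const struct proc_sketch *p, uint32_t nr,
                             bool needs_proc)
{
    if (nr >= SYS_MAX)
        return ENOSYS;  /* gate 1: unknown syscall number */
    if (needs_proc && p == NULL)
        return EPERM;   /* gate 2: call requires process context */
    if (p != NULL && !(p->syscall_allow & (UINT64_C(1) << nr)))
        return EACCES;  /* gate 3: not in the per-process allow-list */
    return 0;
}
```

Ordering the gates from cheapest to most specific means an out-of-range number never reaches the bitmask shift, which keeps the shift count below the width of the mask.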
Kernel memory safety hardenings:
- W^X enforcement: user-space pages cannot be simultaneously writable and executable. Code pages are loaded as R+W, then transitioned to R+X after the binary data is copied. ELF binaries requesting W+X segments are rejected at load time.
- User-pointer sanitization: copy_from_user / copy_to_user validate addresses at three layers: range check (within user address space), overflow check (no wrap-around), and per-page PTE_U verification via page-table walk. When the process has registered VMAs (n_vmas > 0), a fourth layer checks VMA containment (address must fall within a registered virtual memory area). Note: the current VMA check (proc_vma_contains) verifies spatial containment only -- VMA permission bits (READ/WRITE/EXEC) are not consulted; permission enforcement relies on PTE bits and hardware faults. Additionally, copy_to_user calls user_addr_valid rather than the stricter user_addr_writable, so the software-side write check is deferred to the hardware SUM fault path.
- VMA tracking: per-process virtual memory area list records code, stack, and data regions with permissions. Accesses outside any registered VMA return EFAULT. Processes with no registered VMAs (n_vmas == 0, e.g., kernel tasks) fall back to range + PTE_U validation only. VMA permission enforcement is a known gap.
- Stack-protector: GCC -fstack-protector-strong (when CONFIG_STACK_PROTECTOR=y) places canary values between local variables and return addresses. The canary is initialized early in boot from entropy-mixed rdtime() samples. Corruption triggers a hard halt.
- Guard pages: the bottom 4 KiB of each kernel task stack is unmapped. Stack overflow causes a page fault instead of silent memory corruption. User-space stacks do not currently have an explicit guard page.
- Magic number validation (debug builds): struct proc and struct sched_task carry magic numbers validated on access. Freed process and task objects are poisoned with 0xDEADDEAD for use-after-free detection. This coverage is scoped to proc/task structures, not all kernel objects.
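The software-side layers of the user-pointer check (range, overflow, VMA containment) might look roughly like the sketch below. The address bounds, `struct vma_sketch`, and all function names are hypothetical, and the per-page PTE_U walk is omitted because it depends on the page-table layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user address window; the real bounds are not given in the text. */
#define USER_BASE  UINT64_C(0x40000000)
#define USER_LIMIT UINT64_C(0x80000000)

struct vma_sketch { uint64_t start, end; }; /* half-open range [start, end) */

/* Layers 1 and 2: reject wrap-around first, then check the window. */
static bool user_range_ok(uint64_t addr, uint64_t len)
{
    uint64_t end = addr + len;
    if (end < addr)
        return false; /* addr + len wrapped past 2^64 */
    return addr >= USER_BASE && end <= USER_LIMIT;
}

/* Layer 4: spatial containment only, permission bits are not consulted. */
static bool vma_contains(const struct vma_sketch *vmas, size_t n,
                         uint64_t addr, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= vmas[i].start && addr <= vmas[i].end &&
            len <= vmas[i].end - addr)
            return true;
    }
    return false;
}

/* With no registered VMAs the check falls back to the range test alone. */
static bool user_access_ok(const struct vma_sketch *vmas, size_t n,
                           uint64_t addr, uint64_t len)
{
    if (!user_range_ok(addr, len))
        return false;
    return n == 0 || vma_contains(vmas, n, addr, len);
}
```

Comparing `len` against `end - addr` rather than computing `addr + len` a second time keeps the containment test free of overflow even if it were called on its own.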
The C programming style provides additional defense through
length-prefixed strings (struct str) that eliminate null-termination
bugs, and separate types for read-only, read-write, and appendable
buffers that enforce mutability constraints at the type level. No uses of
strlen, strcpy, sprintf, or other unbounded C string functions
exist in the kernel or user code.
Use QEMU's GDB stub:
GDB=1 make run
riscv-none-elf-gdb build/kernel.elf
# inside gdb:
target remote localhost:1234

QEMU will pause at startup until GDB connects.
Mazu sits at an intersection that few systems occupy. Linux is the gold standard for SMP correctness and subsystem modularity but carries decades of generality that a hard-RT embedded kernel cannot afford. Plan 9 solved the observability problem elegantly -- expose system state as files, not ioctls -- but never targeted real-time or bare-metal embedded systems. Mazu takes the structural discipline of one and the operational philosophy of the other, and leaves behind the parts that conflict with bounded latency and small code size.
From Linux, Mazu borrows the patterns that keep a concurrent kernel honest:
level-ordered init hooks (DEFINE_INIT_HOOK with INIT_LEVEL_CORE /
INIT_LEVEL_SUBSYS levels, plus a DAG-based initgraph for dependency
ordering), irqchip vtables that decouple trap dispatch from
PLIC-specific MMIO, IRQ descriptor tables with request_irq() / free_irq()
registration, lockdep-style lock-ordering enforcement via per-CPU held_locks
bitmasks, waitqueue-based blocking with timeout callouts, and per-CPU state
accessed through the gp register so that hot-path scheduler and timer code
never touches a global lock. The synchronization primitives -- PI mutexes with
direct handover, condition variables, semaphores with FIFO direct-handover,
futexes with CMP_REQUEUE and PI -- are not simplified versions of their Linux
counterparts; they enforce the same invariants, just without the backward-
compatibility layers.
From Plan 9, Mazu borrows the idea that system state belongs in the filesystem
namespace. Three synthetic filesystems -- /dev (null, zero, console, time,
sysname), /proc (meminfo, uptime, cpuinfo), /net (arp, iface, tcp/stats)
-- generate their content on each read with no pre-computed state and no heap
allocation. The VFS mount table uses longest-prefix matching to dispatch reads
to the correct filesystem vtable. The result: cat /proc/meminfo or
cat /net/tcp/stats works the same way whether called from a shell task or
the REST API, with no special-purpose monitoring code.
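Longest-prefix dispatch over a mount table can be sketched in a few lines. This is an illustration, not the kernel's code: `struct str_sketch`, `struct mount_sketch`, the `S()` macro, the `fs_id` field, and the component-boundary rule are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length-carrying string view, in the spirit of the kernel's struct str. */
struct str_sketch { const char *dat; size_t len; };
#define S(lit) ((struct str_sketch){ (lit), sizeof(lit) - 1 })

/* Hypothetical mount entry: a path prefix plus an id standing in for the
 * filesystem vtable. */
struct mount_sketch { struct str_sketch prefix; int fs_id; };

static bool prefix_matches(struct str_sketch prefix, struct str_sketch path)
{
    if (prefix.len > path.len)
        return false;
    if (memcmp(prefix.dat, path.dat, prefix.len) != 0)
        return false;
    /* Require a component boundary so "/dev" does not claim "/device". */
    return prefix.len == path.len || path.dat[prefix.len] == '/' ||
           (prefix.len > 0 && prefix.dat[prefix.len - 1] == '/');
}

/* Returns the matching entry with the longest prefix, or NULL if none match. */
static const struct mount_sketch *mount_lookup(const struct mount_sketch *tab,
                                               size_t n, struct str_sketch path)
{
    const struct mount_sketch *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (prefix_matches(tab[i].prefix, path) &&
            (best == NULL || tab[i].prefix.len > best->prefix.len))
            best = &tab[i];
    }
    return best;
}
```

With a table holding `/`, `/dev`, `/proc`, and `/net`, a lookup of `/net/tcp/stats` selects the `/net` entry, while `/device` falls through to `/`.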
What Mazu explicitly does not take: Linux's loadable modules, VFS page cache, socket layer, and process isolation model; Plan 9's network stack and user-space server model. The kernel runs all tasks in a single shared page table (kernel regions identity-mapped, user pages at fixed VAs within the same table). Disk-backed SFS uses a block buffer cache; synthetic and RAM filesystems are uncached. Networking is a direct function interface, not a socket API. These omissions are permanent design choices, not items on a backlog.
When SMP support is added as a configuration option on top of a single-core
design, concurrency bugs hide until someone enables the second core. Mazu
inverts this: SMP shapes the data structures, and single-core is the
NR_CPUS=1 special case.
Every hart owns its own run queue, sorted callout list, timer deadline, and
interrupt counters. The struct pcpu is cache-line aligned (64 bytes) and accessed via
the gp register in a single instruction -- no hash table, no array index.
Merged deadline management (min(timer, preempt, watchdog)) reduces hardware
timer reprogramming to the cases where the earliest deadline actually changes.
Lock ordering is enforced at compile time through level constants
(IRQ < PROC < FD < SIG < WAITQ < TCP < SCHED < CALLOUT < ALLOC) and at
runtime through lockdep assertions on every acquire and release.
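A per-CPU bitmask makes the runtime half of this check cheap. The sketch below assumes that locks must be taken in strictly ascending level order and that at most one lock per level is held at a time. Neither rule is spelled out in the text, and the enum and function names are hypothetical. Only the level ordering itself is taken from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Levels in the documented order, lowest first. */
enum lock_level {
    LOCK_IRQ, LOCK_PROC, LOCK_FD, LOCK_SIG, LOCK_WAITQ,
    LOCK_TCP, LOCK_SCHED, LOCK_CALLOUT, LOCK_ALLOC, LOCK_LEVEL_COUNT
};

/* Hypothetical slice of per-CPU state: one bit per held lock level. */
struct pcpu_sketch { uint32_t held_locks; };

/* Returns false on an ordering violation: some lock at the same or a
 * higher level is already held on this CPU. */
static bool lockdep_acquire(struct pcpu_sketch *cpu, enum lock_level lvl)
{
    assert(lvl < LOCK_LEVEL_COUNT);
    uint32_t at_or_above = ~((UINT32_C(1) << lvl) - 1);
    if (cpu->held_locks & at_or_above)
        return false;
    cpu->held_locks |= UINT32_C(1) << lvl;
    return true;
}

static void lockdep_release(struct pcpu_sketch *cpu, enum lock_level lvl)
{
    assert(cpu->held_locks & (UINT32_C(1) << lvl)); /* must have been held */
    cpu->held_locks &= ~(UINT32_C(1) << lvl);
}
```

Because the state is per-CPU, the check itself needs no lock and costs one mask and one compare on each acquire.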
The scheduler is unconditionally preemptive -- CONFIG_SCHED_PREEMPTIVE is
mandatory, and the build fails if someone tries to disable it. Every trap exit
drains need_resched via an atomic exchange. EEVDF provides fairness within
priority levels. EDF deadline scheduling with admission control and budget
enforcement targets hard-deadline workloads. Mixed-criticality scheduling
domains partition CPU time between high-criticality (control) and
low-criticality (web/telemetry) task groups with automatic escalation and
recovery.
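Draining need_resched with an atomic exchange can be illustrated with C11 atomics. This is a sketch: `struct cpu_state` and both function names are hypothetical, and the real trap-exit path is presumably written against the kernel's own atomic primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical slice of per-CPU state. */
struct cpu_state { atomic_bool need_resched; };

/* Called from a timer tick or wakeup path to request a reschedule. */
static void request_resched(struct cpu_state *cpu)
{
    atomic_store(&cpu->need_resched, true);
}

/* Called on every trap exit. The exchange reads and clears the flag in one
 * indivisible step, so a request that arrives concurrently is either seen
 * now or left set for the next trap exit, never dropped. */
static bool trap_exit_should_switch(struct cpu_state *cpu)
{
    return atomic_exchange(&cpu->need_resched, false);
}
```

A separate load followed by a store would leave a window in which a request set between the two operations is overwritten. The single exchange closes that window.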
A common RTOS approach to networking is a separately-maintained IP stack (lwIP, or a BSD-derived layer) integrated with the scheduler through an adapter that bridges two different threading and memory models. That integration boundary can become a source of priority inversions, lock contention, and latency surprises.
Mazu puts TCP/IP inside the kernel under the same lock discipline as everything
else. The receive path is a preemptible scheduler task that drains the
virtio-net ring buffer, demultiplexes through ARP/ICMP/TCP, and hands data to
connection-specific circular buffers -- all under the same lockdep enforcement
that governs the scheduler. TCP connections live in a pool allocator (no
external fragmentation, O(1) alloc/free). Per-IP connection limits (TCP_MAX_CONNS_PER_IP,
TCP_MAX_SYN_RCVD_PER_IP) bound resource consumption under SYN floods without
a separate firewall. The HTTP server runs as a normal preemptible task that
yields its quantum like any other -- a deadline-scheduled task on the same
hart preempts it on the next trap exit.
The design goal is that bounded latency and network correctness share the same scheduler, the same lock hierarchy, and the same per-CPU state rather than living in separate subsystems that must be reconciled at runtime.
The C programming style emphasizes correctness and readability through abstractions that differ from the C standard library, heavily inspired by Chris Wellons' writing (see nullprogram.com). Core elements:
- Length-prefixed strings (struct str { char *dat; sz len; }) instead of null-terminated strings
- Structured return values: either a success with a value or an error with a code -- error codes never share the return value space
- Separate types for read-only (byte_view, str), read-write (byte_array), and appendable (byte_buf, str_buf) memory regions
- Arena allocators for short-lived storage (e.g., per-request HTTP parsing)
- Pool allocators for fixed-size objects (TCP connections, send buffers)
These abstractions carry semantic information in the type system. The
mutability hierarchy eliminates entire classes of buffer-overflow and
use-after-free bugs while keeping the code readable enough for low-level
maintenance. Representative examples of this style include the ramfs
(kernel/fs/ramfs.c) and IP layer (kernel/net/ip.c).
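A small example may make the first two elements concrete. Only the shape of `struct str` is taken from the list above. The `sz` typedef, the `STR()` macro, the `DEFINE_RESULT` macro, the error enum, and `parse_port` are hypothetical stand-ins for whatever the codebase actually defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef ptrdiff_t sz; /* assumption: a signed size type */

struct str { char *dat; sz len; };
#define STR(lit) ((struct str){ (char *)(lit), (sz)sizeof(lit) - 1 })

enum err_sketch { ERR_NONE, ERR_INVALID }; /* hypothetical error codes */

/* Hypothetical macro-generated result type: the error code lives next to
 * the value instead of being encoded inside it. */
#define DEFINE_RESULT(name, type) \
    struct name { bool ok; enum err_sketch err; type val; }
DEFINE_RESULT(result_u16, uint16_t);

/* Parses a decimal TCP port from a length-prefixed string. No terminator
 * is needed, so a slice of a larger buffer parses without copying. */
static struct result_u16 parse_port(struct str s)
{
    struct result_u16 r = { false, ERR_INVALID, 0 };
    uint32_t v = 0;
    if (s.len == 0)
        return r;
    for (sz i = 0; i < s.len; i++) {
        char c = s.dat[i];
        if (c < '0' || c > '9')
            return r;
        v = v * 10 + (uint32_t)(c - '0');
        if (v > 65535)
            return r; /* out of range for a port */
    }
    r.ok = true;
    r.err = ERR_NONE;
    r.val = (uint16_t)v;
    return r;
}
```

A caller checks `r.ok` before touching `r.val`, so port 0 and "parse failed" can never be confused the way they could with a sentinel return value.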
This section documents how the major subsystems fit together -- context that cannot be found in the code alone.
OpenSBI firmware runs in M-mode and hands control to the kernel entry point
at arch/riscv64/entry.c. The entry code saves the FDT pointer from a1,
sets up the initial stack, zeros BSS, and jumps to kernel_init() in
kernel/init/main.c.
Early boot parses the Flattened Device Tree (FDT) to discover hardware:
PLIC base, UART base, VirtIO-mmio slots, timebase frequency, and DRAM
layout. All MMIO addresses are resolved from the FDT with fallbacks to the
QEMU virt machine defaults.
The kernel_init function calls subsystem initialization in dependency
order: memory (mem_init configures the dynamic region), paging
(paging_init builds Sv39 three-level identity-mapped page tables using
2 MiB superpages, activates satp, and flushes the TLB), heap
allocators (kvalloc_init), then initgraph_run(INIT_FLAG_PRIMARY) which
executes a DAG-based dependency graph of init tasks using Kahn's topological
sort (scheduler -> watchdog/loadbal/tcp -> mdns).
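Kahn's algorithm over a small dependency graph can be sketched as follows. The adjacency-matrix representation, `MAX_TASKS`, and the function name are assumptions made for brevity. The real initgraph likely stores dependencies differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 8 /* hypothetical upper bound on init tasks */

struct init_graph {
    size_t n;                       /* number of init tasks in use */
    bool dep[MAX_TASKS][MAX_TASKS]; /* dep[a][b]: a must run before b */
};

/* Writes a valid run order into out[] and returns true, or returns false
 * if the dependencies contain a cycle. */
static bool initgraph_order(const struct init_graph *g, size_t out[MAX_TASKS])
{
    size_t indeg[MAX_TASKS] = { 0 };
    size_t queue[MAX_TASKS];
    size_t head = 0, tail = 0, done = 0;

    assert(g->n <= MAX_TASKS);
    for (size_t a = 0; a < g->n; a++)
        for (size_t b = 0; b < g->n; b++)
            if (g->dep[a][b])
                indeg[b]++;

    /* Start with every task that has no unmet dependency. */
    for (size_t i = 0; i < g->n; i++)
        if (indeg[i] == 0)
            queue[tail++] = i;

    while (head < tail) {
        size_t a = queue[head++];
        out[done++] = a;
        /* Running a satisfies one dependency of each successor. */
        for (size_t b = 0; b < g->n; b++)
            if (g->dep[a][b] && --indeg[b] == 0)
                queue[tail++] = b;
    }
    return done == g->n; /* tasks left over mean a cycle */
}
```

Reading the chain in the text as scheduler before watchdog, loadbal, and tcp, and tcp before mdns (an interpretation, since the arrow notation is compact), the sort places the scheduler first and mdns last.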
Architecture init (arch_init) brings up the UART, installs the PLIC-backed
trap vector via the irqchip vtable, and hardens CSR state (clears
sstatus.SUM and sstatus.MXR).
After init, the kernel creates core service tasks (packet receive, TCP maintenance, web serving, optional probes) and enters the scheduler loop.
Mazu ships only hard-RT scheduling profiles:
- Default SMP hard-RT profile (configs/defconfig)
- RT validation profile (configs/rt_defconfig)
Timer-driven quanta force reschedule points at every priority level. When
CONFIG_SMP is enabled, each hart has its own run queue with a per-CPU lock,
an idle-steal path for pull migration, and a periodic load balancer using
exponential-decay estimation. Optional EEVDF fair scheduling
(CONFIG_SCHED_EEVDF) bounds wake-to-run latency within each priority level.
Deadline scheduling and mixed-criticality extensions build on the same SMP-first
model. Scheduling domains (struct sched_domain) enforce per-group CPU budgets
with automatic refill.
Kernel services are driven by scheduler tasks created during boot. Typical long-lived tasks include:
- Packet receive path (netdev -> protocol dispatch)
- TCP retransmission/callout maintenance
- Activity watchdog (detects hung tasks after 5 s of inactivity)
- SMP load balancer (periodic rebalancing across harts)
- HTTP/WebSocket request handling
Networking is a core subsystem in Mazu because the kernel is intended for
connected embedded systems rather than isolated firmware. On QEMU virt,
networking is provided by the virtio-mmio net driver
(drivers/net/virtio_mmio.c) through the netdev abstraction layer. The RX
interrupt path pushes frames into the netdev input queue; scheduler tasks
drain that queue and pass packets up the protocol stack.
The receive task checks the queue, validates packet structure at each layer, and demultiplexes to ARP/ICMP/TCP (plus optional UDP-based protocols). Replies are usually emitted during packet handling; TCP also stages sent data in retransmission queues.
The routing table is initialized from rootfs/config.txt in kernel_init (or DHCP-derived values
when enabled). At transmit time, route lookup chooses interface/gateway; unresolved L2 next-hop
addresses trigger ARP resolution before payload transmission.
The TCP subsystem is the most complex part of Mazu and is documented in the most detail here.
- RFC 1323: TCP Extensions for High Performance
- RFC 6298: Computing TCP's Retransmission Timer
- RFC 9293: Transmission Control Protocol (TCP)
The essential functions exposed by the TCP subsystem:
/* Server-side (listen/accept) */
struct tcp_conn *tcp_conn_listen(struct ipv4_addr addr, u16 port, struct arena tmp);
struct tcp_conn *tcp_conn_accept(struct tcp_conn *listen_conn);
/* Client-side (active open) */
struct tcp_conn *tcp_conn_connect(struct ipv4_addr remote_addr, u16 remote_port, struct arena tmp);
bool tcp_conn_is_connected(struct tcp_conn *conn);
bool tcp_conn_is_reset(struct tcp_conn *conn);
/* Data transfer and teardown */
struct result_sz tcp_conn_send(struct tcp_conn *conn, struct byte_view payload, bool *peer_closed_conn,
struct arena tmp);
struct result_sz tcp_conn_recv(struct tcp_conn *conn, struct byte_buf *buf, bool *peer_closed_conn);
struct result tcp_conn_close(struct tcp_conn **conn, struct arena tmp);

The primary consumer of the TCP interface is the web server. For comparison, the equivalent Berkeley Sockets setup looks like:
sfd = socket(AF_INET, SOCK_STREAM, 0);
bind(sfd, (struct sockaddr *)&addr, sizeof(addr));
listen(sfd, BACKLOG_SIZE);

The effect of all three calls is implemented by tcp_conn_listen in the Mazu TCP interface,
which takes as arguments an IP address and a port number. tcp_conn_listen returns a connection
structure that serves as a handle for a LISTEN-state connection ("listen connection" for short)
that the function creates. This connection structure essentially serves the purpose of the
sfd file descriptor in the example above.
Connections can be accepted with tcp_conn_accept, whose only argument is a
listen connection. If a peer has already initiated a connection to the IP
address and port that were passed to tcp_conn_listen, tcp_conn_accept returns
a struct tcp_conn handle for that connection. tcp_conn_accept can be polled
to await a connection.
Note that tcp_conn_listen and tcp_conn_accept both create a new connection. I.e., after
calling each function once and getting a non-NULL return value both times, there exist two
connections: one in the LISTEN state and one representing an active connection to a peer. This,
in turn, means a listen connection can be reused indefinitely to accept further connections. The
listen connection is deleted only after calling tcp_conn_close on it.
Three operations can be performed on open connections returned by tcp_conn_accept: sending data
to the peer, receiving data from the peer, and closing the connection. Connections are closed and
deleted by tcp_conn_close.
The send and receive functions can be called arbitrarily often.
tcp_conn_send takes a connection, a payload of bytes, and a pointer to a
boolean flag indicating whether the peer has closed the connection. If the peer
has closed the connection, it will not acknowledge new data; callers should
check this flag periodically and close the connection when it is set. The return
value indicates the number of bytes transmitted. TCP uses a sliding-window
approach to flow control. If the caller sends data faster than the peer can
acknowledge, the window fills up, the implementation stops transmitting, and the
return value of tcp_conn_send is smaller than the payload length.
tcp_conn_send internally splits the payload into fragments small enough to fit
one Ethernet frame. Larger TCP segments could rely on IP-layer fragmentation,
but that has a downside: TCP retransmits at the segment level, so losing a
single IP fragment forces retransmission of the entire segment. Fragmenting at
the TCP level avoids this overhead. After splitting the payload, each fragment
is transmitted immediately and also added to the send buffer queue (SBQ) of the
connection for retransmission (see below).
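The interaction between the send window and per-frame splitting can be sketched as a planning function. The names and the 1460-byte segment size are assumptions: 1460 corresponds to a 1500-byte Ethernet MTU minus 20-byte IPv4 and 20-byte TCP headers with no options, and the real limit is smaller when options such as timestamps are carried.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical maximum TCP payload per Ethernet frame (no TCP options). */
#define MSS_SKETCH 1460u

struct send_plan {
    size_t accepted; /* bytes taken from the payload, the send return value */
    size_t segments; /* number of frame-sized segments they are split into */
};

/* Accepts at most window_free bytes, then cuts the accepted bytes into
 * segments that each fit in a single frame. */
static struct send_plan plan_send(size_t payload_len, size_t window_free)
{
    struct send_plan plan = { 0, 0 };
    size_t budget = payload_len < window_free ? payload_len : window_free;

    while (plan.accepted < budget) {
        size_t chunk = budget - plan.accepted;
        if (chunk > MSS_SKETCH)
            chunk = MSS_SKETCH;
        /* A real implementation would transmit the chunk here and queue
         * it on the connection's SBQ for possible retransmission. */
        plan.accepted += chunk;
        plan.segments++;
    }
    assert(plan.accepted <= payload_len);
    return plan;
}
```

A caller that gets back fewer bytes than it offered would typically retry with the remainder once the peer has acknowledged data and the window has reopened.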
tcp_conn_recv takes a connection, a destination buffer, and the peer-closed
flag. The TCP implementation buffers all received data internally. On each call,
available data is copied from the internal circular buffer into the destination
buffer. The amount copied is limited by whichever is smaller: available data or
destination capacity. The return value is the byte count copied; 0 means no
data is available.
The TCP subsystem also supports outbound (active open) connections via tcp_conn_connect.
This allocates an ephemeral port (49152-65535), sends a SYN, and returns a handle in SYN_SENT
state. The caller polls tcp_conn_is_connected until the three-way handshake completes (or
tcp_conn_is_reset to detect failure). Once connected, tcp_conn_send and tcp_conn_recv
work identically to server-side connections. This enables the kernel to make outbound HTTP
requests, which is required for the AI assistant use case.
A note on the peer-closed flag: in the Berkeley Sockets API, read(2) reports
a closed connection by returning 0 and an error by returning -1, so status
conditions share the return-value space with byte counts. Mazu rejects this
practice. A separate boolean flag is a better fit because the condition ("has
the peer closed?") does not need to be checked on every call -- only
eventually, to avoid infinite loops.
The TCP protocol is based on a per-connection state machine. RFC 9293 contains an ASCII diagram of the different states:
┌─────────┐ ────────────\ active OPEN
│ CLOSED │ \ ───────────
└─────────┘◀─────────\ \ create TCB
│ ▲ \ \ snd SYN
passive OPEN │ │ CLOSE \ \
──────────── │ │ ────────── \ \
create TCB │ │ delete TCB \ \
▼ │ \ \
rcv RST (note 1) ┌─────────┐ CLOSE │ \
────────────────────▶│ LISTEN │ ────────│ │
/ └─────────┘ delete TCB │
/ rcv SYN │ │ SEND │ │
/ ─────────── │ │ ─────── │ ▼
┌────────┐ snd SYN,ACK / \ snd SYN ┌────────┐
│ │◀───────────────── ────────────────▶ │ │
│ SYN │ rcv SYN │ SYN │
│ RCVD │◀──────────────────────────────────────────────│ SENT │
│ │ snd SYN,ACK │ │
│ │────────────────── ──────────────────│ │
└────────┘ rcv ACK of SYN \ / rcv SYN,ACK └────────┘
│ ────────────── │ │ ───────────
│ × │ │ snd ACK
│ ▼ ▼
│ CLOSE ┌─────────┐
│ ─────── │ ESTAB │
│ snd FIN └─────────┘
│ CLOSE │ │ rcv FIN
▼ ─────── │ │ ───────
┌─────────┐ snd FIN / \ snd ACK ┌─────────┐
│ FIN │◀──────────────── ──────────────────▶│ CLOSE │
│ WAIT-1 │────────────────── │ WAIT │
└─────────┘ rcv FIN \ └─────────┘
│ rcv ACK of FIN ─────── │ CLOSE │
│ ────────────── snd ACK │ ─────── │
▼ × ▼ snd FIN ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│FINWAIT-2│ │ CLOSING │ │ LAST-ACK│
└─────────┘ └─────────┘ └─────────┘
│ rcv ACK of FIN │ rcv ACK of FIN │
│ rcv FIN ────────────── │ Timeout=2MSL ────────────── │
│ ─────── × ▼ ──────────── × ▼
\ snd ACK ┌─────────┐delete TCB ┌─────────┐
────────────────────▶ │TIME-WAIT│───────────────────▶│ CLOSED │
└─────────┘ └─────────┘
The Mazu TCP implementation follows this state machine closely.
When a segment is received from the IP layer, the corresponding connection is looked up, and a
call into a handler based on the current state of the connection is made. These handlers
manage the connection: they handle transitions between states, allocate and free connections
and receive buffers, and update the different variables in the connection structures. Their
function names start with tcp_handle_receive_.
Handling each state separately is verbose, and coalescing similar behavior into
a generic handler that treats per-state differences as special cases would reduce
code size. The trade-off is deliberate: per-state handlers mirror the RFC
specification directly, making the code straightforward to verify and debug.
Where obvious, common behavior is factored into the tcp_conn_update_*
functions.
A TCP connection receives data in the ESTABLISHED state. A dedicated task
polls the network device; when IP data arrives, the IP layer extracts the
TCP/IP pseudo header and passes it to tcp_handle_packet, which invokes the
per-state handlers described above.
When data is received by an ESTABLISHED TCP connection, it is appended to a circular buffer.
The buffer is allocated right before transitioning the connection to the ESTABLISHED state,
and it has a fixed size. The TCP implementation advertises the amount of available space to
the peer with the window size field of the TCP header. The advertised window size decreases
while the circular buffer fills up, which discourages the peer from sending more data. The
window size increases again after the caller has copied received data out of
the circular buffer via tcp_conn_recv.
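The relationship between buffer occupancy and the advertised window can be sketched with a small ring buffer. The names (struct rx_ring, rx_ring_push, rx_ring_pop, rx_ring_window) and the 8-byte capacity are hypothetical and chosen only to make wrap-around easy to follow; Mazu's actual buffer layout is not shown in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_RING_SIZE 8u /* hypothetical; tiny so wrap-around is visible */

struct rx_ring {
    uint8_t data[RX_RING_SIZE];
    size_t head; /* index of the next byte to read */
    size_t len;  /* number of bytes currently buffered */
};

/* Free space, i.e. the value that would go into the TCP window field. */
static size_t rx_ring_window(const struct rx_ring *r)
{
    assert(r->len <= RX_RING_SIZE);
    return RX_RING_SIZE - r->len;
}

/* Append up to n bytes; returns how many were accepted. */
static size_t rx_ring_push(struct rx_ring *r, const uint8_t *src, size_t n)
{
    size_t room = rx_ring_window(r);
    if (n > room)
        n = room;
    for (size_t i = 0; i < n; i++)
        r->data[(r->head + r->len + i) % RX_RING_SIZE] = src[i];
    r->len += n;
    return n;
}

/* Copy out up to n bytes (the role tcp_conn_recv plays in the text). */
static size_t rx_ring_pop(struct rx_ring *r, uint8_t *dst, size_t n)
{
    if (n > r->len)
        n = r->len;
    for (size_t i = 0; i < n; i++)
        dst[i] = r->data[(r->head + i) % RX_RING_SIZE];
    r->head = (r->head + n) % RX_RING_SIZE;
    r->len -= n;
    return n;
}
```

Pushing shrinks the window and popping grows it again, so the peer sees the window reopen only after the application has drained data.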
A key function of TCP, besides flow control, is ensuring reliable delivery.
tcp_conn_send calls the internal tcp_send_segment function, which allocates a
send buffer (SB) -- a data structure designed to make prepending protocol headers
easy as the packet moves down the network stack. The payload is copied into the
send buffer, and the buffer is appended to the connection's send buffer queue
(SBQ). The SBQ is a linked list where each node carries timestamps for
retransmission timing and the ACK number that must arrive before the segment can
be freed.
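One common way to make header prepending cheap is to reserve headroom in front of the payload, so each layer writes its header just before the current start of data. The sketch below assumes that approach; struct send_buf_sketch, the sb_* helpers, and the sizes are hypothetical, and the document does not say whether Mazu's send buffer is laid out this way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SB_CAPACITY 128u /* hypothetical total size */
#define SB_HEADROOM 64u  /* hypothetical space reserved for headers */

struct send_buf_sketch {
    uint8_t storage[SB_CAPACITY];
    size_t start; /* offset of the first valid byte */
    size_t len;   /* number of valid bytes from start */
};

static void sb_init(struct send_buf_sketch *sb)
{
    sb->start = SB_HEADROOM; /* payload begins after the headroom */
    sb->len = 0;
}

/* Append payload bytes at the tail. Returns 0 on success, -1 if full. */
static int sb_append(struct send_buf_sketch *sb, const void *src, size_t n)
{
    if (sb->start + sb->len + n > SB_CAPACITY)
        return -1;
    memcpy(sb->storage + sb->start + sb->len, src, n);
    sb->len += n;
    return 0;
}

/* Prepend a header by moving start backwards; no payload copy needed. */
static int sb_prepend(struct send_buf_sketch *sb, const void *hdr, size_t n)
{
    if (n > sb->start)
        return -1;
    sb->start -= n;
    memcpy(sb->storage + sb->start, hdr, n);
    sb->len += n;
    return 0;
}

static const uint8_t *sb_data(const struct send_buf_sketch *sb)
{
    assert(sb->start + sb->len <= SB_CAPACITY);
    return sb->storage + sb->start;
}
```

Each layer calls sb_prepend once on the way down, and the driver transmits sb->len bytes starting at sb_data().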
tcp_send_segment calls into the IP layer to transmit the segment immediately
after adding the payload to the retransmission queue. A dedicated scheduler task
periodically calls tcp_poll_retransmit, which iterates
over all active connections and their SBQs. Each node in the SBQ is processed as follows (where
one node represents one segment waiting for retransmission or acknowledgment):
- If the ACK for the segment has arrived since the last poll, the segment data is freed and the node is removed from the queue.
- If a maximum number of retransmission attempts has been reached, the segment data is freed and the node is removed from the queue.
- If neither of the two above conditions holds, and the retransmission timeout of the segment has expired, the segment is retransmitted. The timeout doubles on each retransmission (exponential backoff).
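The three rules above can be sketched as a single pass over a linked list. Everything here is an assumption made for illustration: the struct layouts, the sbq_push and sbq_poll names, the limit of five attempts, and the use of calloc/free in place of the kernel's pool allocator. A counter stands in for handing the segment back to the IP layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_RETRANSMITS 5u /* hypothetical limit */

struct sbq_node {
    struct sbq_node *next;
    uint32_t ack_needed; /* ACK number that releases this segment */
    uint64_t sent_at_ms; /* time of the last (re)transmission */
    uint32_t rto_ms;     /* current timeout for this segment */
    unsigned attempts;   /* retransmissions so far */
};

struct sbq {
    struct sbq_node *head;
    uint32_t snd_una;     /* highest ACK number received so far */
    unsigned retransmits; /* stand-in for calls into the IP layer */
};

/* Sequence-number comparison that tolerates 32-bit wrap-around. */
static int seq_geq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Append a freshly sent segment to the tail of the queue. */
static struct sbq_node *sbq_push(struct sbq *q, uint32_t ack_needed,
                                 uint64_t now_ms, uint32_t rto_ms)
{
    struct sbq_node *n = calloc(1, sizeof *n);
    struct sbq_node **link = &q->head;
    if (!n)
        return NULL;
    n->ack_needed = ack_needed;
    n->sent_at_ms = now_ms;
    n->rto_ms = rto_ms;
    while (*link)
        link = &(*link)->next;
    *link = n;
    return n;
}

/* One polling pass: free acknowledged or exhausted nodes, resend expired ones. */
static void sbq_poll(struct sbq *q, uint64_t now_ms)
{
    struct sbq_node **link;
    assert(q != NULL);
    link = &q->head;
    while (*link) {
        struct sbq_node *n = *link;
        if (seq_geq(q->snd_una, n->ack_needed) ||
            n->attempts >= MAX_RETRANSMITS) {
            *link = n->next; /* unlink, then free */
            free(n);
            continue;
        }
        if (now_ms - n->sent_at_ms >= n->rto_ms) {
            q->retransmits++; /* the segment would be resent here */
            n->attempts++;
            n->sent_at_ms = now_ms;
            n->rto_ms *= 2;   /* exponential backoff */
        }
        link = &n->next;
    }
}
```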
The TCP timestamps option (TSopt) measures the round-trip time (RTT) of each connection. The base retransmission timeout (RTO) is computed dynamically from RTT measurements using the algorithm in RFC 6298. If the peer does not send TSopt, a default RTO of 1 second is used. This is unlikely in practice, since TSopt has been specified since 1992 (RFC 1323).
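The RFC 6298 computation can be sketched in integer milliseconds as follows. The smoothing gains (1/8 and 1/4), the first-sample rule, and the 1-second floor come from the RFC; the struct and function names, the 1 ms clock granularity, and the 60-second ceiling are assumptions for this sketch, and Mazu's clamping values may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTO_INITIAL_MS 1000u  /* RTO before any RTT sample exists */
#define RTO_MIN_MS     1000u  /* floor recommended by RFC 6298 */
#define RTO_MAX_MS     60000u /* assumed ceiling */
#define CLOCK_G_MS     1u     /* assumed clock granularity */

struct rto_est {
    int has_sample;
    uint32_t srtt_ms;   /* smoothed RTT */
    uint32_t rttvar_ms; /* RTT variation */
    uint32_t rto_ms;    /* resulting timeout */
};

static void rto_init(struct rto_est *e)
{
    e->has_sample = 0;
    e->srtt_ms = 0;
    e->rttvar_ms = 0;
    e->rto_ms = RTO_INITIAL_MS;
}

/* Feed one RTT measurement r_ms and recompute the timeout. */
static void rto_sample(struct rto_est *e, uint32_t r_ms)
{
    uint32_t var_term;
    assert(e != NULL);
    if (!e->has_sample) {
        /* First measurement: SRTT = R, RTTVAR = R / 2. */
        e->srtt_ms = r_ms;
        e->rttvar_ms = r_ms / 2;
        e->has_sample = 1;
    } else {
        /* RTTVAR is updated first, using the old SRTT (beta = 1/4). */
        uint32_t diff = e->srtt_ms > r_ms ? e->srtt_ms - r_ms
                                          : r_ms - e->srtt_ms;
        e->rttvar_ms = (3 * e->rttvar_ms + diff) / 4;
        /* Then SRTT moves 1/8 of the way toward the sample (alpha = 1/8). */
        e->srtt_ms = (7 * e->srtt_ms + r_ms) / 8;
    }
    /* RTO = SRTT + max(G, 4 * RTTVAR), clamped to [min, max]. */
    var_term = 4 * e->rttvar_ms;
    if (var_term < CLOCK_G_MS)
        var_term = CLOCK_G_MS;
    e->rto_ms = e->srtt_ms + var_term;
    if (e->rto_ms < RTO_MIN_MS)
        e->rto_ms = RTO_MIN_MS;
    if (e->rto_ms > RTO_MAX_MS)
        e->rto_ms = RTO_MAX_MS;
}
```

With the 1-second floor, samples typical of a local network leave the RTO at the minimum; the computed value only matters on slow or variable paths.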
TCP congestion control is factored into a pluggable vtable (struct tcp_cc_ops) with three
callbacks: on_ack, on_dup_ack, and on_timeout. The default implementation is Reno
(RFC 5681): slow start, congestion avoidance, fast recovery on 3 duplicate ACKs, and
exponential backoff on RTO. The initial congestion window is 10 segments (RFC 6928). The
send path uses min(rwnd, cwnd) as the effective window. Per-connection cwnd and ssthresh
are exposed in the /api/tcp JSON endpoint. The vtable design allows future replacement with
Cubic or BBR without touching the TCP state machine.
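A simplified Reno behind a three-callback vtable might look like the sketch below. The callback names match the text, but the signatures, struct cc_state, and the _sketch suffix are assumptions, and the code uses cwnd/2 where RFC 5681 specifies FlightSize/2. It illustrates the shape of the interface rather than Mazu's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cc_state {
    uint32_t cwnd;     /* congestion window, bytes */
    uint32_t ssthresh; /* slow-start threshold, bytes */
    uint32_t mss;      /* maximum segment size, bytes */
    unsigned dup_acks; /* consecutive duplicate ACKs seen */
};

/* Assumed signatures; only the three callback names come from the text. */
struct tcp_cc_ops_sketch {
    void (*on_ack)(struct cc_state *s, uint32_t acked_bytes);
    void (*on_dup_ack)(struct cc_state *s);
    void (*on_timeout)(struct cc_state *s);
};

/* Half the window, but never below two segments. */
static uint32_t half_window(const struct cc_state *s)
{
    uint32_t half = s->cwnd / 2;
    uint32_t floor = 2 * s->mss;
    return half > floor ? half : floor;
}

static void reno_on_ack(struct cc_state *s, uint32_t acked_bytes)
{
    if (s->dup_acks >= 3) { /* a new ACK ends fast recovery */
        s->cwnd = s->ssthresh;
        s->dup_acks = 0;
        return;
    }
    s->dup_acks = 0;
    if (s->cwnd < s->ssthresh) {
        /* Slow start: grow by at most one MSS per ACK. */
        s->cwnd += acked_bytes < s->mss ? acked_bytes : s->mss;
    } else {
        /* Congestion avoidance: roughly one MSS per round trip. */
        uint32_t inc;
        assert(s->cwnd > 0);
        inc = (uint32_t)(((uint64_t)s->mss * s->mss) / s->cwnd);
        s->cwnd += inc ? inc : 1;
    }
}

static void reno_on_dup_ack(struct cc_state *s)
{
    s->dup_acks++;
    if (s->dup_acks == 3) { /* enter fast recovery */
        s->ssthresh = half_window(s);
        s->cwnd = s->ssthresh + 3 * s->mss;
    } else if (s->dup_acks > 3) {
        s->cwnd += s->mss;  /* inflate for each further duplicate */
    }
}

static void reno_on_timeout(struct cc_state *s)
{
    s->ssthresh = half_window(s);
    s->cwnd = s->mss; /* restart from one segment */
    s->dup_acks = 0;
}

static const struct tcp_cc_ops_sketch reno_ops = {
    .on_ack = reno_on_ack,
    .on_dup_ack = reno_on_dup_ack,
    .on_timeout = reno_on_timeout,
};

/* The send path limits itself to min(rwnd, cwnd). */
static uint32_t effective_window(uint32_t rwnd, uint32_t cwnd)
{
    return rwnd < cwnd ? rwnd : cwnd;
}
```

Because the state machine only ever calls through the ops table, swapping in another algorithm means providing a different table with the same three entry points.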
TCP frequently needs new connection structures and data buffers, most of which are short-lived. The implementation uses two allocation strategies.
Connection structures live in a global array. Each entry has an in-use flag; allocation scans for an unused slot. This is simple, fast, and provides good data locality. The array also supports the frequent full-table scans required by retransmission polling and connection cleanup -- something a pool allocator is not designed for.
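As a rough sketch of the slot-array approach, with a hypothetical table size, struct, and function names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CONNS 4u /* hypothetical table size */

struct conn_slot {
    bool in_use;
    uint16_t local_port; /* placeholder for the real connection state */
};

static struct conn_slot conn_table[MAX_CONNS];

/* Linear scan for the first free slot; NULL when the table is full. */
static struct conn_slot *conn_alloc(void)
{
    for (size_t i = 0; i < MAX_CONNS; i++) {
        if (!conn_table[i].in_use) {
            memset(&conn_table[i], 0, sizeof conn_table[i]);
            conn_table[i].in_use = true;
            return &conn_table[i];
        }
    }
    return NULL;
}

static void conn_free(struct conn_slot *c)
{
    assert(c->in_use);
    c->in_use = false;
}

/* The same array supports full-table scans, e.g. for retransmission polling. */
static unsigned conn_count_active(void)
{
    unsigned n = 0;
    for (size_t i = 0; i < MAX_CONNS; i++)
        if (conn_table[i].in_use)
            n++;
    return n;
}
```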
A connection structure is allocated when the TCP handshake begins, but no data buffers are allocated at that point. Receive buffers are allocated from a fixed-size pool after the handshake completes. The pool allocator has low overhead and strong locality.
Send buffers in the retransmission queue (SBQ) are allocated from their own pool allocator. Any number of send buffers can be allocated for a single connection, depending on how much data the caller is transmitting and how much remains unacknowledged. Send buffers are freed based on the rules above.
The pool allocators are in turn backed by large contiguous allocations from kvalloc,
all made at boot. This keeps fragmentation low and allocation fast.
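A fixed-size pool over one contiguous region can be sketched with an intrusive free list: free chunks store the link to the next free chunk in their own memory. The names and sizes below are hypothetical, and a static array stands in for the region that would come from kvalloc at boot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_CHUNK_SIZE 32u /* hypothetical chunk size */
#define POOL_N_CHUNKS   4u  /* hypothetical chunk count */

/* A chunk is either on the free list (next_free) or in use (payload). */
union pool_chunk {
    union pool_chunk *next_free;
    uint8_t payload[POOL_CHUNK_SIZE];
};

struct pool_sketch {
    union pool_chunk *free_list;
};

/* Stand-in for a single contiguous allocation made once at boot. */
static union pool_chunk pool_backing[POOL_N_CHUNKS];

/* Thread every chunk onto the free list, lowest address first. */
static void pool_init(struct pool_sketch *p, union pool_chunk *backing, size_t n)
{
    p->free_list = NULL;
    for (size_t i = n; i-- > 0;) {
        backing[i].next_free = p->free_list;
        p->free_list = &backing[i];
    }
}

/* O(1): pop the head of the free list. */
static void *pool_alloc(struct pool_sketch *p)
{
    union pool_chunk *c = p->free_list;
    if (!c)
        return NULL;
    p->free_list = c->next_free;
    return c->payload;
}

/* O(1): push the chunk back onto the free list. */
static void pool_free(struct pool_sketch *p, void *ptr)
{
    union pool_chunk *c = ptr;
    assert(ptr != NULL);
    c->next_free = p->free_list;
    p->free_list = c;
}
```

Allocation and release are constant-time pointer swaps, and since every chunk has the same size the pool cannot fragment.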
The RAM file system (ramfs) stores content served by the web server. The core
data structure is struct ram_fs_node:
struct ram_fs_node {
// First node in the directory if this node is of type RAM_FS_TYPE_DIR.
struct ram_fs_node *first;
// Next node in the same directory as this node. A linked list.
struct ram_fs_node *next;
enum ram_fs_node_type type;
struct str name;
// Data of the file if this node is of type RAM_FS_TYPE_FILE.
struct byte_buf data;
// Pointer back to parent FS.
struct ram_fs *fs;
};
An instance of a ramfs is defined by struct ram_fs:
struct ram_fs {
struct alloc data_alloc;
struct pool node_alloc;
struct arena scratch;
struct ram_fs_node *root;
};
data_alloc is an abstract allocator (any implementation works) used to allocate buffers for file
data and node names. node_alloc is a pool allocator that hands out fixed-size chunks
of memory for struct ram_fs_node allocations. A ramfs is created by calling ram_fs_new, which
takes data_alloc as its only argument; node_alloc is then allocated from
data_alloc.
A separate allocator for node names would make sense because the allocation
patterns of names and data buffers differ. However, names vary in size, so a
pool allocator would waste memory (each slot must be the maximum size). Instead,
data_alloc serves all variable-length allocations.
The first and next fields of struct ram_fs_node form a tree. next
links all nodes in the same directory as a linked list. first is set only on
directory nodes and points to the first child. Subsequent children are reached
by following next pointers.
┌─ram_fs_node───────────┐
│ │
│ /web │ NULL
│ │
└───┬───────────────────┘
│
│ first
▼
┌─ram_fs_node───────────┐ ┌─ram_fs_node───────────┐
│ │ next next │ │
│ /web/index.html ├──────▶ ... ──────▶│ /web/public │ NULL
│ │ │ │
└───────────────────────┘ └───┬───────────────────┘
NULL │
│ first
▼
┌─ram_fs_node───────────┐
│ │
│ /web/public/style.css │ NULL
│ │
└───────────────────────┘
NULL
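Traversal of the first/next tree shown above can be sketched as follows. The node type is a reduced, hypothetical version of struct ram_fs_node (plain C strings instead of struct str, no data or fs fields), and dir_find_child and tree_lookup are illustrative names rather than Mazu functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum node_type_sketch { NODE_DIR, NODE_FILE };

/* Reduced stand-in for struct ram_fs_node. */
struct node_sketch {
    struct node_sketch *first; /* first child, directories only */
    struct node_sketch *next;  /* next sibling in the same directory */
    enum node_type_sketch type;
    const char *name;
};

/* Walk the sibling list of a directory looking for a name. */
static struct node_sketch *dir_find_child(const struct node_sketch *dir,
                                          const char *name)
{
    assert(dir != NULL);
    if (dir->type != NODE_DIR)
        return NULL; /* files have no children */
    for (struct node_sketch *n = dir->first; n; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;
    return NULL;
}

/* Resolve a list of path components, one directory level per component. */
static struct node_sketch *tree_lookup(struct node_sketch *root,
                                       const char *const *components,
                                       size_t n_components)
{
    struct node_sketch *cur = root;
    for (size_t i = 0; i < n_components && cur; i++)
        cur = dir_find_child(cur, components[i]);
    return cur;
}
```

An empty component list resolves to the root itself, and lookup cost is linear in the number of entries in each directory along the path.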
Internally, paths are represented by struct path_name:
struct path_name {
struct str src;
// The path '/' is represented by a `struct path_name` where `n_components` is 0, the empty path.
sz n_components;
struct str *components;
bool is_absolute;
};
path_name_parse takes a path string and an arena allocator, and returns a
struct path_name. The src field holds a full, unmodified copy of the path
string (allocated from the arena). The components array contains string slices
pointing into src, one per path component, with slashes stripped.
This structure makes path lookup trivial and keeps parsing cleanly separated from tree traversal -- two concerns that are easier to verify independently.
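The splitting step can be sketched as below. To stay self-contained, this version uses a fixed-size component array and slices that point into the caller's string, whereas the text describes copying the source into an arena first. struct slice, struct path_sketch, path_parse_sketch, and the component limit are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENTS 16u /* hypothetical limit; the real code uses an arena */

struct slice {
    const char *ptr; /* points into the source string, not NUL-terminated */
    size_t len;
};

struct path_sketch {
    struct slice components[MAX_COMPONENTS];
    size_t n_components; /* 0 for the path "/" */
    bool is_absolute;
};

/* Split src on '/', skipping empty components. Returns false on overflow. */
static bool path_parse_sketch(const char *src, struct path_sketch *out)
{
    size_t i = 0;
    assert(src != NULL && out != NULL);
    out->n_components = 0;
    out->is_absolute = (src[0] == '/');
    while (src[i] != '\0') {
        size_t start;
        while (src[i] == '/') /* skip one or more slashes */
            i++;
        start = i;
        while (src[i] != '\0' && src[i] != '/')
            i++;
        if (i > start) {
            if (out->n_components == MAX_COMPONENTS)
                return false;
            out->components[out->n_components].ptr = src + start;
            out->components[out->n_components].len = i - start;
            out->n_components++;
        }
    }
    return true;
}
```

Once the components exist as slices, a tree walk only compares names and never has to reason about slashes.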
Mazu includes several runtime verification mechanisms, all gated on debug builds
(__DEBUG__ > 0) and compiled out entirely in release builds.
Lock ordering enforcement (include/mazu/lockdep.h): a per-CPU held_locks
bitmask tracks which lock levels are currently held. lockdep_acquire() asserts
that no same-or-higher-level lock is held before acquiring; lockdep_release()
asserts the lock was actually held (catches double-release). The lock hierarchy
is: IRQ(0) < PROC(1) < FD(2) < SIG(3) < WAITQ(4) < TCP(5) < SCHED(6) <
CALLOUT(7) < ALLOC(8). Violations
trigger DEBUG_ASSERT in debug builds.
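The bitmask check can be sketched as follows. The level numbers follow the hierarchy listed above; the enum names, the _sketch suffixes, and the single global mask (per-CPU in the kernel) are assumptions. The functions return false on a violation instead of asserting so the behavior is easy to observe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Level numbers as listed in the text; identifier names are hypothetical. */
enum lock_level_sketch {
    LL_IRQ = 0, LL_PROC, LL_FD, LL_SIG, LL_WAITQ,
    LL_TCP, LL_SCHED, LL_CALLOUT, LL_ALLOC
};

/* One bit per level. Per-CPU in the kernel; a single global here. */
static uint32_t held_locks;

/* Acquire is legal only if no lock at this level or above is held. */
static bool lockdep_acquire_sketch(enum lock_level_sketch level)
{
    uint32_t same_or_higher;
    assert(level <= LL_ALLOC);
    same_or_higher = ~((1u << level) - 1u); /* bits level..31 */
    if (held_locks & same_or_higher)
        return false; /* ordering violation */
    held_locks |= 1u << level;
    return true;
}

/* Release is legal only if the level is actually held. */
static bool lockdep_release_sketch(enum lock_level_sketch level)
{
    uint32_t bit;
    assert(level <= LL_ALLOC);
    bit = 1u << level;
    if (!(held_locks & bit))
        return false; /* double release or never acquired */
    held_locks &= ~bit;
    return true;
}
```

Both checks are a single mask-and-test, which keeps the overhead small even in code that takes locks frequently.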
Scheduler invariant checking (kernel/sched/core.c): sched_check_invariants()
runs at the end of every context switch and verifies three properties:
MutualExclusion (the selected task is in RUNNING state), SingleExecution (no
other hart is running the same task), and QueueConsistency (run queue bitmap
matches actual queue occupancy, all queued tasks are in READY state). The check uses
spin_trylock to avoid deadlock.
Callout telemetry (kernel/timer/callout.c): per-CPU lateness histogram bins
track how late each callout fires relative to its deadline. Six bins span 0-10us
to >100ms. The counts are aggregated via callout_get_stats() and exposed in the
/api/stats JSON. Callbacks more than 100us late are counted as missed.
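Lateness binning can be sketched as below. The text gives only the first and last bins, so the decade boundaries in between (100us, 1ms, 10ms) and the inclusive upper edges are assumptions, as are all names. The 100us missed threshold is from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_BINS 6
#define MISSED_THRESHOLD_US 100u /* from the text */

/* Assumed upper edges (inclusive) of bins 0..4; bin 5 is everything above. */
static const uint64_t bin_upper_us[N_BINS - 1] = {
    10, 100, 1000, 10000, 100000
};

struct callout_stats_sketch {
    uint64_t bins[N_BINS];
    uint64_t missed;
};

/* Map a lateness value in microseconds to its histogram bin. */
static unsigned lateness_bin(uint64_t late_us)
{
    for (unsigned b = 0; b < N_BINS - 1; b++)
        if (late_us <= bin_upper_us[b])
            return b;
    return N_BINS - 1;
}

/* Record one callout firing late_us microseconds after its deadline. */
static void callout_record(struct callout_stats_sketch *s, uint64_t late_us)
{
    assert(s != NULL);
    s->bins[lateness_bin(late_us)]++;
    if (late_us > MISSED_THRESHOLD_US)
        s->missed++;
}
```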
Mazu is available under a permissive
MIT-style license.
Use of this source code is governed by an MIT license that can be found
in the LICENSE file.