gojaccl is a Go implementation of the JACCL collective communication model for Apple RDMA over Thunderbolt. It provides a small Go API, a daemon-backed resource owner for macOS Thunderbolt RDMA, and operator proof packets for the hardware paths that require explicit physical validation.
The public package is jaccl. It exposes a small synchronous API around a live
communication group:
- NewGroup and NewGroupFromEnv initialize a rank.
- Barrier, Send, and Recv provide control and byte-oriented point-to-point operations.
- NewSendWriter and NewRecvReader expose point-to-point traffic as io.Writer and io.Reader streams.
- AllSum, AllMax, AllMin, and AllGather provide typed collectives over the supported Element types.
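As a rough reference model for what the typed collectives compute, the sketch below runs the reductions over in-memory slices, one slice per rank. It does not call the jaccl package: the `Number` constraint is a hypothetical stand-in for the package's Element constraint, and the function names and signatures are illustrative only.

```go
package main

import "fmt"

// Number is a hypothetical stand-in for the package's Element constraint.
type Number interface {
	~int32 | ~int64 | ~float32 | ~float64
}

// allSum returns the element-wise sum of every rank's input.
// All inputs are assumed to have the same length.
func allSum[T Number](inputs [][]T) []T {
	if len(inputs) == 0 {
		return nil
	}
	out := make([]T, len(inputs[0]))
	for _, in := range inputs {
		for i, v := range in {
			out[i] += v
		}
	}
	return out
}

// allMax returns the element-wise maximum of every rank's input.
// All inputs are assumed to have the same length.
func allMax[T Number](inputs [][]T) []T {
	if len(inputs) == 0 {
		return nil
	}
	out := append([]T(nil), inputs[0]...)
	for _, in := range inputs[1:] {
		for i, v := range in {
			if v > out[i] {
				out[i] = v
			}
		}
	}
	return out
}

// allGather concatenates every rank's input in rank order.
func allGather[T Number](inputs [][]T) []T {
	var out []T
	for _, in := range inputs {
		out = append(out, in...)
	}
	return out
}

func main() {
	// Index 0 is rank 0's contribution, index 1 is rank 1's.
	ranks := [][]int32{{1, 2, 3}, {10, 20, 30}}
	fmt.Println(allSum(ranks))    // [11 22 33]
	fmt.Println(allMax(ranks))    // [10 20 30]
	fmt.Println(allGather(ranks)) // [1 2 3 10 20 30]
}
```

In a live group every rank would hold the same result after the call; here the single return value stands for that shared result.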
The implementation keeps RDMA details internal. Callers pass ordinary Go slices; the backend copies through persistent mmap-backed staging buffers and does not register caller heap memory in the hot path.
Streaming uses the standard library:
w, err := g.NewSendWriter(ctx, 1)
if err != nil {
	return err
}
if _, err := io.Copy(w, file); err != nil {
	_ = w.Close()
	return err
}
return w.Close()

The receiving rank uses NewRecvReader and io.Copy. Use bufio.NewReader
or bufio.NewWriter around the returned values when buffering is useful.
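The buffering advice can be tried without hardware. In the minimal sketch below, an `io.Pipe` pair stands in for the values returned by NewSendWriter and NewRecvReader, which is an assumption made purely so the example runs anywhere. `transfer` and `roundTrip` are hypothetical helper names.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// transfer copies src to dst through an in-memory pipe. The pipe ends are
// stand-ins for a send writer and a receive reader; no RDMA is involved.
func transfer(dst io.Writer, src io.Reader) (int64, error) {
	pr, pw := io.Pipe()
	// Closing the read end unblocks the sender if the receiver stops early.
	defer pr.Close()

	// Sending side: buffer writes, flush, and always close the writer.
	go func() {
		bw := bufio.NewWriter(pw)
		_, err := io.Copy(bw, src)
		if err == nil {
			err = bw.Flush() // buffered bytes are lost without a flush
		}
		pw.CloseWithError(err) // a nil error gives the reader a clean io.EOF
	}()

	// Receiving side: wrap the reader and drain it with io.Copy.
	return io.Copy(dst, bufio.NewReader(pr))
}

// roundTrip sends a string through transfer and returns what arrived.
func roundTrip(s string) (string, error) {
	var out bytes.Buffer
	if _, err := transfer(&out, strings.NewReader(s)); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	got, err := roundTrip("hello over a stream")
	fmt.Println(got, err)
}
```

The point to carry over is the flush before close: a bufio.Writer holds data back until Flush, so closing the underlying writer first would truncate the stream.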
cmd/jaccld is the daemon path for macOS Thunderbolt RDMA resource ownership.
It keeps the device, protection domain, and global registered slab in one
process and serves local clients over a Unix-domain socket. The daemon IPC
protocol leases and maps the slab, exposes explicit resource session leases,
then asks the daemon-owned RDMA transport to send, receive, or synchronize over
slab offsets. Daemon-backed collectives submit asynchronous work and wait for
completion over the same control connection. The IPC surface also has an
explicit maintenance request for the gated data-QP maintenance operation; it is
not a background keepalive.
The daemon is the production deployment model, not optional gold-plating. Apple's
Thunderbolt RDMA provider has three limits that motivate it. It releases
protection domains incompletely, so allocation fails after roughly sixty
transfers and only a reboot recovers it. It caps a host at 100 memory regions.
It degrades idle queue pairs after about twenty-three minutes.
A boot-scoped daemon that owns exactly one device, protection domain,
and registered slab keeps those scarce, leaky resources under a single accountant
so a job does not exhaust them mid-run. The default socket is user-scoped
($XDG_RUNTIME_DIR/jaccld/jaccld.sock, or /tmp/jaccld-$UID/jaccld.sock), so two
users' daemons do not collide on a shared host; override it with
JACCL_DAEMON_SOCKET or the -socket flag.
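The socket lookup described above can be sketched as a small pure function. This is an illustration, not the daemon's code: `socketPath` is a hypothetical name, and the precedence of the -socket flag over JACCL_DAEMON_SOCKET is an assumption, since the text only says that either one overrides the default.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strconv"
)

// socketPath resolves the daemon socket path.
// Assumed precedence: flag value, then JACCL_DAEMON_SOCKET, then the
// user-scoped default under XDG_RUNTIME_DIR, then the /tmp fallback.
func socketPath(flagValue string, getenv func(string) string, uid int) string {
	if flagValue != "" {
		return flagValue
	}
	if p := getenv("JACCL_DAEMON_SOCKET"); p != "" {
		return p
	}
	if dir := getenv("XDG_RUNTIME_DIR"); dir != "" {
		return path.Join(dir, "jaccld", "jaccld.sock")
	}
	return path.Join("/tmp", "jaccld-"+strconv.Itoa(uid), "jaccld.sock")
}

func main() {
	// Resolve against the real environment of the current user.
	fmt.Println(socketPath("", os.Getenv, os.Getuid()))
}
```

Passing `getenv` and `uid` as parameters keeps the function testable without mutating the process environment.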
Backend selection is explicit. Empty or auto uses the current direct backend;
direct selects it intentionally. daemon selects the IPC client backend for
barrier, point-to-point operations, and daemon-supported collectives.
Daemon clients can be configured with only rank, size, and daemon socket; they
do not need the direct backend coordinator or RDMA device matrix.
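The selection rule reads naturally as a switch. The sketch below is illustrative only: `resolveBackend` is a hypothetical name, and both the exact-match comparison and the error for unknown names are assumptions rather than documented behavior.

```go
package main

import "fmt"

// resolveBackend maps a configured backend name to the backend that would
// be used. Empty and "auto" fall through to the direct backend.
func resolveBackend(name string) (string, error) {
	switch name {
	case "", "auto", "direct":
		return "direct", nil
	case "daemon":
		return "daemon", nil
	default:
		// Assumption: unknown names are rejected instead of defaulted.
		return "", fmt.Errorf("unknown backend %q", name)
	}
}

func main() {
	for _, name := range []string{"", "auto", "direct", "daemon", "bogus"} {
		backend, err := resolveBackend(name)
		fmt.Printf("%q -> %q, err=%v\n", name, backend, err)
	}
}
```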
The source/static and non-live Go surface is production-ready for provider-free behavior, daemon IPC/control-plane behavior, bounded diagnostics, and operator documentation. The local and remote non-live gates passed during closeout:
GOWORK=off go test -run '^$' ./...
GOWORK=off go test -count=1 ./...
GOWORK=off go vet ./...
CGO_ENABLED=0 GOWORK=off go test -count=1 ./...
GOWORK=off go test -race -count=1 ./...

The command GOWORK=off go test -count=1 ./... also passed in the nested
examples/mlx-go-two-m4 module. The broad go test ./... gates may query
Darwin provider availability through internal/rdma, but live RDMA integration
gates remain manual.
The current bounded two-host Apple Thunderbolt RDMA proof passed at commit
42977085e1ebce882ea264019d491f97d37d6f93. It used local provider rdma_en3,
remote provider rdma_en1, physical route interfaces en3 and en1, alias
addresses 169.254.240.245 and 169.254.240.246, and selected GID index 1
on both hosts. The proof completed physical setup, RTR, ipc_listen,
daemon-backed smoke, and 120 same-data-QP maintenance rounds per rank over
7200 seconds. All 240 round status files were zero and cleanup found no
run-owned processes.
Evidence:
- /Users/tmc/tmp/gojaccl-rdma-soak-current-20260526T181725Z/proof/final-summary.json
- /Users/tmc/tmp/gojaccl-rdma-soak-current-20260526T181725Z.tar.gz
- SHA256 849c0b677510270267da6062c2f38da8c5c66780cd9b7737b148d58c15bf6713
Treat matrix behavior, arbitrary device pairs, arbitrary route/GID/cable layouts, and future binaries as explicit non-claims until a fresh bounded artifact proves that exact scope.
Hardware RDMA tests are intentionally not part of ordinary validation on macOS. Tests that transition queue pairs to RTR require explicit one-shot operator confirmation and a real physical topology. The top-level integration tests do not run a local loopback RTR experiment by default; Apple Thunderbolt RDMA is a point-to-point link between hosts, not a same-host loopback fabric.
JACCL_TEST_RDMA=1 JACCL_TEST_RDMA_ALLOW_RTR=1 go test -run '^TestIntegration' .

To run a physical test, start one TestIntegrationChild process per host with
JACCL_TEST_RDMA_CHILD=1 and JACCL_TEST_RDMA_ALLOW_RTR=1, distinct
JACCL_TEST_RANK values, and the same reachable JACCL_TEST_COORDINATOR.
The canonical topology input is an explicit JSON device matrix in
JACCL_TEST_RDMA_DEVICES. That matrix may describe a complete mesh, or a sparse
connected topology supported by the backend. The legacy single-device shorthand
JACCL_TEST_RDMA_DEVICE is accepted only for two-rank integration helpers,
because it expands to a complete matrix. Use JACCL_TEST_RDMA_DEVICES for every
three-or-more-rank attempt. A sparse three-host line should leave the
endpoint-to-endpoint entries empty in the matrix; no additional topology flag is
required.
Before any physical three-or-more-rank attempt, validate the matrix offline:
go run ./cmd/jacclproof topology -file devices.json

This prints the selected topology, rank count, directed and empty edge counts, wire counts, and the matrix SHA-256. It does not open RDMA devices, start a coordinator, allocate queue pairs, or authorize RTR.
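To make the edge counts concrete, here is a minimal offline sketch that summarizes a sparse three-host line. The matrix schema is an assumption: the sketch treats the file as a rank-by-rank JSON array in which each entry is a list of device names and an empty list means no link. The real jacclproof format, its connectivity rule, and the bytes it hashes may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// lineMatrix is a hypothetical three-rank line 0 <-> 1 <-> 2.
// The endpoint-to-endpoint entries (0,2) and (2,0) are left empty.
const lineMatrix = `[[[],["rdma_en1"],[]],[["rdma_en1"],[],["rdma_en2"]],[[],["rdma_en2"],[]]]`

type report struct {
	Ranks, Directed, Empty int
	Connected              bool
	SHA256                 string
}

// summarize counts off-diagonal edges and hashes the raw matrix bytes.
func summarize(raw []byte) (report, error) {
	var m [][][]string
	if err := json.Unmarshal(raw, &m); err != nil {
		return report{}, err
	}
	r := report{Ranks: len(m)}
	for i, row := range m {
		if len(row) != len(m) {
			return report{}, fmt.Errorf("row %d has %d entries, want %d", i, len(row), len(m))
		}
		for j, devs := range row {
			switch {
			case i == j: // the diagonal is not an edge
			case len(devs) == 0:
				r.Empty++
			default:
				r.Directed++
			}
		}
	}
	r.Connected = connected(m)
	sum := sha256.Sum256(raw)
	r.SHA256 = hex.EncodeToString(sum[:])
	return r, nil
}

// connected reports whether every rank is reachable from rank 0.
// Assumption: a link counts if either direction lists a device.
func connected(m [][][]string) bool {
	if len(m) == 0 {
		return false
	}
	seen := make([]bool, len(m))
	seen[0] = true
	stack, reached := []int{0}, 1
	for len(stack) > 0 {
		i := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for j := range m {
			if !seen[j] && (len(m[i][j]) > 0 || len(m[j][i]) > 0) {
				seen[j] = true
				reached++
				stack = append(stack, j)
			}
		}
	}
	return reached == len(m)
}

func main() {
	r, err := summarize([]byte(lineMatrix))
	fmt.Printf("%+v err=%v\n", r, err)
}
```

For the line above, the sketch counts four directed edges and two empty ones, and still reports the topology as connected because rank 1 bridges the endpoints.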
macOS Thunderbolt RDMA provider failures can leave uninterruptible processes, so do not run the RTR gate casually.
For the current two-M4 setup, do not model a missing third host. Generate an explicit two-rank matrix for the cables that are physically connected:
go run ./cmd/jacclproof devices \
-ranks 2 \
-devices rdma_en1,rdma_en3 \
> /tmp/gojaccl-two-m4-devices.json
go run ./cmd/jacclproof topology -file /tmp/gojaccl-two-m4-devices.json

If the attached cable maps to different RDMA device names on the two hosts,
override the directed rows instead of forcing a symmetric name. For example, if
rank 0 uses rdma_en3 and rank 1 uses rdma_en2:
go run ./cmd/jacclproof devices \
-ranks 2 \
-edge 0,1=rdma_en3 \
-edge 1,0=rdma_en2 \
> /tmp/gojaccl-two-m4-devices.json

The topology report lists both devices and primary_devices. The direct
backend currently opens the first usable device listed for each peer edge, so a
dual-cable matrix is useful topology evidence but not a claim that both cables
carried datapath traffic in one run. Use separate metadata packets for each
device and keep soak claims on an explicit bounded proof path.
The hardware proof mode starts a static daemon rank with explicit rank metadata:
jaccld -rank 0 -size 2 -coordinator 127.0.0.1:9000

That is not the final once-per-boot production topology model. Production
jaccld should start once per host boot with hardware and IPC options, then
admit planner-supplied collective sessions dynamically after startup. Those
sessions carry only session ID, epoch, peer control endpoints, optional device
hints, and a deadline; rank assignment, topology choice, and peer selection stay
outside jaccld. Until rank, size, and peer selection move out of startup, the
daemon-backed dynamic-topology production claim remains unproved.
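The session contents listed above could be modeled along the following lines. This is a sketch of the described shape, not an existing type: every name here (`Session`, its fields, `newSession`, `admit`) is hypothetical, and the validation rules are assumptions about what a daemon might reasonably check at admission.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Session carries only what the text lists. Rank assignment, topology
// choice, and peer selection are deliberately absent.
type Session struct {
	ID            string
	Epoch         uint64
	PeerEndpoints []string  // peer control endpoints
	DeviceHints   []string  // optional
	Deadline      time.Time // admission is refused once this has passed
}

// Validate applies assumed admission checks at the given time.
func (s Session) Validate(now time.Time) error {
	switch {
	case s.ID == "":
		return errors.New("session ID is required")
	case len(s.PeerEndpoints) == 0:
		return errors.New("at least one peer control endpoint is required")
	case !s.Deadline.After(now):
		return errors.New("session deadline has already passed")
	}
	return nil
}

// newSession builds a session whose deadline is ttlSeconds from now.
func newSession(id string, epoch uint64, peers []string, ttlSeconds int) Session {
	return Session{
		ID:            id,
		Epoch:         epoch,
		PeerEndpoints: peers,
		Deadline:      time.Now().Add(time.Duration(ttlSeconds) * time.Second),
	}
}

// admit validates a session against the current time.
func admit(s Session) error { return s.Validate(time.Now()) }

func main() {
	s := newSession("job-1", 1, []string{"127.0.0.1:9001"}, 30)
	fmt.Println(admit(s))
	fmt.Println(admit(newSession("job-2", 1, nil, 30)))
}
```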
The default production control plane uses loopback tcpchan addresses,
normally through SSH local forwards between hosts. Non-loopback coordinators are
rejected unless -allow-remote-tcpchan is set after an explicit jacclctl tcp-diagnostic proof. Direct non-loopback tcpchan is currently proven only
for the documented two-host rdma_en1 IP pair.
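The loopback rule can be expressed with the standard net package. The sketch below shows the kind of check involved, not the daemon's actual code: `checkCoordinator` is a hypothetical name, and requiring an IP literal instead of a hostname is an assumption made to keep the check free of DNS lookups.

```go
package main

import (
	"fmt"
	"net"
)

// checkCoordinator accepts loopback coordinator addresses and rejects
// everything else unless allowRemote is set.
func checkCoordinator(addr string, allowRemote bool) error {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("coordinator %q: %w", addr, err)
	}
	ip := net.ParseIP(host)
	if ip == nil {
		// Assumption: hostnames are refused so no resolver is consulted.
		return fmt.Errorf("coordinator %q: host must be an IP literal", addr)
	}
	if ip.IsLoopback() || allowRemote {
		return nil
	}
	return fmt.Errorf("coordinator %q is not loopback; remote tcpchan is not allowed", addr)
}

func main() {
	for _, addr := range []string{"127.0.0.1:9000", "[::1]:9000", "169.254.240.245:9000"} {
		fmt.Printf("%s -> %v\n", addr, checkCoordinator(addr, false))
	}
}
```

With SSH local forwards, both hosts dial a 127.0.0.1 address, so a check like this passes without the remote override.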
Before attempting a new hardware path, collect provider metadata without moving any queue pair to RTR:
jacclctl rdma-metadata -device rdma_en1 -max-gids 1024
jacclctl rdma-metadata -device rdma_en3 -max-gids 64

This opens the device and queries port/GID metadata only. It does not allocate PDs, MRs, CQs, or QPs, and it does not post work requests.
For cross-host evidence, use the jacclproof packet command instead of ad hoc
commands:
go run ./cmd/jacclproof rdma-metadata \
-device rdma_en1 \
-remote <peer-ssh> \
-remote-tmp <peer-tmp-dir> \
-expected-selected-gid-index <expected-gid-index>

The command preserves a timestamped artifact under ~/tmp and still does not
authorize RTR. Its final evaluator only classifies metadata collection.
The next no-RTR preflight is allocation-only:
go run ./cmd/jacclproof rdma-alloc \
-device rdma_en2 \
-remote-device rdma_en3 \
-remote <peer-ssh> \
-remote-tmp <peer-tmp-dir>

This packet allocates and tears down a protection domain, memory region, completion queue, and queue pair on each host. It does not transition the queue pair to RTR and does not post work requests.
If metadata succeeds but allocation fails with
rdma provider returned nil handle, classify the run as a provider resource
initialization failure. This is earlier than RTR: no destination JSON exists,
no peer address is consumed, and the result is not an errno 60 transition
failure. The error includes the device name so artifacts can distinguish
rdma_en2 from rdma_en3 after reboot or cable changes. RTR diagnostic
artifacts record this as failure_class=provider_nil_handle.
After allocation passes, an INIT-only packet can prove the first local QP state transition without crossing into RTR:
go run ./cmd/jacclproof rdma-init \
-device rdma_en2 \
-remote-device rdma_en3 \
-remote <peer-ssh> \
-remote-tmp <peer-tmp-dir>

RTR, RTS, and datapath work requests remain separate hardware gates.
To isolate provider INIT->RTR failures without starting jaccld or using
the tcpchan side channel, run the bounded RTR diagnostic on both hosts and
exchange the destination JSON files out of band:
JACCLCTL_RDMA_RTR_DIAGNOSTIC_ONE_SHOT=one-shot-rtr \
jacclctl rdma-rtr-diagnostic \
-device rdma_en3 \
-route-interface en0 \
-peer-route-interface en0 \
-artifact /tmp/gojaccl-rtr-diag-local \
-peer-destination /tmp/gojaccl-rtr-diag-local/peer-destination.json \
-allow-rtr

The command writes local-destination.json, waits for
peer-destination.json, runs only the RESET-to-INIT and INIT-to-RTR queue-pair
transitions, and writes rtr-diagnostic-report.json. It does not move the
queue pair to RTS, post work requests, or claim datapath success. Provider
errors include symbolic errno text, such as errno 60 (ETIMEDOUT), and the QP
transition mask.
If the command fails before writing local-destination.json with
rdma provider returned nil handle, stop at the allocation boundary above. Do
not interpret that artifact as RTR evidence. A true INIT-to-RTR timeout is
classified separately as failure_class=rtr_errno_60.
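The two failure classes can be told apart mechanically from the error text quoted above. The sketch below is a hedged illustration: `classify` is a hypothetical name, substring matching is an assumption (the real tool may use typed errors), and the `unclassified` label is invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// classify maps an error message to a failure_class value.
// The nil-handle check runs first because it happens at the allocation
// boundary, before any RTR transition is attempted.
func classify(msg string) string {
	switch {
	case msg == "":
		return ""
	case strings.Contains(msg, "rdma provider returned nil handle"):
		return "provider_nil_handle"
	case strings.Contains(msg, "errno 60"):
		return "rtr_errno_60"
	default:
		return "unclassified" // hypothetical label, not from the source
	}
}

func main() {
	// Both messages are made-up examples built around the quoted fragments.
	msgs := []string{
		"rdma_en2: rdma provider returned nil handle",
		"INIT->RTR failed: errno 60 (ETIMEDOUT)",
		"something else",
	}
	for _, m := range msgs {
		fmt.Printf("%q -> %s\n", m, classify(m))
	}
}
```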
An operator can trigger the explicit maintenance operation through the daemon socket:
jacclctl maintain -timeout 5s

The one-shot hardware reproof packet is jacclproof rdma-soak. It runs the
safe gates with live RDMA test environment cleared, metadata packet, direct TCP
diagnostic, supervised daemons, pre/post smoke, 60-second maintenance cadence,
stats captures, postflight, cleanup, and artifact packaging. It refuses without
CONFIRM_RDMA_EN1_SOAK_ONE_SHOT=one-shot-soak.
Operators can inspect daemon resource leases and jaccld-observed provider slot counters without touching RDMA hardware:
jacclctl stats

The slot ledger is scoped to the current OS boot and to resources allocated by
jaccld itself. It reports protection-domain, memory-region, queue-pair, and
completion-queue opens, close calls, failures, outstanding opens, and resources
live in the current daemon process. It does not claim to see slots consumed by
unrelated processes.
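One way to picture a single row of the slot ledger is a small counter type. This is a sketch under assumptions, not the daemon's data structure: `slotCounter` and its methods are hypothetical, and it assumes that failed opens are counted separately from successful ones.

```go
package main

import "fmt"

// slotCounter tracks one resource kind, for example memory regions.
type slotCounter struct {
	Opens    uint64 // successful opens
	Closes   uint64 // close calls issued
	Failures uint64 // failed open attempts
}

// RecordOpen counts an open attempt by outcome.
func (c *slotCounter) RecordOpen(ok bool) {
	if ok {
		c.Opens++
		return
	}
	c.Failures++
}

// RecordClose counts a close call.
func (c *slotCounter) RecordClose() { c.Closes++ }

// Outstanding is opens minus close calls. It only reflects what this
// process did; it cannot tell whether the provider really freed a slot.
func (c slotCounter) Outstanding() uint64 {
	if c.Closes >= c.Opens {
		return 0
	}
	return c.Opens - c.Closes
}

func main() {
	var mr slotCounter
	mr.RecordOpen(true)
	mr.RecordOpen(true)
	mr.RecordOpen(false)
	mr.RecordClose()
	fmt.Printf("%+v outstanding=%d\n", mr, mr.Outstanding())
}
```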
Daemon-backed integration tests use the same TestIntegrationChild helper as
the direct backend, with JACCL_BACKEND=daemon and JACCL_DAEMON_SOCKET set to
the local daemon socket for each rank. Custom daemon socket paths must be placed
in an owner-only directory.
-no-rdma starts only the IPC server and slab allocator for hardware-free
smoke tests.
Daemon-backed RDMA_WRITE heartbeats are disabled by default and are not the production keepalive path on Apple Thunderbolt RDMA, whose observed registered memory has remote key zero. Background same-data-QP SEND/RECV heartbeats are also rejected because receive matching is remote FIFO, and WR IDs are local completion metadata, not wire tags.
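The FIFO argument is easier to see in a toy model. The simulation below involves no RDMA at all: it pairs posted receive buffers with arriving messages strictly in order, which is the behavior the paragraph describes. `fifoMatch` and the buffer labels are hypothetical.

```go
package main

import "fmt"

// match records which arriving message consumed which posted receive.
type match struct {
	Buffer  string // label of the posted receive buffer
	Message string // label of the message that landed in it
}

// fifoMatch pairs receives and messages purely by arrival order.
// Nothing on the wire says which receive a message was meant for.
func fifoMatch(posted, sent []string) []match {
	n := len(posted)
	if len(sent) < n {
		n = len(sent)
	}
	out := make([]match, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, match{Buffer: posted[i], Message: sent[i]})
	}
	return out
}

func main() {
	// The receiver expects two data chunks on the data QP.
	posted := []string{"recv-for-chunk-0", "recv-for-chunk-1"}
	// A background heartbeat slips in ahead of the data.
	sent := []string{"heartbeat", "chunk-0", "chunk-1"}
	for _, m := range fifoMatch(posted, sent) {
		fmt.Printf("%s <- %s\n", m.Buffer, m.Message)
	}
}
```

In this model the heartbeat lands in the buffer meant for chunk 0 and every later chunk is shifted by one, which is why maintenance traffic is run explicitly with admission stopped instead of in the background.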
The intended hardware envelope is explicit same-data-QP maintenance, not a
background heartbeat: two Apple Thunderbolt RDMA hosts, admission stopped on all
ranks, peer locks held, side-channel pre/post barriers, and fail-closed route
poisoning on any provider, CQ, barrier, or maintenance error. Historical
artifacts proved this shape for captured rdma_en1 binaries, and the
current-head proof extends it to the tested local rdma_en3 to remote
rdma_en1 topology at commit
42977085e1ebce882ea264019d491f97d37d6f93. Later commits still do not inherit
that physical proof automatically.
The current production-ready claim covers source/static behavior and the named two-host Apple Thunderbolt RDMA proof. RDMA_WRITE heartbeat production readiness, arbitrary rank counts, arbitrary device layouts, arbitrary non-loopback deployments, and matrix coverage remain excluded until separately proven by a fresh bounded artifact.
This module depends on the released github.com/tmc/apple module. Local
experiments may use an uncommitted workspace or command-line module override,
but committed module metadata must stay portable.
Design and validation artifacts live under docs/:
- docs/go-jaccl-spec.md
- docs/go-package-files.md
- docs/jaccld.md
- docs/jaccld-dynamic-topology.md
- docs/jaccld-keepalive.md
- docs/jaccld-data-qp-keepalive.md
- docs/operator-runbook.md
- docs/rdma-apple-thunderbolt.md
- docs/production-readiness.md
- docs/performance-baseline.md
Generated transcripts, timestamped proof packets, and local audit closeouts are kept out of git. Preserve those under the artifact directory for the specific run.