Caveat: The following report was generated with heavy AI assistance and has only been lightly reviewed before posting. Nevertheless, the underlying assumptions were tested fairly thoroughly so hopefully it's correct.
Summary
Every CALL dolt_fetch('<remote>', '<branch>') against an SSH remote leaves one ssh subprocess alive, parented to the dolt sql-server process. The child stays sleeping in poll() because the server never closes the pipes connected to the child's stdin/stdout. Over many fetches these accumulate until the remote sshd / container runs out of PIDs or SSH sessions.
The leak is deterministic (1 child per fetch), occurs on the server side (not the client-side dolt sql -q that invokes the CALL), and reproduces against a plain localhost OpenSSH — it is not specific to any particular SSH endpoint, proxy, or ControlMaster configuration.
Environment
- Dolt versions reproduced:
1.84.0 (released binary)
1.86.2 (freshly built from main @ dfecd55771, Go 1.25.6)
- OS: Ubuntu 25.10, kernel 6.17.0-14-generic, x86_64
- OpenSSH: distro default (9.x)
- Remote URL shape:
ssh://user@host:port/path/to/db/.dolt
- Reproduced against:
- Remote dolt on a Railway container (via Railway TCP proxy + SSH)
- Plain local dolt DB on same host via
ssh://ubuntu@localhost:22/...
Reproduction
Minimal repro (localhost only, no external infra required):
# 1. Build / use a dolt binary. Repro confirmed on 1.84.0 and 1.86.2.
DOLT=/path/to/dolt
# 2. Create a throwaway "remote" dolt DB.
mkdir -p /tmp/dolt-leak-test && cd /tmp/dolt-leak-test
$DOLT init --initial-branch main
$DOLT sql -q "CREATE TABLE t (id INT PRIMARY KEY);
INSERT INTO t VALUES (1),(2),(3);"
$DOLT add . && $DOLT commit -m init
# 3. Create a scratch local DB and start a sql-server for it.
mkdir -p /tmp/dolt-leak-local && cd /tmp/dolt-leak-local
$DOLT init --initial-branch main
$DOLT sql-server --host 127.0.0.1 --port 13999 --loglevel warning &
SERVER_PID=$!
sleep 3
# 4. Configure an ssh:// remote pointing at the throwaway DB and
# fetch from it repeatedly. Count leaked ssh children parented
# to the sql-server.
DB='`dolt-leak-local`'   # literal backticks SQL-quote the hyphenated name
$DOLT --host 127.0.0.1 --port 13999 --no-tls sql -q \
"USE $DB;
CALL dolt_remote('add', 'r',
'ssh://$USER@localhost:22/tmp/dolt-leak-test/.dolt')"
for i in 1 2 3; do
$DOLT --host 127.0.0.1 --port 13999 --no-tls sql -q \
"USE $DB; CALL dolt_fetch('r', 'main')" >/dev/null
sleep 1
echo "after fetch $i: $(pgrep -P $SERVER_PID -af \
'ssh.*dolt.*transfer' | wc -l) leaked ssh children"
done
Observed output: count increments by exactly 1 per fetch.
after fetch 1: 1 leaked ssh children
after fetch 2: 2 leaked ssh children
after fetch 3: 3 leaked ssh children
Same behaviour with
DOLT_SSH_COMMAND="ssh -o ControlMaster=no -o ControlPath=none -o ControlPersist=no"
— ControlMaster multiplexing is not involved.
Diagnostic findings
Each leaked child process (verified via /proc):
State: S (sleeping)
wchan: poll_schedule_timeout
PPid: <dolt sql-server PID>
Open FDs on the leaked child:
fd 0 -> pipe:[A] # stdin (read end)
fd 1 -> pipe:[B] # stdout (write end)
fd 2 -> pipe:[C] # stderr (write end)
fd 3 -> socket:[...] # TCP to sshd
Cross-checking the pipe peer FDs via /proc/*/fd/*:
- Pipe A write end is held by the
dolt sql-server process.
- Pipe B read end is held by the
dolt sql-server process.
i.e. the sql-server still holds its end of both pipes open after dolt_fetch has already returned success to the client. The ssh child therefore blocks forever in poll(): stdin never sees EOF, no data arrives, and it has no reason to exit.
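The cross-check above can be scripted. A minimal sketch (Linux-only; find_pipe_peers is a hypothetical helper name, not anything in dolt): given a PID and an fd number, it prints every /proc fd entry that points at the same pipe inode.

```shell
#!/bin/sh
# find_pipe_peers PID FD: list every /proc/<pid>/fd entry that points at
# the same pipe inode as PID's FD. Linux /proc only; unreadable fd dirs
# of other users' processes are silently skipped.
find_pipe_peers() {
  pipe=$(readlink "/proc/$1/fd/$2") || return 1   # e.g. pipe:[12345]
  for link in /proc/[0-9]*/fd/*; do
    [ "$(readlink "$link" 2>/dev/null)" = "$pipe" ] && echo "$link"
  done
}

# Usage against a leaked child (SERVER_PID as in the repro above):
# CHILD=$(pgrep -P "$SERVER_PID" -f 'ssh.*dolt.*transfer' | head -n1)
# find_pipe_peers "$CHILD" 0
```

Run against a leaked child's fd 0 or fd 1, the output should include an fd entry under the sql-server's PID.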
Remote side: each orphan ssh client keeps a channel open to the remote sshd, which keeps a forked dolt ... transfer subprocess alive on the remote host. On a container with a tight PID cgroup (e.g. a Railway service), this rapidly exhausts the container's PID budget and every subsequent process spawn fails with:
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT
on the next invocation of any dolt command inside the container (including dolt version). That abort is a correct response from the Go runtime to PID exhaustion — but the root cause is this leak.
The crash trace from the remote in that state includes:
github.com/dolthub/dolt/go/store/nbs.(*tableSet).rebase
go/store/nbs/table_set.go:576
github.com/dolthub/dolt/go/store/nbs.newNomsBlockStore
go/store/nbs/store.go:845
github.com/dolthub/dolt/go/store/nbs.newLocalStore
go/store/nbs/store.go:763
Expected behaviour
After CALL dolt_fetch completes, the server should close its ends of the pipes to each spawned ssh child, causing the child to see stdin EOF and exit cleanly. No process should remain parented to the sql-server after the CALL returns.
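As a sanity check on that expectation, the EOF mechanics can be demonstrated in isolation (a sketch using a named pipe as a stand-in for the exec pipes; no dolt code involved): a child blocked reading its stdin exits the moment the last writer closes.

```shell
#!/bin/sh
# Stand-in for the server/child pipe relationship: `cat` plays the ssh
# child blocked on its stdin, fd 9 plays the server's write end.
d=$(mktemp -d)
mkfifo "$d/p"
cat <"$d/p" >/dev/null &   # child: blocks reading its stdin
CHILD=$!
exec 9>"$d/p"              # server: holds the write end open
kill -0 "$CHILD" && echo "child alive while write end is open"
exec 9>&-                  # server closes its end -> child sees stdin EOF
wait "$CHILD" && echo "child exited cleanly after EOF"
rm -rf "$d"
```

This is exactly the close the server appears to be skipping: as long as fd 9 stays open, `cat` sleeps in its read forever.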
Likely location
The SSH remote / chunk-transfer driver inside dolt sql-server. The os/exec.Cmd (or equivalent) that spawns $DOLT_SSH_COMMAND <args> dolt ... transfer appears never to have Wait() called on it, and/or never has its stdin/stdout pipes closed, after the transfer completes on the session's control path.
Workarounds
None that are in-process. Callers can mitigate by periodically killing leaked ssh ... dolt ... transfer children parented to the sql-server, or by restarting the sql-server between sync batches. Neither is a real fix.
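The out-of-process mitigation can be scripted as a periodic sweep (a sketch; sweep_leaked is our name, and the pgrep pattern is the one from the repro — adjust both to your deployment):

```shell
#!/bin/sh
# sweep_leaked PARENT_PID PATTERN: kill every child of PARENT_PID whose
# command line matches PATTERN, reporting each kill. One pass; run it
# from cron or a loop until the underlying leak is fixed.
sweep_leaked() {
  pgrep -P "$1" -f "$2" | while read -r pid; do
    kill "$pid" 2>/dev/null && echo "killed leaked child $pid"
  done
}

# Example (SERVER_PID = the dolt sql-server PID):
# while :; do sweep_leaked "$SERVER_PID" 'ssh.*dolt.*transfer'; sleep 60; done
```

Killing the local ssh client also closes its TCP connection, so the remote sshd tears down the session and reaps the forked dolt ... transfer process, freeing the container's PID budget.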