dolt sql-server leaks one ssh child process per CALL dolt_fetch against an ssh:// remote #10897

@aspiers

Caveat: The following report was generated with heavy AI assistance and has only been lightly reviewed before posting. Nevertheless, the underlying assumptions were tested fairly thoroughly so hopefully it's correct.

Summary

Every CALL dolt_fetch('<remote>', '<branch>') against an SSH remote leaves one ssh subprocess alive, parented to the dolt sql-server process. The child stays sleeping in poll() because the server never closes the pipes connected to the child's stdin/stdout. Each leaked child also keeps a session open on the remote, so over many fetches they accumulate until the remote sshd or its container runs out of PIDs or SSH sessions.

The leak is deterministic (1 child per fetch), occurs on the server side (not in the client-side dolt sql -q that invokes the CALL), and reproduces against a plain localhost OpenSSH server. It is not specific to any particular SSH endpoint, proxy, or ControlMaster configuration.

Environment

  • Dolt versions reproduced:
    • 1.84.0 (released binary)
    • 1.86.2 (freshly built from main @ dfecd55771, Go 1.25.6)
  • OS: Ubuntu 25.10, kernel 6.17.0-14-generic, x86_64
  • OpenSSH: distro default (9.x)
  • Remote URL shape: ssh://user@host:port/path/to/db/.dolt
  • Reproduced against:
    1. Remote dolt on a Railway container (via Railway TCP proxy + SSH)
    2. Plain local dolt DB on same host via ssh://ubuntu@localhost:22/...

Reproduction

Minimal repro (localhost only, no external infra required):

# 1. Build / use a dolt binary.  Repro confirmed on 1.84.0 and 1.86.2.
DOLT=/path/to/dolt

# 2. Create a throwaway "remote" dolt DB.
mkdir -p /tmp/dolt-leak-test && cd /tmp/dolt-leak-test
$DOLT init --initial-branch main
$DOLT sql -q "CREATE TABLE t (id INT PRIMARY KEY);
              INSERT INTO t VALUES (1),(2),(3);"
$DOLT add . && $DOLT commit -m init

# 3. Create a scratch local DB and start a sql-server for it.
mkdir -p /tmp/dolt-leak-local && cd /tmp/dolt-leak-local
$DOLT init --initial-branch main
$DOLT sql-server --host 127.0.0.1 --port 13999 --loglevel warning &
SERVER_PID=$!
sleep 3

# 4. Configure an ssh:// remote pointing at the throwaway DB and
#    fetch from it repeatedly.  Count leaked ssh children parented
#    to the sql-server.
DB=\`dolt-leak-local\`
$DOLT --host 127.0.0.1 --port 13999 --no-tls sql -q \
  "USE $DB;
   CALL dolt_remote('add', 'r',
     'ssh://$USER@localhost:22/tmp/dolt-leak-test/.dolt')"

for i in 1 2 3; do
  $DOLT --host 127.0.0.1 --port 13999 --no-tls sql -q \
    "USE $DB; CALL dolt_fetch('r', 'main')" >/dev/null
  sleep 1
  echo "after fetch $i: $(pgrep -P $SERVER_PID -af \
    'ssh.*dolt.*transfer' | wc -l) leaked ssh children"
done

Observed output: count increments by exactly 1 per fetch.

after fetch 1: 1 leaked ssh children
after fetch 2: 2 leaked ssh children
after fetch 3: 3 leaked ssh children

Setting
DOLT_SSH_COMMAND="ssh -o ControlMaster=no -o ControlPath=none -o ControlPersist=no"
does not change the behaviour, so ControlMaster multiplexing is not involved.

Diagnostic findings

Each leaked child process (verified via /proc):

State: S (sleeping)
wchan: poll_schedule_timeout
PPid:  <dolt sql-server PID>

Open FDs on the leaked child:

fd 0 -> pipe:[A]       # stdin  (read end)
fd 1 -> pipe:[B]       # stdout (write end)
fd 2 -> pipe:[C]       # stderr (write end)
fd 3 -> socket:[...]   # TCP to sshd
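
For anyone wanting to repeat this check, here is a minimal sketch of how the fields above can be read from /proc on Linux. The helper name inspect_child is hypothetical (it is not part of dolt), and it assumes you run it as the same user as the sql-server, or as root.

```shell
# Hypothetical helper: print the state, parent, wait channel, and open fds of a PID.
inspect_child() {
  pid=$1
  # State and parent PID come from /proc/<pid>/status.
  grep -E '^(State|PPid):' "/proc/$pid/status"
  # The kernel function the process is sleeping in, where the kernel exposes it.
  printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
  # Each entry in /proc/<pid>/fd is a symlink to the pipe, socket, or file behind the fd.
  for f in "/proc/$pid/fd"/*; do
    printf 'fd %s -> %s\n' "${f##*/}" "$(readlink "$f")"
  done
}

# Usage (SERVER_PID as in the reproduction above):
# for pid in $(pgrep -P "$SERVER_PID" -f 'ssh.*dolt.*transfer'); do inspect_child "$pid"; done
```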

Cross-checking the pipe peer FDs via /proc/*/fd/*:

  • Pipe A write end is held by the dolt sql-server process.
  • Pipe B read end is held by the dolt sql-server process.

That is, the sql-server still holds its ends of both pipes open after dolt_fetch has already returned success to the client. The ssh child therefore blocks forever in poll(): its stdin never sees EOF and no data arrives, so nothing ever prompts it to exit.
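
The pipe cross-check described above can be sketched roughly as follows. The function name pipe_peers is hypothetical. It is Linux-only, it only sees processes whose /proc/<pid>/fd you are allowed to read, and it is slow on hosts with many processes because it walks every fd of every PID.

```shell
# Hypothetical helper: for a PID's stdin (fd 0) and stdout (fd 1), report every
# other process that has the same pipe open, i.e. the peer holding the other end.
pipe_peers() {
  pid=$1
  for fd in 0 1; do
    target=$(readlink "/proc/$pid/fd/$fd") || continue
    case $target in
      pipe:*) ;;          # only pipes are of interest here
      *) continue ;;
    esac
    for link in /proc/[0-9]*/fd/*; do
      # Unreadable or vanished entries make readlink fail and are skipped.
      [ "$(readlink "$link" 2>/dev/null)" = "$target" ] || continue
      peer=${link#/proc/}
      peer=${peer%%/*}
      [ "$peer" = "$pid" ] && continue   # skip the child's own fd
      echo "fd $fd $target also open at $link"
    done
  done
}

# Usage: pipe_peers <pid of a leaked ssh child>
```

If the report is right, every line printed for a leaked child should point at an fd under the sql-server's PID.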

Remote side: each leaked ssh client keeps a channel open to the remote sshd, which keeps a forked dolt ... transfer subprocess alive on the remote host. On a container with a tight PID cgroup (e.g. a Railway service), this rapidly exhausts the container's PID budget and every subsequent process spawn fails with:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT

on the next invocation of any dolt command inside the container (including dolt version). The abort is the Go runtime's correct response to PID exhaustion, but the root cause is this leak.

The crash trace from the remote in that state includes:

github.com/dolthub/dolt/go/store/nbs.(*tableSet).rebase
  go/store/nbs/table_set.go:576
github.com/dolthub/dolt/go/store/nbs.newNomsBlockStore
  go/store/nbs/store.go:845
github.com/dolthub/dolt/go/store/nbs.newLocalStore
  go/store/nbs/store.go:763

Expected behaviour

After CALL dolt_fetch completes, the server should close its ends of the pipes to each spawned ssh child, causing the child to see stdin EOF and exit cleanly. No process should remain parented to the sql-server after the CALL returns.
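
The mechanism can be demonstrated without dolt or ssh. The sketch below is only a stand-in: cat plays the role of the ssh child, a FIFO plays the role of the stdin pipe, and fd 9 in the parent shell plays the role of the write end held by the sql-server. None of this is dolt code.

```shell
# Stand-in for the ssh child: cat exits only when its input reaches EOF.
fifo=$(mktemp -u) && mkfifo "$fifo"
cat "$fifo" >/dev/null &
child=$!

# The parent opens the write end and keeps it open, like the server holding the pipe.
exec 9>"$fifo"
sleep 1
if kill -0 "$child" 2>/dev/null; then before=alive; else before=exited; fi

# The parent closes its write end; the child now reads EOF and exits on its own.
exec 9>&-
wait "$child"
status=$?
rm -f "$fifo"
echo "while write end open: $before; after close: exit status $status"
```

The child is still alive while the write end is open and exits with status 0 once it is closed, which is the behaviour requested here for the real ssh children.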

Likely location

Probably the SSH remote / chunk-transfer driver inside dolt sql-server. The os/exec.Cmd (or equivalent) that spawns $DOLT_SSH_COMMAND <args> dolt ... transfer appears not to call Wait(), not to close its stdin/stdout pipes once the transfer completes, or both.

Workarounds

No in-process workaround is known. Callers can mitigate by periodically killing leaked ssh ... dolt ... transfer children parented to the sql-server, or by restarting the sql-server between sync batches. Neither is a real fix.
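
A minimal sketch of the first mitigation, assuming procps pkill is available. The helper name reap_leaked_ssh is hypothetical, and the default pattern is the one used in the reproduction above. It cannot tell a leaked child from the child of a fetch that is still running, so it should only be run while no fetch is in flight.

```shell
# Hypothetical helper: terminate ssh transfer children of a given sql-server PID.
# Only safe when no fetch is in flight, since an active fetch's ssh child matches too.
reap_leaked_ssh() {
  server_pid=$1
  pattern=${2:-ssh.*dolt.*transfer}
  # -P restricts the match to direct children of the server,
  # -f matches the pattern against the full command line.
  pkill -TERM -P "$server_pid" -f "$pattern"
}

# Usage (SERVER_PID as in the reproduction above); pkill exits non-zero
# when nothing matched:
# reap_leaked_ssh "$SERVER_PID" || echo "nothing to reap"
```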
