Summary
When a client holding a named lock (GET_LOCK) dies without closing its connection (e.g. kill -9, crash, OOM-kill), dolt sql-server retains the dead client's session, and with it the named lock, indefinitely. We observed no release after 5+ minutes in a minimal repro and after 10+ minutes in the production incident that led us here. Until the session is removed, every other connection's GET_LOCK on the same name times out. The only remediation we found is a manual server-side KILL <conn-id>.
Version / platform
dolt version 2.1.2 (linux/amd64), server started as dolt sql-server --host 127.0.0.1 --port <port>. (We have not yet tested newer releases; apologies if this is already addressed.)
Repro (minimal)
dolt sql -q "CREATE DATABASE testdb"
dolt sql-server --host 127.0.0.1 --port 13310 &
# Client A: acquire a named lock, then go idle holding the session open
( echo "SELECT GET_LOCK('repro_lock',5);"; sleep 600 ) | mysql -h 127.0.0.1 -P 13310 -u root testdb &
# Verify held:
mysql -h 127.0.0.1 -P 13310 -u root -N -e "SELECT IS_USED_LOCK('repro_lock')" # -> 2 (client A's conn id)
# Kill client A hard (dead peer, no TCP FIN from the client process):
kill -9 <mysql client pid>
# From a live connection:
mysql -h 127.0.0.1 -P 13310 -u root -N -e "SELECT GET_LOCK('repro_lock',5)" # -> 0 (timeout, blocked)
mysql -h 127.0.0.1 -P 13310 -u root -N -e "SELECT IS_USED_LOCK('repro_lock')" # -> 2, for 5+ minutes after the kill
Observed timeline: kill at T+0; IS_USED_LOCK still reports the dead conn at T+5m10s (end of observation window). KILL 2 releases it immediately.
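For reference, the manual remediation can be scripted roughly as follows. This is a minimal sketch, not a dolt feature: kill_lock_holder is a hypothetical helper name, and the host, port, and user are assumed to match the repro above. It kills whichever connection IS_USED_LOCK reports, without checking whether that client is actually dead, so an operator should confirm that first.

```shell
# Hypothetical helper (not part of dolt): kill the connection holding a named lock.
# Assumes the server address and user from the repro above.
kill_lock_holder() {
  lock_name="$1"
  # IS_USED_LOCK returns the holder's connection id, or NULL if the lock is free.
  holder=$(mysql -h 127.0.0.1 -P 13310 -u root -N -e "SELECT IS_USED_LOCK('${lock_name}')")
  case "$holder" in
    ''|NULL)
      echo "lock ${lock_name} is free"
      return 0
      ;;
    *[!0-9]*)
      # Anything that is not a plain integer is unexpected; refuse to KILL.
      echo "unexpected holder value: ${holder}" >&2
      return 1
      ;;
  esac
  echo "killing connection ${holder} holding ${lock_name}"
  mysql -h 127.0.0.1 -P 13310 -u root -e "KILL ${holder}"
}
```

Usage in the repro would be `kill_lock_holder repro_lock`, which is equivalent to running KILL 2 by hand.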
Why it matters
The session is idle (no in-flight query), so nothing prompts the server to read from the dead socket; with no TCP keepalive / dead-peer reaping at the session layer, the named lock outlives its owner until an operator intervenes. Clients that serialize on named locks (schema-migration mutexes, leader election, etc.) turn one crashed client into an indefinite fleet-wide stall. We hit this in production behind a client-side migration mutex: one dead client process held GET_LOCK('bd_schema_init:<db>') for 10+ minutes, starving every other client's store-open until a manual KILL.
Expected
One (or more) of:
- Enable TCP keepalive on client connections so dead peers are detected and their sessions reaped within a bounded interval (releasing session-scoped locks);
- A configurable idle-session / dead-peer timeout at the server session layer;
- At minimum, documentation that named locks can outlive crashed clients indefinitely, with KILL <conn> as the remediation.
Happy to re-run the repro against a newer build or with additional instrumentation if useful.