sql-server: named lock (GET_LOCK) held by a crashed client is retained indefinitely — idle dead session never reaped #11194

@Wldc4rd

Summary

When a client holding a named lock (GET_LOCK) dies without closing its connection (e.g. kill -9, crash, OOM-kill), dolt sql-server retains the dead client's session, and with it the named lock, indefinitely. We observed no release after 5+ minutes in a minimal repro and after 10+ minutes in the production incident that led us here. While the dead session persists, every other connection's GET_LOCK on the same name times out. The only remediation we found is a manual server-side KILL <conn-id>.

Version / platform

dolt version 2.1.2 (linux/amd64), server started as dolt sql-server --host 127.0.0.1 --port <port>. (We have not yet tested newer releases; apologies if this is already addressed.)

Repro (minimal)

dolt sql -q "CREATE DATABASE testdb"
dolt sql-server --host 127.0.0.1 --port 13310 &

# Client A: acquire a named lock, then go idle holding the session open
( echo "SELECT GET_LOCK('repro_lock',5);"; sleep 600 ) | mysql -h 127.0.0.1 -P 13310 -u root testdb &

# Verify held (give client A a moment to connect and run its query first):
mysql -h 127.0.0.1 -P 13310 -u root -N -e "SELECT IS_USED_LOCK('repro_lock')"   # -> 2 (client A's conn id)

# Kill client A hard (dead peer, no TCP FIN from the client process):
kill -9 <mysql client pid>

# From a live connection:
mysql -h 127.0.0.1 -P 13310 -u root -N -e "SELECT GET_LOCK('repro_lock',5)"     # -> 0 (timeout, blocked)
mysql -h 127.0.0.1 -P 13310 -u root -N -e "SELECT IS_USED_LOCK('repro_lock')"   # -> 2, for 5+ minutes after the kill

Observed timeline: kill at T+0; IS_USED_LOCK still reports the dead conn at T+5m10s (end of observation window). KILL 2 releases it immediately.
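
To make the timeline easier to reproduce, here is a minimal polling sketch that could be run right after the kill -9. It only uses the mysql invocation and IS_USED_LOCK query from the repro. The function names (lock_holder, watch_lock) and the INTERVAL / MAX_SECS knobs are hypothetical, chosen for illustration.

```shell
# Assumed values: HOST/PORT/LOCK follow the repro above;
# INTERVAL and MAX_SECS are illustrative knobs, not Dolt settings.
HOST=${HOST:-127.0.0.1}
PORT=${PORT:-13310}
LOCK=${LOCK:-repro_lock}
INTERVAL=${INTERVAL:-10}
MAX_SECS=${MAX_SECS:-600}

# Print the connection id holding $LOCK, or NULL when the lock is free.
lock_holder() {
  mysql -h "$HOST" -P "$PORT" -u root -N -e "SELECT IS_USED_LOCK('$LOCK')"
}

# Poll until the lock is released or MAX_SECS elapses, logging elapsed time.
# Returns 0 on release, 1 if still held at the deadline, 2 if the query fails.
watch_lock() {
  start=$(date +%s)
  while :; do
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$MAX_SECS" ]; then
      echo "still held after ${elapsed}s"
      return 1
    fi
    holder=$(lock_holder) || { echo "query failed" >&2; return 2; }
    if [ "$holder" = "NULL" ]; then
      echo "released after ${elapsed}s"
      return 0
    fi
    echo "T+${elapsed}s: held by conn $holder"
    sleep "$INTERVAL"
  done
}
```

Calling watch_lock immediately after the kill prints one line per poll, so the output can be pasted into a report as the observed timeline.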

Why it matters

The session is idle (no in-flight query), so nothing prompts the server to read from the dead socket; with no TCP keepalive / dead-peer reaping at the session layer, the named lock outlives its owner until an operator intervenes. Clients that serialize on named locks (schema-migration mutexes, leader election, etc.) turn one crashed client into an indefinite fleet-wide stall. We hit this in production behind a client-side migration mutex: one dead client process held GET_LOCK('bd_schema_init:<db>') for 10+ minutes, starving every other client's store-open until a manual KILL.

Expected

One (or more) of:

  • Enable TCP keepalive on client connections so dead peers are detected and their sessions reaped within a bounded interval (releasing session-scoped locks);
  • A configurable idle-session / dead-peer timeout at the server session layer;
  • At minimum, documentation that named locks can outlive crashed clients indefinitely, with KILL <conn> as the remediation.
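
Until one of these lands, the manual remediation can be scripted. The sketch below looks up the holder with IS_USED_LOCK and issues KILL against it, as we did by hand. The function name kill_lock_holder is hypothetical, and the script cannot tell a dead holder from a healthy one, so it is only appropriate when the operator already knows the owning client is gone.

```shell
# Assumed values: HOST/PORT follow the repro; adjust for a real deployment.
HOST=${HOST:-127.0.0.1}
PORT=${PORT:-13310}

# Release a named lock by killing the connection that IS_USED_LOCK reports.
# Caution: KILL also terminates a healthy holder, so confirm the owner is dead first.
kill_lock_holder() {
  lock=$1
  holder=$(mysql -h "$HOST" -P "$PORT" -u root -N \
    -e "SELECT IS_USED_LOCK('$lock')") || return 2
  case "$holder" in
    ''|NULL)
      echo "lock '$lock' is not held"
      return 0 ;;
    *[!0-9]*)
      # Anything other than a numeric connection id is unexpected; do nothing.
      echo "unexpected holder value: $holder" >&2
      return 2 ;;
  esac
  echo "killing conn $holder holding '$lock'"
  mysql -h "$HOST" -P "$PORT" -u root -e "KILL $holder"
}
```

For the repro this would be invoked as kill_lock_holder repro_lock. There is a small window between the lookup and the KILL in which the lock could change hands, which is another reason to treat this as an operator tool rather than an automatic reaper.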

Happy to re-run the repro against a newer build or with additional instrumentation if useful.
