Client connection reuse and server-side query cancellation#152

Open
joe-clickhouse wants to merge 6 commits into main from joe/client-reuse-improvements

@joe-clickhouse joe-clickhouse commented Mar 25, 2026

Collaborator

Summary

The goal of this PR is to harden the MCP server for heavy, sustained usage by reusing client connections, adding real server-side query cancellation, and preventing worker thread pool exhaustion from zombie queries.

  • Connection reuse: Cache clickhouse_connect clients by config key instead of creating a new client on every tool call. This eliminates hundreds of ms of overhead per call, which is a significant contributor to intermittent timeouts under sustained use.
  • Server-side query cancellation: Assign a query_id to every query and issue KILL QUERY on timeout instead of the no-op future.cancel(). Timed-out queries no longer continue running as zombies consuming worker threads and ClickHouse server resources.
  • Timeout alignment: Auto-cap send_receive_timeout to query_timeout + 5 unless explicitly overridden, so worker threads unblock shortly after the MCP timeout fires, preventing thread pool exhaustion.
  • Stale client eviction: Evict cached clients on connection errors and on failed liveness pings after idle. Read-only metadata calls like list_databases and list_tables retry once with a fresh client. run_query evicts but does not retry because writes could duplicate.
  • Config resolution fix: Resolve session config overrides on the request thread, where the FastMCP ContextVar is available, before dispatching to the worker thread. This fixes a latent bug where overrides from PR #115 (Client config override support via MCP Context Session states) were silently missed inside the executor.
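
To illustrate the config resolution bug described in the last bullet, here is a minimal sketch of why a ContextVar set on the request thread is not visible inside a thread pool worker, and how resolving it before dispatch avoids the problem. The names `session_override`, `resolve_config`, `run_tool_buggy`, and `run_tool_fixed` are hypothetical and are not taken from this PR; the sketch uses only the standard library rather than FastMCP itself.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-session override held in a ContextVar.
session_override = contextvars.ContextVar("session_override", default=None)


def resolve_config() -> dict:
    """Merge the session override (if visible in this thread) over a base config."""
    base = {"host": "localhost", "database": "default"}
    return {**base, **(session_override.get() or {})}


def run_tool_buggy(executor: ThreadPoolExecutor) -> dict:
    # Resolution happens inside the worker thread, which has its own context,
    # so the override set by the request thread is not seen there.
    return executor.submit(resolve_config).result()


def run_tool_fixed(executor: ThreadPoolExecutor) -> dict:
    # Resolve on the calling thread, then hand the plain dict to the worker.
    config = resolve_config()
    return executor.submit(lambda cfg: cfg, config).result()


executor = ThreadPoolExecutor(max_workers=1)
# Warm up the pool so its worker thread exists before any request context does,
# as would be the case in a long-lived server.
executor.submit(lambda: None).result()

session_override.set({"database": "analytics"})
buggy = run_tool_buggy(executor)
fixed = run_tool_fixed(executor)
executor.shutdown()
```

In this sketch `buggy` falls back to the base database while `fixed` carries the override, which matches the "silently missed" behavior the bullet describes.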

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens the mcp_clickhouse MCP server for sustained/high-concurrency usage by reusing ClickHouse clients across tool calls, implementing true server-side query cancellation, and aligning network timeouts to prevent worker exhaustion.

Changes:

  • Added a cached clickhouse_connect client layer keyed by resolved client config, with eviction on liveness/connection failures and limited retries for safe metadata tools.
  • Implemented query ID tracking and server-side cancellation via KILL QUERY on MCP timeouts.
  • Introduced CLICKHOUSE_MCP_MAX_WORKERS and aligned send_receive_timeout to query_timeout + 5 unless explicitly overridden.
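
The caching, eviction, and retry-once behavior in the first bullet could look roughly like the sketch below. It is not the code from this PR: `get_client`, `evict_client`, `call_read_only`, and `FakeClient` are hypothetical names, the built-in `ConnectionError` stands in for whatever errors the driver actually raises, and the key function assumes all config values are hashable. In real use the factory would be something like `clickhouse_connect.get_client`.

```python
import threading
from typing import Any, Callable, Dict, Tuple

_client_cache: Dict[Tuple, Any] = {}
_client_cache_lock = threading.Lock()


def config_key(config: dict) -> Tuple:
    """Build a hashable cache key from a resolved client config."""
    return tuple(sorted(config.items()))


def get_client(config: dict, factory: Callable[..., Any]) -> Any:
    """Return the cached client for this config, creating it on first use."""
    key = config_key(config)
    with _client_cache_lock:
        client = _client_cache.get(key)
        if client is None:
            client = factory(**config)
            _client_cache[key] = client
        return client


def evict_client(config: dict) -> None:
    """Drop the cached client so the next call builds a fresh one."""
    with _client_cache_lock:
        _client_cache.pop(config_key(config), None)


def call_read_only(config: dict, factory: Callable[..., Any], operation: Callable[[Any], Any]) -> Any:
    """Run a read-only operation, retrying once with a fresh client on a connection error."""
    try:
        return operation(get_client(config, factory))
    except ConnectionError:
        evict_client(config)
        return operation(get_client(config, factory))


class FakeClient:
    """Test double: the first instance behaves like a stale connection."""

    created = 0

    def __init__(self, **config: Any) -> None:
        FakeClient.created += 1
        self.generation = FakeClient.created

    def list_databases(self) -> list:
        if self.generation == 1:
            raise ConnectionError("stale socket")
        return ["default"]


cfg = {"host": "localhost", "port": 8123}
databases = call_read_only(cfg, FakeClient, lambda c: c.list_databases())
```

A write path such as run_query would call `evict_client` on failure but skip the second attempt, since replaying a write could duplicate it.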

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `mcp_clickhouse/mcp_server.py` | Implements client caching, active query tracking, cancellation via KILL QUERY, timeout alignment, and worker pool sizing/logging. |
| `mcp_clickhouse/mcp_env.py` | Adds CLICKHOUSE_MCP_MAX_WORKERS configuration property. |
| `mcp_clickhouse/__init__.py` | Exposes additional helpers at package level via imports/`__all__`. |
| `README.md` | Documents the updated timeout behavior, cancellation semantics, and new max-workers config. |
| `tests/test_client_cache.py` | Adds coverage for caching behavior, eviction, and timeout alignment. |
| `tests/test_query_cancellation.py` | Adds coverage for query_id propagation, active query tracking, and timeout-triggered cancellation. |
| `tests/test_context_config_override.py` | Ensures test isolation by clearing the client cache around config override tests. |


@peter-leonov-ch peter-leonov-ch left a comment

Contributor

Not sure if this PR has integration tests; if not, they'd be nice to add.

Comment thread mcp_clickhouse/mcp_env.py
CLICKHOUSE_MCP_BIND_HOST: Bind host for HTTP/SSE (default: 127.0.0.1)
CLICKHOUSE_MCP_BIND_PORT: Bind port for HTTP/SSE (default: 8000)
CLICKHOUSE_MCP_QUERY_TIMEOUT: SELECT tool timeout in seconds (default: 30)
CLICKHOUSE_MCP_MAX_WORKERS: Maximum thread pool workers for query execution (default: 10)

Contributor

Fun fact: the JS client proxies all the CLICKHOUSE_* parameters verbatim to CH. It seems to have worked fine without the need to explicitly enumerate them. It does not forward env vars though.

Collaborator Author

Nice! Thanks for the context.
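
For readers following the configuration discussion, here is a small sketch of how the two settings in the excerpt above might be consumed: the worker pool is sized from CLICKHOUSE_MCP_MAX_WORKERS (default 10), and the socket timeout is derived from the query timeout (default 30) plus 5 seconds unless an explicit value is supplied. The helper names `int_env`, `aligned_send_receive_timeout`, and `build_executor` are hypothetical, and how the explicit override is actually passed in is not shown in this thread.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Mapping, Optional


def int_env(env: Mapping[str, str], name: str, default: int) -> int:
    """Read an integer setting, falling back to the default when unset or empty."""
    raw = env.get(name)
    return int(raw) if raw not in (None, "") else default


def aligned_send_receive_timeout(query_timeout: int, explicit: Optional[int] = None) -> int:
    """Keep the socket timeout just above the MCP timeout unless the user set one."""
    if explicit is not None:
        return explicit
    return query_timeout + 5


def build_executor(env: Mapping[str, str] = os.environ) -> ThreadPoolExecutor:
    """Size the query thread pool from CLICKHOUSE_MCP_MAX_WORKERS (default 10)."""
    return ThreadPoolExecutor(max_workers=int_env(env, "CLICKHOUSE_MCP_MAX_WORKERS", 10))


env = {"CLICKHOUSE_MCP_QUERY_TIMEOUT": "30", "CLICKHOUSE_MCP_MAX_WORKERS": "4"}
query_timeout = int_env(env, "CLICKHOUSE_MCP_QUERY_TIMEOUT", 30)
socket_timeout = aligned_send_receive_timeout(query_timeout)
pool = build_executor(env)
pool.shutdown()
```

With the defaults, a worker blocked on a dead socket is released about 5 seconds after the MCP timeout fires instead of holding a pool slot for the driver's much longer default.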

Comment on lines +584 to +585
with _active_queries_lock:
    entry = _active_queries.pop(query_id, None)

Contributor

Not sure if it's relevant anymore, but a helper function that does both (respects the lock and mutates the guarded value) might be of use.

Collaborator Author

Thanks! I did consider it, but the 3 lock sites each do different things under the lock, so a shared helper seemed to obscure more than help.
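
As a rough illustration of the trade-off in this thread, the sketch below shows three lock sites that each perform a different operation on the shared registry. Only the `pop` under `_active_queries_lock` is taken from the diff excerpt; the other two operations (registering an entry and taking a snapshot) and all three function names are assumptions made for illustration.

```python
import threading
import time
from typing import Dict, List, Optional

_active_queries: Dict[str, dict] = {}
_active_queries_lock = threading.Lock()


def register_query(query_id: str, query: str) -> None:
    """Site 1 (assumed): insert a new entry when a query starts."""
    with _active_queries_lock:
        _active_queries[query_id] = {"query": query, "started": time.monotonic()}


def finish_query(query_id: str) -> Optional[dict]:
    """Site 2 (from the diff): remove the entry, tolerating a missing key."""
    with _active_queries_lock:
        return _active_queries.pop(query_id, None)


def active_query_ids() -> List[str]:
    """Site 3 (assumed): copy the keys so callers never iterate the live dict."""
    with _active_queries_lock:
        return sorted(_active_queries)


register_query("q-1", "SELECT 1")
register_query("q-2", "SELECT 2")
snapshot = active_query_ids()
removed = finish_query("q-1")
missing = finish_query("q-1")
```

Each body is one or two lines, so a generic "run this under the lock" helper would mostly add a layer of indirection, which is consistent with the author's reasoning.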

logger.warning(
    "Query %s timed out after %s seconds: %s", query_id, timeout_secs, query
)
_cancel_query(query_id)

Contributor

The code says it's reusing the same connection. I'm new here, is this a native connection or an HTTP connection? If it's the HTTP connection then running another query on a "busy" connection might not work if we're timing out on the client side.

Collaborator Author

This uses clickhouse-connect under the hood, so it's HTTP. However, the cached client wraps a connection pool, so the KILL grabs a different socket from the in-flight query rather than queueing. We also set autogenerate_session_id=False, which is what lets two calls on the same client run concurrently.
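
The timeout path discussed in this thread can be sketched as follows, under the assumption that the client can issue a second statement while the first is still in flight. `FakeClient` and `run_with_timeout` are hypothetical stand-ins, not the PR's code or the clickhouse-connect API; the fake only records commands and uses an event to simulate the server stopping the query.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class FakeClient:
    """Stand-in for a pooled HTTP client; records commands instead of sending them."""

    def __init__(self) -> None:
        self.commands = []
        self._killed = threading.Event()

    def query(self, sql: str, query_id: str) -> str:
        # Simulate a long-running query that ends early only if it is killed.
        self._killed.wait(timeout=5)
        return "killed" if self._killed.is_set() else "finished"

    def command(self, sql: str) -> None:
        self.commands.append(sql)
        if sql.startswith("KILL QUERY"):
            self._killed.set()


def run_with_timeout(client, executor, sql: str, timeout_secs: float) -> str:
    """Run a query in the pool and kill it server-side if it exceeds the timeout."""
    query_id = str(uuid.uuid4())
    future = executor.submit(client.query, sql, query_id)
    try:
        return future.result(timeout=timeout_secs)
    except FutureTimeout:
        # Future.cancel() cannot stop a task that is already running, so the
        # worker is only released once the server actually stops the query.
        client.command(f"KILL QUERY WHERE query_id = '{query_id}'")
        raise TimeoutError(f"Query {query_id} timed out after {timeout_secs} seconds")


client = FakeClient()
timed_out = False
with ThreadPoolExecutor(max_workers=2) as pool:
    try:
        run_with_timeout(client, pool, "SELECT sleep(3)", 0.1)
    except TimeoutError:
        timed_out = True
```

The KILL is issued from the calling thread while the worker is still blocked in `query`, which mirrors the point above about the pool handing out a different socket rather than queueing behind the in-flight request.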

3 participants