Skip to content

Add JDBC client throttling — analysis document, implementation and documentation#523

Merged
rrobetti merged 28 commits into
mainfrom
copilot/discuss-throttling-implementation
May 21, 2026
Merged

Add JDBC client throttling — analysis document, implementation and documentation#523
rrobetti merged 28 commits into
mainfrom
copilot/discuss-throttling-implementation

Conversation

Copilot AI commented May 16, 2026

Copy link
Copy Markdown
Contributor
  • Inspect current analysis doc examples and identify all places tied to the 120-slot scale
  • Run baseline validation commands to confirm repo state
  • Update analysis doc to use a 40-slot scale and adjust related numbers consistently
  • Run targeted post-change validation
  • Reply to the new PR comment with the addressing commit hash

Agent-Logs-Url: https://github.com/Open-J-Proxy/ojp/sessions/a19cc33b-051c-409c-b8cf-9f7e1c9a4e98

Co-authored-by: rrobetti <7221783+rrobetti@users.noreply.github.com>
Copilot AI requested a review from rrobetti May 16, 2026 16:11

@rrobetti rrobetti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Evaluate the following idea:
Add new information to the SessionInfo returned to clients, the extra info will be 2 new fields:
1 - the number of clients connected to that datasource with same user and password (session hash)
2 - the max admission concurrent requests allowed in the server.

The OJP JDBC client shall use this info to create a semaphore and restrict how many concurrent requests it sends to OJP server, each cliend shall set its semaphore by calculating by the formula (max admission of ojp server divided by number of clients of that datasource) multiplied by the number of OJP servers in the cluster.

…IVE_THROTTLING_ANALYSIS.md

Agent-Logs-Url: https://github.com/Open-J-Proxy/ojp/sessions/474a33bb-3629-4cce-9651-8df0768af306

Co-authored-by: rrobetti <7221783+rrobetti@users.noreply.github.com>

Copilot AI commented May 17, 2026

Copy link
Copy Markdown
Contributor Author

Evaluate the following idea:
Add new information to the SessionInfo returned to clients, the extra info will be 2 new fields:
1 - the number of clients connected to that datasource with same user and ...

Evaluated in commit 028c38e — a new section "Option: Server-Cooperative Fair-Share Throttling via SessionInfo" was added to CLIENT_REACTIVE_THROTTLING_ANALYSIS.md.

Summary of findings:

The formula works, but two things need to be pinned before implementation:

  1. Is maxAdmission per-node or cluster-aggregate? The formula multiplies by numOjpServers, so if maxAdmission is per-node the result is correct; if it's cluster-aggregate the multiplication is wrong. Must be decided upfront.
  2. What does clientCount count? It must be distinct clientUUID values (JVMs) per connHash, not raw session/connection count — otherwise a single JVM with a 50-connection pool would inflate the count by 50 and collapse everyone's limit.

Main risks identified:

  • Integer division wastes capacity: floor(20/7) = 2, leaving 6 slots permanently unused. Ceiling division + a 10% safety headroom is safer.
  • Thundering herd on client disconnect: When several clients drop, every remaining client gets a larger budget simultaneously and may burst together. A step-limited increase (AIMD-style) mitigates this.
  • Semaphore resizing is non-trivial: java.util.concurrent.Semaphore has no setPermits(). The driver must track currentLimit separately and safely release/acquire the delta.
  • Cross-node clientCount accuracy: In multinode mode, each node only sees its own clients. Recommended v1 approach: use per-node snapshot (conservative, slightly over-throttles), document it clearly, and add cluster-aggregate gossip in a later version.
  • In-transaction bypass: Mid-transaction statements must skip the local semaphore or they risk deadlocking while the server waits for commit/rollback.

Positive: clusterHealth already carries the node UP/DOWN list, so numOjpServers can be derived from it without a new field. maxAdmission = 0 can cleanly mean "admission control not configured → no client-side limit."

Overall this design is materially better than purely reactive for steady-state behavior. The protocol cost (two int32 fields on SessionInfo) is minimal.

Copilot AI changed the title Add server-triggered JDBC throttling options analysis document Add JDBC client throttling analysis documents May 17, 2026
Copilot AI requested a review from rrobetti May 17, 2026 07:54
@rrobetti

Copy link
Copy Markdown
Contributor

@copilot
1 - I believe maxAdmission is per-node, check the code and confirm.
2 - clientUUID values (JVMs) per connHash is correct.

Use Ceiling division + a 10% safety headroom and document the rational.

Do include A step-limited increase (AIMD-style) and document the rational.

In relation to "Semaphore resizing is non-trivial" is there any cheap(efficient) alternative to control concurrency that can be easily resized and that does not add latency in the flow?

"Cross-node clientCount accuracy" I do not understand what this means, explain in simple language with simple example.

"In-transaction bypass" , I am not sure if this is a problem, explain in simple language with a simple example.

Keep only the analysis for now, do not change the code yet.

…ve, AIMD, plain-language examples

Agent-Logs-Url: https://github.com/Open-J-Proxy/ojp/sessions/221f3b7b-572e-4c3a-a6b3-19471bba8684

Co-authored-by: rrobetti <7221783+rrobetti@users.noreply.github.com>

Copilot AI commented May 17, 2026

Copy link
Copy Markdown
Contributor Author

@copilot
1 - I believe maxAdmission is per-node, check the code and confirm.
2 - clientUUID values (JVMs) per connHash is correct.
...

Updated in commit c9b629c. Here's what changed:

1. maxAdmission is per-node — confirmed.
SlotManager.totalSlots is set to actualPoolSize (the HikariCP connection pool size on this specific node). The formula multiplying by numOjpServers is therefore correct. This is now documented in the proposal section, marked resolved in Q1, and the Cons #4 "ambiguous" note has been replaced with a confirmation.

2. clientUUID per connHash — confirmed and resolved in C1.

3. Ceiling division + 10% headroom. S1 now explains the rationale: floor division leaves capacity permanently wasted; ceiling slightly over-allocates, which the 10% headroom compensates for by absorbing one stale clientCount error.

4. AIMD step-limited increase. S4 now has the full algorithm: decrease applies immediately (fast overload response), increase is capped at min(newLimit, currentLimit + 1) per SessionInfo update. Rationale is documented with an example showing how 4 simultaneous disconnects without AIMD would spike all remaining clients at once.

5. Semaphore alternative. The new Con #5 shows that a simple AtomicInteger counter + volatile int limit is the right choice for fail-fast (non-blocking) control. Resizing is a single volatile write — zero overhead on the happy path (one atomic increment + one comparison). No drain/inject logic needed.

6. Cross-node clientCount — plain-language example added. Con #8 now shows concretely: App1→Node A, App2→Node B, each node sees only its own client (clientCount = 1), both clients compute (10/1)*2 = 20 permits but total cluster capacity is only 20. Safe v1 approach: accept per-node snapshot; server's own SlotManager remains the final safety gate.

7. In-transaction bypass — plain-language example added. C5 now walks through the step-by-step scenario where Thread A holds an open transaction, the client semaphore fills, Thread A blocks waiting for a permit, and the server's transaction timeout fires before Thread A ever sends its next statement. Fix: skip the inFlight check when autoCommit == false.

@rrobetti rrobetti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot we should have two modes configurable, proactive and reactive, in the proactive mode we do the client trottling with the atomic integer and fail fast as suggested, if reactive mode I want to explore and hear your opinions on keeping a record of how many simultaneous connections was the server serving before one got an admission timout, the goal with that is to send a smarter number to the client instead of the fixed total the server was configured to take, in real scenarios ig might break before that, for example, if ojp server was being able to serve 50 simultaneous requests without errors but when it tried to serve 51 it got a admission timout error, it means its limit is 50 and that is the number we should send to the client. Tell me your concerns, opinions, sugestions and questions about this new idea.

Copilot AI commented May 17, 2026

Copy link
Copy Markdown
Contributor Author

@copilot we should have two modes configurable, proactive and reactive, in the proactive mode we do the client trottling with the atomic integer and fail fast as suggested, if reactive mode I want to ...

Evaluated and documented in commit 014e56e. The new sections added are:

Two-Mode Framework — a short overview table comparing proactive, reactive, and combined modes, showing what each requires and recommends.

Reactive Mode with observedPeak — Deep Analysis — full evaluation of the idea:

How it would work in SlotManager:

  • Add highWaterMark (tracks peak in-flight count) and observedPeak (initialized to maxAdmission)
  • On wait-timeout: observedPeak = max(floor, min(observedPeak, currentActive))
  • AIMD recovery: every totalSlots × 2 successful releases: observedPeak = min(maxAdmission, observedPeak + 1)
  • Sent as a third int32 field in SessionInfo; clients use it in place of maxAdmission in the fair-share formula

Main concerns raised:

  1. "False floor" collapse — a single slow query or DB blip can drive observedPeak to near-zero. Fix: 10% floor (max(1, totalSlots * 0.1)).
  2. Recovery rate tuning — too fast = burst risk, too slow = wasted capacity. K should be configurable.
  3. Timeout path ambiguity — only the semaphore wait-timeout is a clean signal; queue-depth rejections may produce misleading counts.
  4. SQS interaction — a slow-lane timeout ≠ total server overload; using activeSlow + activeFast total is safer than per-lane.
  5. Multiple concurrent timeouts raceAtomicInteger.updateAndGet with a pre-snapshotted currentActive handles this correctly.

Opinion (confidence: 85%): The idea is sound — it's the TCP CWND model applied to OJP. The "false floor" and recovery rate are the only significant risks and both have clear mitigations. The recommended v1 is combined mode: send both maxAdmission and observedPeak, driver uses min(proactiveLimit, reactiveLimit). Reactive-only (no clientCount tracking) is simpler and could be the right first cut if fairness between clients is not the immediate priority.

Copilot AI requested a review from rrobetti May 17, 2026 09:50

@rrobetti rrobetti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot the documents are too verbose, reduce the volume of information and focus on the latest ideas and solutions we're discussing, keep notes of previous ideas considered and why they were dropped.

Copilot AI commented May 17, 2026

Copy link
Copy Markdown
Contributor Author

@copilot the documents are too verbose, reduce the volume of information and focus on the latest ideas and solutions we're discussing, keep notes of previous ideas considered and why they were dropped...

Done in commit c265470.

  • CLIENT_REACTIVE_THROTTLING_ANALYSIS.md: 1012 → 177 lines. Now structured as: protocol additions → proactive mode (formula, AtomicInteger, AIMD, in-transaction bypass, cross-node caveat) → reactive mode (observedPeak, risks table, mitigations) → combined mode → open questions → summary table → "Dropped Approaches" (purely client-reactive, Semaphore, floor division — each with a one-line reason why dropped).
  • JDBC_SERVER_TRIGGERED_THROTTLING_OPTIONS.md: 222 → 74 lines. Now a concise options-considered log with a forward reference to the current design doc. Resolved option sets (signaling method, computation method, queue scope) are collapsed to one-liners showing what was chosen and why alternatives were skipped.

…uent requests client-side

Agent-Logs-Url: https://github.com/Open-J-Proxy/ojp/sessions/5b54bf63-7220-422d-b3c4-eeb145e5330d

Co-authored-by: rrobetti <7221783+rrobetti@users.noreply.github.com>

Copilot AI commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

@copilot
I have run tests with the tool https://github.com/rrobetti/stressar/ and it is returning errors when running with this pr version of OJP

Error Breakdown

...

Fixed in commit f663238.

Root cause: When the server returned RESOURCE_EXHAUSTED (slot admission timeout), the exception propagated through Statement.executeQuery/executeUpdate but nothing updated ClientThrottleManager's reactive limit. The reactive limit stayed at MAX_VALUE, so the very next request was allowed through and hit the still-overloaded server again — causing the repeated server-side rejections you saw.

What changed:

  1. ClientThrottleManager.notifyServerOverload() — new method that applies an AIMD multiplicative decrease: halves reactiveLimit (e.g. 8 → 4 → 2 → 1). If reactiveLimit was uninitialised (MAX_VALUE), it seeds from half the proactive limit so the client immediately backs off on the very first overload signal.

  2. Statement.executeQuery/executeUpdate — now catch StatusRuntimeException(RESOURCE_EXHAUSTED) and call notifyServerOverload(). After the first server rejection, the client's reactive limit drops to half, and subsequent requests are blocked client-side (SQLTransientException) without touching the server.

  3. PreparedStatement.executeQuery/executeUpdate — were completely unguarded before (no acquireThrottle call at all). They now have the same throttle acquire/release and notifyServerOverload() protection as Statement.

  4. Extracted a shared onServerOverload() helper in Statement (protected, so PreparedStatement can reuse it) to avoid the duplicated catch block across four methods.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkstyle.org
    • Triggering command: /opt/hostedtoolcache/CodeQL/2.25.4/x64/codeql/tools/linux64/java/bin/java /opt/hostedtoolcache/CodeQL/2.25.4/x64/codeql/tools/linux64/java/bin/java -jar /opt/hostedtoolcache/CodeQL/2.25.4/x64/codeql/xml/tools/xml-extractor.jar --fileList=/tmp/codeql-scratch-0fe3065af824c7c2/dbs/java/working/files-to-index5345095830394566210.list --sourceArchiveDir=/tmp/codeql-scratch-0fe3065af824c7c2/dbs/java/src --outputDir=/tmp/codeql-scratch-0fe3065af824c7c2/dbs/java/trap/java (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@rrobetti rrobetti marked this pull request as ready for review May 20, 2026 19:50
@rrobetti

Copy link
Copy Markdown
Contributor

@copilot the errors remain, I noticed they only occurr when slow query segregation is enabled.

…ion=0; use fastSlots for maxAdmission

Root cause 1 (driver): ClientThrottleManager.updateFromSessionInfo() was resetting
reactiveLimit = MAX_VALUE whenever it received a SessionInfo with maxAdmission=0. Since
executeUpdate/executeQuery responses always carry maxAdmission=0, this silently undid
every notifyServerOverload() adjustment on the first successful SQL response.
Fix: return early when maxAdmission <= 0.

Root cause 2 (server): ConnectAction sent maxAdmission = totalSlots regardless of the
SQS split. With default 20% slow slots, fastSlots=8 out of 10 total. The client's
proactiveLimit (90% of 10 = 9) exceeded the 8 fast-lane slots, triggering immediate
admission timeouts before the reactive throttle had a chance to activate.
Fix: SlotManager.getEffectiveMaxAdmission() returns fastSlots when slowSlots>0,
totalSlots otherwise; ConnectAction now calls this method.

Analysis doc updated to reflect both fixes.

Agent-Logs-Url: https://github.com/Open-J-Proxy/ojp/sessions/895e1b8d-3550-47aa-b460-84b5d8511181

Co-authored-by: rrobetti <7221783+rrobetti@users.noreply.github.com>

Copilot AI commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

@copilot the errors remain, I noticed they only occurr when slow query segregation is enabled.

Fixed in commit 8aaaab3. Two root causes identified:

Root cause 1 (driver)ClientThrottleManager.updateFromSessionInfo() was resetting reactiveLimit = MAX_VALUE on every SQL response. executeUpdate and executeQuery responses carry a minimal SessionInfo with maxAdmission=0; the old else branch treated that as "uninitialised" and reset the limit, silently undoing every notifyServerOverload() reduction on the first successful query. Fix: early return when maxAdmission <= 0.

Root cause 2 (server)ConnectAction always sent maxAdmission = totalSlots regardless of the SQS fast/slow split. With the default 20% slow split (fastSlots=8, totalSlots=10) the client's proactive limit (floor(10 × 0.9) = 9) exceeded the 8 fast-lane slots, causing an immediate admission timeout on the 9th concurrent request. This is why the problem only appeared with SQS enabled — without SQS, fastSlots = totalSlots so proactiveLimit = 9 < 10 → no overflow. Fix: SlotManager.getEffectiveMaxAdmission() returns fastSlots when slowSlots > 0, totalSlots otherwise; ConnectAction now calls this method.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkstyle.org
    • Triggering command: /opt/hostedtoolcache/CodeQL/2.25.4/x64/codeql/tools/linux64/java/bin/java /opt/hostedtoolcache/CodeQL/2.25.4/x64/codeql/tools/linux64/java/bin/java -jar /opt/hostedtoolcache/CodeQL/2.25.4/x64/codeql/xml/tools/xml-extractor.jar --fileList=/tmp/codeql-scratch-0fe3065af824c7c2/dbs/java/working/files-to-index10445916390145968206.list --sourceArchiveDir=/tmp/codeql-scratch-0fe3065af824c7c2/dbs/java/src --outputDir=/tmp/codeql-scratch-0fe3065af824c7c2/dbs/java/trap/java (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@rrobetti

Copy link
Copy Markdown
Contributor

@copilot review your last fix, I have double the load and the errors now appear in both SQS and non SQS

non SQS

Error Breakdown

Instance Error type Count First error message
0 StatusRuntimeException 4 CANCELLED: Thread interrupted
1 StatusRuntimeException 5 RESOURCE_EXHAUSTED: Timeout waiting for admission control slot for operation: 1895a7989b63b646
2 StatusRuntimeException 3 CANCELLED: Thread interrupted
3 StatusRuntimeException 5 RESOURCE_EXHAUSTED: Timeout waiting for admission control slot for operation: 1895a7989b63b646

SQS

Error Breakdown

Instance Error type Count First error message
0 SQLException 1 Connection admission timeout for hash: Ld--4pNhqOb-WVk0RbA8aOPs_VIISdRyE87pgVYf_p8 after 5000ms (phase=admission)
0 StatusRuntimeException 27 RESOURCE_EXHAUSTED: Timeout waiting for fast operation slot for operation: 1895a7989b63b646
1 StatusRuntimeException 26 RESOURCE_EXHAUSTED: Timeout waiting for fast operation slot for operation: 19ddf4bbc5b8c770
2 StatusRuntimeException 27 RESOURCE_EXHAUSTED: Timeout waiting for fast operation slot for operation: e7282124a6fbdd3c
3 StatusRuntimeException 23 RESOURCE_EXHAUSTED: Timeout waiting for fast operation slot for operation: e7282124a6fbdd3c

@rrobetti

Copy link
Copy Markdown
Contributor

@copilot create a simple and easy to follow with flow diagram mermaid to explain how the throttling works. Use simple examples to explain. avoid being too verbose.

Copilot AI commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

@copilot create a simple and easy to follow with flow diagram mermaid to explain how the throttling works. Use simple examples to explain. avoid being too verbose.

Added a simple Mermaid flow plus two short examples in documents/analysis/CLIENT_REACTIVE_THROTTLING_ANALYSIS.md to explain throttling in an easy-to-follow way. Commit: ccd8cfa.

@rrobetti rrobetti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot

In documents/analysis/CLIENT_REACTIVE_THROTTLING_ANALYSIS.md

Unable to render rich display

Parse error on line 3:
...local limit
ceil(effectiveAdmission/
-----------------------^
Expecting 'SQE', 'DOUBLECIRCLEEND', 'PE', '-)', 'STADIUMEND', 'SUBROUTINEEND', 'PIPE', 'CYLINDEREND', 'DIAMOND_STOP', 'TAGEND', 'TRAPEND', 'INVTRAPEND', 'UNICODE_TEXT', 'TEXT', 'TAGSTART', got 'PS'

For more information, see https://docs.github.com/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams

Copilot AI commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

@copilot

In documents/analysis/CLIENT_REACTIVE_THROTTLING_ANALYSIS.md...

Fixed in 8d29ff7. I replaced the Mermaid node labels with parser-safe quoted text (removed HTML line breaks and formula symbols from node bodies), and verified the diagram block now uses GitHub-compatible syntax.

spikes when load increases suddenly.

**Without client throttling (problem):** 50 threads across 5 app servers hit one OJP node
at once → the node's 10-slot pool is overwhelmed → 40 requests queue or timeout → the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot this phrase makes it sound like OJP can only handle 10 connections. rephrase it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in commit 4748fa6. I rephrased that line to make clear the 120-slot value is just an example configuration, not an OJP hard limit.

I --> F
```

**Example 1 (normal):**

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot use higher numbers for the examples, when you talk about 1 concurrent connection, that does not show much scalability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in commit 4748fa6. I replaced the low-scale examples with higher-concurrency numbers (for example 240 threads/12 app servers and limits like 27) to better reflect scalability.


---

## Current Design: Two Configurable Modes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot three configurable modes when including the combined mode.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in commit 4748fa6. I changed the section wording to “Three Configurable Modes” and aligned the follow-up sentence accordingly.

@rrobetti rrobetti left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot now you went to the opposite level 120-slot pool is too much for an example use 40 and adapt everything else to thatscale in the analysis file

Copilot AI commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

@copilot now you went to the opposite level 120-slot pool is too much for an example use 40 and adapt everything else to thatscale in the analysis file

Updated in 1ddbd94. I changed the analysis examples to a 40-slot scale and adjusted the related numbers consistently (intro scenario, computed limit, and normal-mode example).

@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants