Composite client role mappings endpoint is slow and degrades under concurrency with many client roles #47157

@askosyrskiy

Before reporting an issue

  • I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.

Area

admin/api

Describe the bug

We use a single realm with one client where client roles represent fine-grained permissions tied to individual resources (e.g., per-site or per-entity access controls). As resources are onboarded, new client roles are created — we currently have ~2,400 client roles and this count grows continuously. Users are assigned ~20 direct roles, roughly a dozen of which are composite roles each bundling ~16-17 leaf roles, resulting in ~212 effective roles per user. Our services rely on the Admin API endpoint GET /admin/realms/{realm}/users/{user-id}/role-mappings/clients/{client-id}/composite to resolve a user's effective client role mappings.

The getCompositeClientRoleMappings() implementation iterates over all client roles (not just the user's assigned roles) and calls user.hasRole() on each one. hasRole() in turn delegates to KeycloakModelUtils.searchFor(), which recursively expands composite roles using a fresh HashSet on every invocation — there is no memoization across calls. This produces O(C × M × D) complexity, where C is the total number of client roles, M is the number of the user's direct role mappings, and D is the composite expansion depth. In our case, this means roughly 800,000 recursive role-containment checks per single API call.
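The access pattern described above can be illustrated on a toy model. This is only a sketch, not Keycloak code: `NaiveRoleCheck`, `contains`, `hasRole`, and `naiveEffective` are hypothetical names, roles are plain strings, and composites are a simple map. It mimics the shape of the problem (one recursive search with a fresh HashSet per candidate role) and counts how many role nodes get visited.

```java
import java.util.*;

public class NaiveRoleCheck {
    // Toy model (assumption): composite role name -> names of its child roles.
    static final Map<String, List<String>> composites = new HashMap<>();
    // Counts every role node touched by the recursive search.
    static long visits = 0;

    // Recursive containment search; 'visited' protects against cycles.
    static boolean contains(String current, String target, Set<String> visited) {
        visits++;
        if (current.equals(target)) return true;
        if (!visited.add(current)) return false;
        for (String child : composites.getOrDefault(current, List.of())) {
            if (contains(child, target, visited)) return true;
        }
        return false;
    }

    // Stand-in for hasRole(): a fresh HashSet per call, nothing reused between calls.
    static boolean hasRole(List<String> direct, String target) {
        Set<String> visited = new HashSet<>();
        for (String d : direct) {
            if (contains(d, target, visited)) return true;
        }
        return false;
    }

    // Stand-in for the endpoint: tests every client role, assigned or not.
    static Set<String> naiveEffective(List<String> allClientRoles, List<String> direct) {
        Set<String> result = new TreeSet<>();
        for (String role : allClientRoles) {
            if (hasRole(direct, role)) result.add(role);
        }
        return result;
    }
}
```

In this model, every role the user does not hold forces a full walk of the user's composite tree, so `visits` grows with the total number of client roles even when the user's assignments stay fixed.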

Under concurrency, this causes severe latency degradation. A single request completes in ~1s, but under moderate load (10-60 parallel requests), response times spike to 7-23s due to CPU saturation and GC pressure from the large number of short-lived HashSet allocations. The database is not the bottleneck — we observe zero DB queries during these requests and a 99.96% Infinispan cache hit ratio. The problem is purely algorithmic: the work is proportional to the total number of client roles in the realm rather than the number of roles assigned to the user, and it will get progressively worse as more client roles are added.

Related issues:

Version

26.5.5

Regression

  • The issue is a regression

Expected behavior

The endpoint should return effective client role mappings in time proportional to the user's role count, not the total number of client roles in the realm. RoleUtils.expandCompositeRoles() already exists in the codebase and performs BFS expansion in O(M × D) — it should be used here instead of iterating all client roles with user.hasRole().

Actual behavior

ClientRoleMappingsResource.getCompositeClientRoleMappings() (lines 130-145 in services/src/main/java/org/keycloak/services/resources/admin/ClientRoleMappingsResource.java) iterates all client roles via client.getRolesStream() and filters with user.hasRole(). Each hasRole() call triggers recursive composite expansion through KeycloakModelUtils.searchFor() with a new HashSet per invocation.

Current code:

Stream<RoleModel> roles = client.getRolesStream();   // ALL client roles (2,400+)
return roles.filter(user::hasRole).map(toBriefRepresentation);  // hasRole() per role

Proposed fix using existing RoleUtils.expandCompositeRoles():

Set<RoleModel> directRoles = user.getRoleMappingsStream().collect(Collectors.toSet());
Set<RoleModel> effectiveRoles = RoleUtils.expandCompositeRoles(directRoles);
return effectiveRoles.stream()
        .filter(r -> r.isClientRole() && r.getContainerId().equals(client.getId()))
        .map(toBriefRepresentation);

This changes the algorithm from O(C × M × D) to O(M × D + C), eliminating ~800,000 recursive checks per request.
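For readers unfamiliar with the expand-then-filter idea, here is a minimal standalone sketch of it. It is not the body of RoleUtils.expandCompositeRoles(); `ExpandOnce`, the `Role` record, and the `children` map are hypothetical stand-ins for Keycloak's RoleModel and composite lookups. The point is that each role is enqueued at most once, and the client filter runs only over the expanded set.

```java
import java.util.*;
import java.util.stream.*;

public class ExpandOnce {
    // Toy role type (assumption); Keycloak's RoleModel is richer.
    record Role(String name, String containerId, boolean clientRole) {}

    // Breadth-first expansion: 'seen' guarantees each role is enqueued at most once,
    // which also makes cycles between composites harmless.
    static Set<Role> expand(Collection<Role> direct, Map<Role, List<Role>> children) {
        Set<Role> seen = new LinkedHashSet<>(direct);
        Deque<Role> queue = new ArrayDeque<>(seen);
        while (!queue.isEmpty()) {
            Role current = queue.poll();
            for (Role child : children.getOrDefault(current, List.of())) {
                if (seen.add(child)) queue.add(child);
            }
        }
        return seen;
    }

    // Expand once, then keep only the roles that belong to the given client.
    static List<String> effectiveClientRoles(Collection<Role> direct,
                                             Map<Role, List<Role>> children,
                                             String clientId) {
        return expand(direct, children).stream()
                .filter(r -> r.clientRole() && r.containerId().equals(clientId))
                .map(Role::name)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

The work in this sketch depends only on the roles reachable from the user's direct mappings; client roles the user cannot reach are never touched.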

Validated results (patched image deployed to a production-like environment, 6 replicas):

| Scenario | Before | After | Improvement |
| --- | --- | --- | --- |
| Sequential (warm cache) | 1.009s | 0.115s | 8.8x |
| 60 concurrent avg | 6.822s | 0.604s | 11x |
| 60 concurrent p95 | 8.794s | 0.663s | 13x |
| Sustained 60 rps / 30s avg | ~7-23s | 0.144s | 50-160x |

Responses are byte-for-byte identical between patched and unpatched versions (212 roles, same IDs and content).

How to Reproduce?

  1. Create a realm with a client containing ~2,400+ roles
  2. Create ~12 composite roles, each bundling ~16-17 leaf client roles
  3. Assign ~20 direct roles (mix of composite and leaf) to a user, resulting in ~212 effective roles
  4. Call GET /admin/realms/{realm}/users/{user-id}/role-mappings/clients/{client-id}/composite
  5. Observe ~1s latency for a single request
  6. Send 60 concurrent requests to the same endpoint
  7. Observe average latency of 6-7s, p95 of 8-9s

The latency scales with the total number of client roles, not the user's role count. Adding more client roles (even unrelated to the user) increases per-request latency.
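Steps 6 and 7 can be approximated with a small probe like the one below. This is a sketch under assumptions, not the harness used for the numbers in this report: `LatencyProbe`, `run`, `percentile`, and `average` are hypothetical names, and the request is passed in as a `Callable` so any HTTP client can be plugged in. It releases all workers at once and reports per-request latency in milliseconds.

```java
import java.util.*;
import java.util.concurrent.*;

public class LatencyProbe {
    // Runs 'request' once on each of 'concurrency' threads, all released together,
    // and returns the per-request latencies in milliseconds.
    static List<Double> run(Callable<?> request, int concurrency) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        CountDownLatch start = new CountDownLatch(1);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int i = 0; i < concurrency; i++) {
                futures.add(pool.submit(() -> {
                    start.await();                 // wait for the common start signal
                    long t0 = System.nanoTime();
                    request.call();
                    return (System.nanoTime() - t0) / 1_000_000.0;
                }));
            }
            start.countDown();                     // release every worker at once
            List<Double> latencies = new ArrayList<>();
            for (Future<Double> f : futures) latencies.add(f.get());
            return latencies;
        } finally {
            pool.shutdown();
        }
    }

    // Nearest-rank percentile, with p in (0, 100].
    static double percentile(List<Double> values, double p) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    static double average(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }
}
```

To point it at the endpoint, the `Callable` could wrap a `java.net.http.HttpClient` GET against the composite role-mappings URL with an admin bearer token (URL and token are placeholders you supply), then call `run(request, 60)` and compare `average` and `percentile(latencies, 95)` before and after the patch.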

Anything else?

We have a patched image deployed and validated in our environment. We plan to submit a PR with the fix and integration tests. The fix uses RoleUtils.expandCompositeRoles() which already exists in the codebase (introduced via work on composite role optimization) but was never applied to this specific endpoint.

Note: this issue was investigated and the fix developed with assistance from Claude (Anthropic AI). The algorithmic analysis, patching, deployment, and load test validation were performed by a human engineer.
