The core challenge with token-based rate limiting is that we don't know the token count until after the AI request is complete. This means we can only update the usage counter in Limitador after the resources have already been consumed, making it too late to block the request. The current implementation mitigates this with a two-call approach: a preliminary check during the request phase determines whether the limit is already breached, and a second call after the response updates the counter with the actual token count. While effective, this pattern adds the overhead of a second Limitador call for every AI request.
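The two-call pattern can be sketched as follows. This is a minimal in-process stand-in, not Limitador's actual API: `TokenLimiter`, `check`, `report`, and `handle_ai_request` are hypothetical names chosen for illustration, and the real implementation talks to Limitador over its service interface rather than an in-memory counter.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenLimiter:
    """Hypothetical in-process stand-in for Limitador's counter."""
    limit: int                  # max tokens per window
    window_seconds: int = 60
    _used: int = 0
    _window_start: float = field(default_factory=time.monotonic)

    def _maybe_reset(self) -> None:
        if time.monotonic() - self._window_start >= self.window_seconds:
            self._used = 0
            self._window_start = time.monotonic()

    # Call 1 (request phase): check only. The token count is not yet
    # known, so all we can ask is "is the limit already breached?"
    def check(self) -> bool:
        self._maybe_reset()
        return self._used < self.limit

    # Call 2 (response phase): report actual usage once the AI
    # response is complete and the token count is known.
    def report(self, tokens_used: int) -> None:
        self._maybe_reset()
        self._used += tokens_used


def handle_ai_request(limiter: TokenLimiter, run_inference) -> str:
    if not limiter.check():            # first Limitador call
        return "429 Too Many Requests"
    response, tokens = run_inference() # token count known only now
    limiter.report(tokens)             # second Limitador call
    return response
```

Note the consequence of reporting after the fact: a request that starts while the counter is still under the limit is admitted even if its own tokens push usage past the limit, and only subsequent requests are blocked.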
Sequence Diagram: Token rate limiting and auth