Bugfix: Check for pending model loads before re-attempting #53

Merged
TheConverseEngineer merged 4 commits into master from check-model-loads
Jun 10, 2026
@TheConverseEngineer

Contributor

This fixes a group of bugs that exhibited the following behavior:

  1. Long model load times would cause multiple load commands to be queued before the first load finished, resulting in redundant loads (which stall inference).
  2. Tasks are uninterruptible while in a load request, causing miniray to time out and fail while the task is waiting to complete redundant loads.
  3. Even if miniray does successfully kill the task, Triton server still honors the load requests and wastes time and CUDA memory trying to complete them, causing subsequent tasks to fail.

This change prevents a load request from being issued when a model of that name is already being loaded. Furthermore, retry attempts now poll instead of blocking at the endpoint, which makes them cancellable by miniray.
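
The guard-then-poll behavior described above might look roughly like the sketch below. It is not the merged code: `FakeClient`, `is_model_loading`, and `load_if_needed` are hypothetical names, and the state strings (`"LOADING"`, `"READY"`, `"UNAVAILABLE"`) are assumptions about what the server reports for a model.

```python
import time


class FakeClient:
    """Hypothetical stand-in for a Triton client; it only records load calls."""

    def __init__(self, states):
        self.states = dict(states)   # model name -> assumed state string
        self.load_calls = []
        self.polls_until_ready = 2   # pretend a pending load finishes after 2 polls

    def get_state(self, model):
        state = self.states.get(model, "UNAVAILABLE")
        if state == "LOADING":
            self.polls_until_ready -= 1
            if self.polls_until_ready <= 0:
                self.states[model] = "READY"
        return state

    def load_model(self, model):
        self.load_calls.append(model)
        self.states[model] = "READY"


def is_model_loading(client, model):
    return client.get_state(model) == "LOADING"


def load_if_needed(client, model, load_timeout=60.0, poll_interval=5.0):
    """Issue a load only if none is pending; otherwise poll until done or timed out."""
    if is_model_loading(client, model):
        deadline = time.monotonic() + load_timeout
        while time.monotonic() < deadline and is_model_loading(client, model):
            time.sleep(poll_interval)  # short sleeps keep the task interruptible
        return client.get_state(model) == "READY"
    client.load_model(model)
    return client.get_state(model) == "READY"
```

The point of the sketch is that a task which finds a load already in flight never issues a second load request; it only waits in short, interruptible steps.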

Comment thread lib/triton_helpers.py Outdated
@haraschax

Contributor

This just fixes the blocking nature of the follow-up requests, not the original load request?

Comment thread lib/triton_helpers.py Outdated
def load_triton_model(client: InferenceServerClient, model: str, config: ModelConfig):
def load_triton_model(client: InferenceServerClient, model: str, config: ModelConfig, load_timeout = 60):
    if _is_model_loading(client, model):
        # If model is loading, wait at most load_timeout for it to finish

Contributor

This comment doesn't really add anything for me; the code seems self-explanatory.

Comment thread lib/triton_helpers.py Outdated
def load_triton_model(client: InferenceServerClient, model: str, config: ModelConfig, load_timeout = 60):
    if _is_model_loading(client, model):
        # If model is loading, wait at most load_timeout for it to finish
        deadline = time.time() + load_timeout

Contributor

time.time isn't monotonic. time.perf_counter is
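
For context, a deadline built on a monotonic clock is unaffected by wall-clock adjustments such as NTP corrections. Below is a minimal sketch using the standard library's `time.monotonic`, which, like `time.perf_counter`, never goes backwards. `wait_until` and its parameters are hypothetical names, not part of this PR.

```python
import time


def wait_until(predicate, timeout, interval=0.01):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check once the deadline has passed


# The standard library reports which clocks are monotonic on this platform.
for name in ("time", "monotonic", "perf_counter"):
    print(name, time.get_clock_info(name).monotonic)
```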

@TheConverseEngineer

Contributor Author

Correct. However, the original load request can now be made with a very short timeout, which keeps it under the 30-second threshold that miniray uses to decide whether a canceled job has actually stopped.

Comment thread lib/triton_helpers.py Outdated
        # If model is loading, wait at most load_timeout for it to finish
        deadline = time.time() + load_timeout
        while time.time() < deadline and _is_model_loading(client, model):
            time.sleep(min(5, load_timeout / 5))

Contributor

This seems like a very random choice of sleep time, and it is difficult to read. Why not just do time.sleep(5)?
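
To make the question concrete, the snippet below evaluates the expression `min(5, load_timeout / 5)` for a few timeouts. The helper name `sleep_interval` is hypothetical, and the sample values other than the default of 60 are chosen only for illustration.

```python
def sleep_interval(load_timeout):
    # Same expression as in the reviewed line.
    return min(5, load_timeout / 5)


for timeout in (5, 10, 25, 60):
    print(timeout, sleep_interval(timeout))
```

The interval is capped at 5 seconds and only drops below that when `load_timeout` is under 25, so with the default of 60 it behaves the same as a plain `time.sleep(5)`.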

@TheConverseEngineer TheConverseEngineer merged commit 4d183eb into master Jun 10, 2026
1 check passed
@TheConverseEngineer TheConverseEngineer deleted the check-model-loads branch June 10, 2026 22:37