-
Notifications
You must be signed in to change notification settings - Fork 617
fix: Do not release lock when job finished #3744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: Do not release lock when job finished #3744
Conversation
cc8023f to
6d5267c
Compare
|
One idea I had was also to ignore |
yes, absolutely! |
📝 WalkthroughWalkthroughReplace count-based scheduler bookkeeping ( Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Runner as JobRunner
participant Scheduler as JobScheduler
participant RL as JobRateLimiter
participant Sem as open_jobs (Semaphore)
rect rgb(245,245,255)
note left of Scheduler: Submission (rate-limited path)
Runner->>Scheduler: submit(job with `_job_slot`)
Scheduler->>RL: record_submission(job, local?/non-local?)
RL-->>Scheduler: allow / queue
alt dispatch
Scheduler->>Sem: acquire()
Sem-->>Scheduler: granted
Scheduler->>Runner: dispatch job
end
end
rect rgb(245,255,245)
note left of Scheduler: Completion (local)
Runner->>Scheduler: job_finished(local)
Scheduler->>RL: notify_completion(local)
Scheduler->>Sem: release() %% release only for local jobs
Scheduler->>Scheduler: update_available_resources (decrement `_job_slot`)
end
rect rgb(255,245,245)
note left of Scheduler: Completion (non-local)
Runner->>Scheduler: job_finished(non-local)
Scheduler->>RL: notify_completion(non-local)
Scheduler--xSem: no immediate release
Scheduler->>Scheduler: update_available_resources (decrement `_job_slot`)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (2 warnings, 1 inconclusive)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (46)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Great, here is my idea I will work on implementing:
From my understanding, this will result in the desired behaviour:
Edits/Notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (3)
src/snakemake/scheduling/job_scheduler.py (2)
585-591: Assertion message nit: say “job slots” instead of “job count”.Update the message to match the new resource name and avoid confusion.
- ), f"Job count is {self.resources['_job_slot']}, but should be {sys.maxsize}" + ), f"Job slots is {self.resources['_job_slot']}, but should be {sys.maxsize}"
597-601: Register only remote jobs: good; minor cleanups.
- Typo in comment (“remtoe”).
- Avoid building a list just to take its length.
- # update job rate limiter with remtoe jobs only - self.job_rate_limiter.register_jobs(len([job for job in selected if not job.is_local])) + # update job rate limiter with remote jobs only + n_remote = sum(1 for job in selected if not job.is_local) + if n_remote: + self.job_rate_limiter.register_jobs(n_remote)src/snakemake/jobs.py (1)
120-121: Drop unused_job_countfield –res_dict["_job_count"]isn’t consumed anywhere in the repo; remove the assignment at src/snakemake/jobs.py:120 and plan full removal (or deprecation warning and removal in the next major release).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/snakemake/jobs.py(2 hunks)src/snakemake/scheduling/job_scheduler.py(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
Files:
src/snakemake/scheduling/job_scheduler.pysrc/snakemake/jobs.py
🧬 Code graph analysis (1)
src/snakemake/scheduling/job_scheduler.py (2)
src/snakemake/jobs.py (6)
name(1285-1286)name(1671-1675)resources(492-513)resources(1520-1534)is_local(1035-1043)is_local(1682-1683)src/snakemake/workflow.py (3)
name(2326-2331)resources(2242-2247)is_local(704-708)
🪛 GitHub Check: Check for spelling errors
src/snakemake/jobs.py
[failure] 108-108:
ocupy ==> occupy
🪛 GitHub Actions: Codespell
src/snakemake/jobs.py
[error] 108-108: codespell: ocupy ==> occupy (typo found by codespell)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (45)
- GitHub Check: tests (10, ubuntu-latest, py313)
- GitHub Check: tests (10, ubuntu-latest, py312)
- GitHub Check: tests (10, windows-2022, py313)
- GitHub Check: tests (9, windows-2022, py313)
- GitHub Check: tests (9, ubuntu-latest, py313)
- GitHub Check: tests (10, ubuntu-latest, py311)
- GitHub Check: tests (8, ubuntu-latest, py313)
- GitHub Check: tests (9, ubuntu-latest, py312)
- GitHub Check: tests (7, ubuntu-latest, py313)
- GitHub Check: tests (9, ubuntu-latest, py311)
- GitHub Check: tests (7, windows-2022, py313)
- GitHub Check: tests (8, windows-2022, py313)
- GitHub Check: tests (6, windows-2022, py313)
- GitHub Check: tests (6, ubuntu-latest, py312)
- GitHub Check: tests (8, ubuntu-latest, py312)
- GitHub Check: tests (8, ubuntu-latest, py311)
- GitHub Check: tests (6, ubuntu-latest, py313)
- GitHub Check: tests (7, ubuntu-latest, py311)
- GitHub Check: tests (7, ubuntu-latest, py312)
- GitHub Check: tests (6, ubuntu-latest, py311)
- GitHub Check: tests (6, macos-latest, py313)
- GitHub Check: tests (5, ubuntu-latest, py313)
- GitHub Check: tests (5, windows-2022, py313)
- GitHub Check: tests (3, windows-2022, py313)
- GitHub Check: tests (4, macos-latest, py313)
- GitHub Check: tests (4, windows-2022, py313)
- GitHub Check: tests (4, ubuntu-latest, py312)
- GitHub Check: tests (5, ubuntu-latest, py311)
- GitHub Check: tests (4, ubuntu-latest, py311)
- GitHub Check: tests (5, ubuntu-latest, py312)
- GitHub Check: tests (4, ubuntu-latest, py313)
- GitHub Check: tests (3, ubuntu-latest, py313)
- GitHub Check: tests (3, macos-latest, py313)
- GitHub Check: tests (2, ubuntu-latest, py313)
- GitHub Check: tests (2, macos-latest, py313)
- GitHub Check: tests (3, ubuntu-latest, py312)
- GitHub Check: tests (2, ubuntu-latest, py312)
- GitHub Check: tests (1, ubuntu-latest, py311)
- GitHub Check: tests (3, ubuntu-latest, py311)
- GitHub Check: tests (1, windows-2022, py313)
- GitHub Check: tests (1, macos-latest, py313)
- GitHub Check: tests (2, ubuntu-latest, py311)
- GitHub Check: tests (1, ubuntu-latest, py312)
- GitHub Check: tests (1, ubuntu-latest, py313)
- GitHub Check: apidocs
🔇 Additional comments (3)
src/snakemake/scheduling/job_scheduler.py (2)
501-505: Do not free_job_slot: LGTM.Excluding
_job_slotfrom free logic is correct since slots are governed by the rate limiter’s timespan, not by job completion.
604-607: Skip_job_slotin resource deduction: LGTM.This preserves the rate limiter as the single source of truth for slot capacity within a timespan.
src/snakemake/jobs.py (1)
119-119: Non-local jobs occupy one job slot: LGTM.This aligns job-level accounting with the scheduler’s
_job_slotcapacity.
| else: | ||
| # Finished local job could mean free execution slot | ||
| if job.is_local: | ||
| self._open_jobs.release() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: remote job completions no longer wake the scheduler when not rate-limited.
With this change, a normal (non–rate-limited) run won’t release the semaphore on remote finishes, risking the main loop blocking indefinitely after the first scheduling round.
Apply:
- else:
- # Finished local job could mean free execution slot
- if job.is_local:
- self._open_jobs.release()
+ else:
+ # Without a rate limiter, any completion should wake the scheduler.
+ # With a rate limiter, only local completions should wake it; remote
+ # completions are governed by the timespan timer to avoid log spam.
+ if self.job_rate_limiter is None or job.is_local:
+ self._open_jobs.release()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| else: | |
| # Finished local job could mean free execution slot | |
| if job.is_local: | |
| self._open_jobs.release() | |
| else: | |
| # Without a rate limiter, any completion should wake the scheduler. | |
| # With a rate limiter, only local completions should wake it; remote | |
| # completions are governed by the timespan timer to avoid log spam. | |
| if self.job_rate_limiter is None or job.is_local: | |
| self._open_jobs.release() |
🤖 Prompt for AI Agents
In src/snakemake/scheduling/job_scheduler.py around lines 517-521, the code only
releases self._open_jobs for local job completions which causes remote
completions to not wake the scheduler when the run is not rate-limited; change
the condition so that the semaphore is released for any finished job when the
scheduler is not currently rate-limited (e.g., replace the current if
job.is_local: release() with a check like if job.is_local or not
self.<rate_limited_flag_or_method> : self._open_jobs.release()), ensuring you
use the correct existing rate-limit attribute/method on the scheduler.
|
For the sake of completeness i made the following to enable testing things related to remote execusion: Snakefile: yaml profile: taskspooler status wrapper: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/snakemake/scheduling/job_scheduler.py (1)
517-518: Critical: unconditional semaphore release defeats rate limiting.The unconditional
release()causes the scheduler to wake on every remote job completion, even when rate-limited. This produces the excessive "Job rate limit reached..." log spam described in the PR objectives. When rate-limited, only local job completions should wake the scheduler—remote completions are governed by the timer at line 373.Apply this diff to condition the wake-up:
else: - self._open_jobs.release() + # Without a rate limiter, any completion should wake the scheduler. + # With a rate limiter, only local completions should wake it; remote + # completions are governed by the timespan timer to avoid log spam. + if self.job_rate_limiter is None or job.is_local: + self._open_jobs.release()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/snakemake/scheduling/job_scheduler.py(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
Files:
src/snakemake/scheduling/job_scheduler.py
🧬 Code graph analysis (1)
src/snakemake/scheduling/job_scheduler.py (1)
src/snakemake/jobs.py (2)
is_local(1035-1043)is_local(1682-1683)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (46)
- GitHub Check: tests (10, ubuntu-latest, py313)
- GitHub Check: tests (10, ubuntu-latest, py311)
- GitHub Check: tests (10, ubuntu-latest, py312)
- GitHub Check: tests (7, ubuntu-latest, py311)
- GitHub Check: tests (10, windows-2022, py313)
- GitHub Check: tests (9, windows-2022, py313)
- GitHub Check: tests (5, ubuntu-latest, py311)
- GitHub Check: tests (9, ubuntu-latest, py313)
- GitHub Check: tests (8, ubuntu-latest, py311)
- GitHub Check: tests (9, ubuntu-latest, py312)
- GitHub Check: tests (8, macos-latest, py313)
- GitHub Check: tests (5, macos-latest, py313)
- GitHub Check: tests (8, ubuntu-latest, py312)
- GitHub Check: tests (8, windows-2022, py313)
- GitHub Check: tests (7, ubuntu-latest, py313)
- GitHub Check: tests (8, ubuntu-latest, py313)
- GitHub Check: tests (7, ubuntu-latest, py312)
- GitHub Check: tests (6, ubuntu-latest, py312)
- GitHub Check: tests (5, windows-2022, py313)
- GitHub Check: tests (6, ubuntu-latest, py313)
- GitHub Check: tests (6, windows-2022, py313)
- GitHub Check: tests (4, macos-latest, py313)
- GitHub Check: tests (9, ubuntu-latest, py311)
- GitHub Check: tests (7, windows-2022, py313)
- GitHub Check: tests (6, macos-latest, py313)
- GitHub Check: tests (2, windows-2022, py313)
- GitHub Check: tests (5, ubuntu-latest, py312)
- GitHub Check: tests (4, windows-2022, py313)
- GitHub Check: tests (6, ubuntu-latest, py311)
- GitHub Check: tests (5, ubuntu-latest, py313)
- GitHub Check: tests (3, ubuntu-latest, py311)
- GitHub Check: tests (4, ubuntu-latest, py313)
- GitHub Check: tests (2, ubuntu-latest, py313)
- GitHub Check: tests (3, windows-2022, py313)
- GitHub Check: tests (4, ubuntu-latest, py311)
- GitHub Check: tests (4, ubuntu-latest, py312)
- GitHub Check: tests (3, ubuntu-latest, py312)
- GitHub Check: tests (3, ubuntu-latest, py313)
- GitHub Check: tests (1, macos-latest, py313)
- GitHub Check: tests (1, ubuntu-latest, py313)
- GitHub Check: tests (2, ubuntu-latest, py311)
- GitHub Check: tests (2, ubuntu-latest, py312)
- GitHub Check: tests (1, ubuntu-latest, py311)
- GitHub Check: tests (1, windows-2022, py313)
- GitHub Check: tests (1, ubuntu-latest, py312)
- GitHub Check: formatting
🔇 Additional comments (6)
src/snakemake/scheduling/job_scheduler.py (6)
103-103: Good: slot-based resource model introduced.The rename from
_job_countto_job_slotaligns with the PR's objective to separate local and remote job tracking. Based on past reviews, ensure that_job_countis fully replaced throughout the codebase, particularly insrc/snakemake/jobs.pyaround line 120.
503-504: LGTM: correct resource freeing behavior.Excluding
_job_slotfrom resource freeing is correct because slot availability is managed by theJobRateLimitertimer, not by individual job completions.
585-588: LGTM: assertion updated correctly.The assertion now correctly verifies that
_job_slotremains unrestricted when not rate-limited, consistent with the slot-based model.
595-595: LGTM: slot resource updated before scheduling.Correctly sets
_job_slotto available capacity before job selection when rate-limited.
598-598: LGTM: local rules correctly excluded from rate limiting.Filtering to register only remote jobs with the rate limiter implements the PR objective and discussion agreement to exclude
localrules from rate limiting (which is primarily intended for SLURM/cluster submissions).
603-607: LGTM: slot resource excluded from job-level updates.Correctly excludes
_job_slotfrom per-job resource accounting, as it's managed byJobRateLimiter.register_jobs()at line 598.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
Is this ready for a merge? |
|
@johanneskoester let me do another test today on our cluster just to make sure. |
|
From my testing i still need to do a bit of work here. The local rules are properly excluded from the rate limit, but still there is an issue with how often the rate limitter is being "hit" in such a way it prints "Job rate limit reached, waiting for free slots. From what i can tell #3421 also might have som overlap with any changes i might make in what resources are assigned to local rules to allow propper job selection. I will wait until it is merged to look at this any further. |
Hi,
I've observed an issue with the rate limiter where the main loop is being rerun by the scheduler before the specified wait time has elapsed.
Initially, I suspected the excessive logging of the message "Job rate limit reached..." was the culprit. However, after further investigation, it appears that the main loop is actually triggered multiple times, not just excessively logged.
Running the following with
--max-jobs-per-timespan "1/1s"will produce the excesive logging messages. (They can become mucm much worse than this. As a workaound i removed the logging on our HPC)I believe the issue may stem from the semaphore being released when a job finishes. It's unclear to me why the release occurs at this point, as a completed job doesn't necessarily imply that a new slot in the rate limiter has become available.
In this PR, I've removed the semaphore release to initiate a discussion. I'm happy to explore other solutions or make additional changes — any clarification or feedback would be appreciated.
Thanks,
A happy Snakemake user
Summary by CodeRabbit
New Features
Performance
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.