-
Notifications
You must be signed in to change notification settings - Fork 213
Add scheduling CLI #4996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scheduling CLI #4996
Conversation
9d579fb to
7d6c0cb
Compare
|
Notes from a discussion with @wthrowe @knelli2: We decided to look at template inheritance https://jinja.palletsprojects.com/en/3.1.x/templates/#template-inheritance to solve the issue with generating submit scripts. |
|
Following up on discussion elsewhere: The restart option can be added without duplicating the rest of the command with |
|
I implemented the submit script generation, take a look when you get the chance |
c13fa78 to
7b56453
Compare
knelli2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't tested this out yet, but did a first pass of review. I'll probably have more things as I test it out.
| // Return the dir name for the Charm++ checkpoints as well as the prefix for | ||
| // checkpoint names and their padding. This is a "detail" function so that | ||
| // these pieces can be defined in one place only. | ||
| std::tuple<std::string, std::string, size_t> checkpoints_dir_prefix_pad() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once you make this a full fledged PR, can you factor these changes into another commit? This is just changing out checkpoint directory structure.
support/SubmitScripts/SubmitBase.sh
Outdated
|
|
||
| /usr/bin/modulecmd bash list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is specific to Wheeler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why this isn't module list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because for some reason, module list didn't work on the unlimited queue on Wheeler
support/SubmitScripts/SubmitBase.sh
Outdated
| --queue ${SLURM_JOB_PARTITION} \ | ||
| --time-limit $(squeue -j ${SLURM_JOB_ID} -h --Format TimeLimit) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a slurm env variable that will give you the current time limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately no :(
| # Distributed under the MIT License. | ||
| # See LICENSE.txt for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a general comment: Would we be able to break up this file either into smaller files for the various classes/functions and maybe even separate commits? This is a lot for one commit and it's some dense code to go through.
support/Python/Schedule.py
Outdated
| raise ValueError("Specify an 'executable' ('--executable' / '-e') " | ||
| "or list one in the input file.") from err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"in the input file metadata as 'Executable:'"
support/Python/Schedule.py
Outdated
| matched_submit_msg = re.match(r"Submitted batch job (\d+)", | ||
| submit_process.stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the guaranteed message on all clusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't seen Slurm output anything else, but we can change this if we run into issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it give a useful return status? That should be more robust than text parsing.
| - Exclusive range: "0..3" or "0..<3" (the latter is clearer, but "<" is a | ||
| special character in the shell) | ||
| - Inclusive range: "0...3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this standard notation? In bash {0..5} is inclusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got this from Swift where 1..<3 is a half-open (exclusive) range and 1...3 is a closed (inclusive) range. Ranges are often a source of confusion and ambiguity, and I think the Swift syntax is quite clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this follows Swift syntax, maybe add links to Swift in the relevant places so people can go look at that as well.
support/Python/Schedule.py
Outdated
|
|
||
| default_submit_script_template = Path(__file__).parent / 'Submit.sh' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this? I don't know what the .parent method of a Path does, but this looks like it's looking for a file named Submit.sh in the support/ directory? Oh also, wouldn't this behavior change if you're in PY_DEV_MODE or not because of linking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep it's next to the Schedule.py file in support/. That's were CMake puts the machine-specific submit script template. It works with PY_DEV_MODE because the .parent is purely lexical (doesn't resolve symlinks). I'm open to better ideas how to copy around submit scripts!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this seems fine then
support/Python/Schedule.py
Outdated
| "(see main help text for possible placeholders).")) | ||
| @click.option('--queue', help=("Name of the queue.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget, is the plan to get these defaults (queue, time limit, nodes/cores) from the Machine file? Or are we getting rid of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment I got rid of any use of the machine files in this script. Instead, there's only the submit script template (one per cluster), which sets these defaults. As it stands now we could delete the machine files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I'll test it out and see how it works
support/Python/Schedule.py
Outdated
| @click.option('--num-procs', | ||
| '-j', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe -c? I know this is make syntax, but -c is srun syntax. Not sure which is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know either. I don't think there's a consensus among other tools. I got used to -j because of make. We could allow both, if that helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah do both
|
Thanks @knelli2, I implemented your suggestions and squashed because I moved commits / files / functions around. Sorry that that makes the smaller changes harder to find! |
48e678b to
3f5d656
Compare
wthrowe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does wheeler have special instructions in the initial PR comment? Is that still relevant, or have things become more uniform?
src/Parallel/Main.hpp
Outdated
| Main<Metavariables>::checkpoints_dir_prefix_pad() const { | ||
| const std::string checkpoints_dir = "Checkpoints"; | ||
| const std::string prefix = "Checkpoint"; | ||
| constexpr size_t pad = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave it at 6. Characters are cheap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea they're cheap but become hard to read/count at some point. Try this: just glancing at 0000 you immediately know that it's 4 zeros. But glancing at 00000 you don't, you have to actually count. So I can make it 4 characters if you think we need that many, but anything beyond that I'd like to avoid. (also keep in mind that the checkpoint/segment names are actually not set in stone, I want to make sure that none of the code really relies on a hard-coded directory structure but just takes as input the files that it needs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but why do you want to count the digits? I don't see a reason that should be a design goal.
If you don't want to discuss the counter format in this PR then don't change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think readable directory names are a design goal. No need for excessive digits in my opinion. More important for segments than for checkpoints, so if you prefer I can keep the checkpoint format unchanged in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't actually compared them, so I tried creating a list of CheckpointNNN and CheckpointNNNNNN names from 0 to 19 (I got tired of typing at that point). I actually found the six-digit version to be significantly easier to read, because the string of zeros acts as a separator between "Checkpoint" and the number. It makes it easier to visually locate the first nonzero digit. Presumably the same would hold for other prefixes.
[Continuing from a thread below, but consolidating because the topic is the same]
Three digits give us over 3 years in daily segments. That seems enough for the vast majority of simulations. As mentioned above the number of digits is not set in stone, so if we want to do a 5-year simulation or do hourly segments for some reason or something like that we can always increase the number of digits for that case.
You assume that segments will always be based on a timer. This has not always been true for SpEC, and while it is a goal for SpECTRE I don't think we should assume that we will always achieve it immediately. SpEC segments could be extremely short (minutes, maybe less) because they were based on some event in the simulation that forced a new segment.
Anyway, @markscheel, you're probably the most experienced SpEC-runner here. Do you want to weigh in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean with the zero-padding visually separating the number from the prefix. How about 4 digits for both segments and checkpoints then? Still not excessive, hence ok to type without having to pay too much attention to the number of keystrokes (which I find kind of hard for 6 digits), but surely enough and visually pleasing.
support/Python/DirectoryStructure.py
Outdated
| ``` | ||
| With SpEC we never needed more than 26*26=676 segments (AA-ZZ), so three | ||
| digits for segments is enough. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true? I vaguely think I've seen three-character SpEC segment IDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen those, but that doesn't say much.
Three digits give us over 3 years in daily segments. That seems enough for the vast majority of simulations. As mentioned above the number of digits is not set in stone, so if we want to do a 5-year simulation or do hourly segments for some reason or something like that we can always increase the number of digits for that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addressed the design comment in the other thread to try to keep the discussion more organized.
The point was trying to make here wasn't about your proposed design, but that I don't think "With SpEC we never needed more than 26*26=676 segments" is true, so you shouldn't claim it no matter what we decide for SpECTRE.
support/Python/Schedule.py
Outdated
| if not force: | ||
| raise OSError(f"File already exists at '{path}'. Delete it or " | ||
| "retry with 'force' ('--force' / '-f')." + | ||
| (("\n" + error_hint) if error_hint else "")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[optional] If we were confident that deleting the file was the correct thing to do, we would just do it here, so telling the user to do so may be giving bad instructions. I'd just leave it at "File already exists. Retry with --force to overwrite.".
support/Python/Schedule.py
Outdated
| dest = (dest_dir / src_file.name).resolve() | ||
| if dest.exists() and not force: | ||
| raise OSError(f"File already exists at '{dest}'. Delete it or " | ||
| "retry with 'force' ('--force' / '-f').") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
support/Python/Schedule.py
Outdated
| matched_submit_msg = re.match(r"Submitted batch job (\d+)", | ||
| submit_process.stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it give a useful return status? That should be more robust than text parsing.
|
Forgot to give my overall comment: I'm very happy with how the submit scripts are coming out with the templates. This fixes the various problems with inconsistencies and duplication that we currently have, so this resolves the initial issue in #2987, although that spawned a lot of extra discussion that may not be considered resolved. |
|
Pushed a fixup @wthrowe |
knelli2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple things. You can ignore the formatting one for now, but the phase change comment is because it's not behaving correctly. I have CheckpointAndExitAfterWallclock in my input file when trying to schedule just a stand alone run dir, and it doesn't error.
| - All arguments to this function, including all additional '**kwargs'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These indents are rendered weird.
- All arguments to this function, including all additional '--params'.
For example, the additional '--params' can include parameters
controlling resolution in the input file. - 'executable_name': Just
the name of the executable (basename of the 'executable' path).
Same for the rest below.
|
Pushed another fixup for resubmissions ✌️ |
0dfc692 to
050b818
Compare
|
Ok you can squash now. I imagine there will be bugs we'll need to iron out, but should be ok for now |
bd3d278 to
f4c2d4f
Compare
|
Squashed it all down. @wthrowe there were so many fixups that I lost track of what you've seen and what you haven't. I can probably dig up the git reflog if you prefer to see a single squashed-together fixup on top of your review above. |
|
The test is failing. |
To work with placeholders in input files and submit scripts.
|
Rebased! |
wthrowe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looked at the documentation, since I'm not good with python.
| incrementing integers and continue the run from the previous segment. For | ||
| example, the following is a typical 'segments_dir': | ||
| \b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably some python thing that I don't understand, but everything I can find on the internet says this is adding a backspace character. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The \b escape marker prevents click from rewrapping the following paragraph: https://click.palletsprojects.com/en/8.1.x/documentation/#preventing-rewrapping
support/Python/Schedule.py
Outdated
| copy_executable: Copy the executable to the run or segments directory. | ||
| By default (when set to 'None'): | ||
| - If '--run-dir' / '-o' is set, don't copy. | ||
| - If '-segments-dir' / '-O' is set, copy to segments directory to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in number of dashes. Again 4 lines down.
support/Python/Schedule.py
Outdated
| Returns: The 'subprocess.CompletedProcess' representing either the process | ||
| that scheduled the run, or the process that ran the executable if | ||
| 'scheduler' is 'None'. Returns 'None' if no run was scheduled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if multiple runs are scheduled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to Returns 'None' if no or multiple runs were scheduled.
support/Python/Schedule.py
Outdated
| "Copy the executable to the run or segments directory. " | ||
| "(1) When no flag is specified: " | ||
| "If '--run-dir' / '-o' is set, don't copy. " | ||
| "If '-segments-dir' / '-O' is set, copy to segments " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dashes again.
|
Squashed in those fixes and split the changes into a few more commits |
knelli2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Excited to start using this
|
@wthrowe I think this is ready to go |
Proposed changes
Tool for scheduling runs, submitting jobs, setting up directories for segments and checkpoints, restarting from checkpoints, resubmitting jobs. Also allows to schedule ranges of runs, e.g. for convergence tests.
Try it like this:
To run a convergence test put something like
InitialGridPoints: {{ num_points }}in your input file and try this:spectre schedule INPUT_FILE -p num_points=6...8 -o './Run_p{{ num_points - 1 }}'Closes #2987.
Upgrade instructions
Jinja2in your Python environment (it's probably already installed because other packages need it).Code review checklist
make docto generate the documentation locally intoBUILD_DIR/docs/html.Then open
index.html.code review guide.
bugfixornew featureif appropriate.Further comments