Skip to content

Conversation

@nilsvu
Copy link
Member

@nilsvu nilsvu commented May 12, 2023

Proposed changes

Tool for scheduling runs, submitting jobs, setting up directories for segments and checkpoints, restarting from checkpoints, resubmitting jobs. Also allows to schedule ranges of runs, e.g. for convergence tests.

Try it like this:

spectre schedule INPUT_FILE -o ./Run

To run a convergence test put something like InitialGridPoints: {{ num_points }} in your input file and try this:

spectre schedule INPUT_FILE -p num_points=6...8 -o './Run_p{{ num_points - 1 }}'

Closes #2987.

Upgrade instructions

  • Install Jinja2 in your Python environment (it's probably already installed because other packages need it).

Code review checklist

  • The code is documented and the documentation renders correctly. Run
    make doc to generate the documentation locally into BUILD_DIR/docs/html.
    Then open index.html.
  • The code follows the stylistic and code quality guidelines listed in the
    code review guide.
  • The PR lists upgrade instructions and is labeled bugfix or
    new feature if appropriate.

Further comments

@nilsvu nilsvu force-pushed the py_schedule branch 8 times, most recently from 9d579fb to 7d6c0cb Compare May 15, 2023 17:03
@nilsvu
Copy link
Member Author

nilsvu commented May 15, 2023

Notes from a discussion with @wthrowe @knelli2: We decided to look at template inheritance https://jinja.palletsprojects.com/en/3.1.x/templates/#template-inheritance to solve the issue with generating submit scripts.

@wthrowe
Copy link
Member

wthrowe commented May 15, 2023

Following up on discussion elsewhere: The restart option can be added without duplicating the rest of the command with ${SPECTRE_CHECKPOINT:+ +restart "${SPECTRE_CHECKPOINT}"}. (The :+ expansion gives the stuff to its right, but only if the variable on the left is defined and nonempty.)

@nilsvu
Copy link
Member Author

nilsvu commented May 16, 2023

I implemented the submit script generation, take a look when you get the chance

@nilsvu nilsvu force-pushed the py_schedule branch 3 times, most recently from c13fa78 to 7b56453 Compare May 16, 2023 17:04
@nilsvu nilsvu marked this pull request as ready for review May 16, 2023 23:32
Copy link
Contributor

@knelli2 knelli2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested this out yet, but did a first pass of review. I'll probably have more things as I test it out.

Comment on lines 143 to 148
// Return the dir name for the Charm++ checkpoints as well as the prefix for
// checkpoint names and their padding. This is a "detail" function so that
// these pieces can be defined in one place only.
std::tuple<std::string, std::string, size_t> checkpoints_dir_prefix_pad()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you make this a full fledged PR, can you factor these changes into another commit? This is just changing out checkpoint directory structure.

Comment on lines 39 to 40

/usr/bin/modulecmd bash list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is specific to Wheeler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this isn't module list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because for some reason, module list didn't work on the unlimited queue on Wheeler

Comment on lines 72 to 75
--queue ${SLURM_JOB_PARTITION} \
--time-limit $(squeue -j ${SLURM_JOB_ID} -h --Format TimeLimit) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a slurm env variable that will give you the current time limit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no :(

Comment on lines +1 to +2
# Distributed under the MIT License.
# See LICENSE.txt for details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general comment: Would we be able to break up this file either into smaller files for the various classes/functions and maybe even separate commits? This is a lot for one commit and it's some dense code to go through.

Comment on lines 345 to 346
raise ValueError("Specify an 'executable' ('--executable' / '-e') "
"or list one in the input file.") from err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in the input file metadata as 'Executable:'"

Comment on lines 563 to 522
matched_submit_msg = re.match(r"Submitted batch job (\d+)",
submit_process.stdout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the guaranteed message on all clusters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen Slurm output anything else, but we can change this if we run into issues

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it give a useful return status? That should be more robust than text parsing.

Comment on lines +582 to +644
- Exclusive range: "0..3" or "0..<3" (the latter is clearer, but "<" is a
special character in the shell)
- Inclusive range: "0...3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this standard notation? In bash {0..5} is inclusive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got this from Swift where 1..<3 is a half-open (exclusive) range and 1...3 is a closed (inclusive) range. Ranges are often a source of confusion and ambiguity, and I think the Swift syntax is quite clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this follows Swift syntax, maybe add links to Swift in the relevant places so people can go look at that as well.

Comment on lines 18 to 25

default_submit_script_template = Path(__file__).parent / 'Submit.sh'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this? I don't know what the .parent method of a Path does, but this looks like it's looking for a file named Submit.sh in the support/ directory? Oh also, wouldn't this behavior change if you're in PY_DEV_MODE or not because of linking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep it's next to the Schedule.py file in support/. That's were CMake puts the machine-specific submit script template. It works with PY_DEV_MODE because the .parent is purely lexical (doesn't resolve symlinks). I'm open to better ideas how to copy around submit scripts!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this seems fine then

Comment on lines 735 to 720
"(see main help text for possible placeholders)."))
@click.option('--queue', help=("Name of the queue."))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget, is the plan to get these defaults (queue, time limit, nodes/cores) from the Machine file? Or are we getting rid of that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I got rid of any use of the machine files in this script. Instead, there's only the submit script template (one per cluster), which sets these defaults. As it stands now we could delete the machine files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'll test it out and see how it works

Comment on lines 696 to 675
@click.option('--num-procs',
'-j',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe -c? I know this is make syntax, but -c is srun syntax. Not sure which is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know either. I don't think there's a consensus among other tools. I got used to -j because of make. We could allow both, if that helps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah do both

@nilsvu
Copy link
Member Author

nilsvu commented May 18, 2023

Thanks @knelli2, I implemented your suggestions and squashed because I moved commits / files / functions around. Sorry that that makes the smaller changes harder to find!

@nilsvu nilsvu force-pushed the py_schedule branch 3 times, most recently from 48e678b to 3f5d656 Compare May 18, 2023 21:45
Copy link
Member

@wthrowe wthrowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does wheeler have special instructions in the initial PR comment? Is that still relevant, or have things become more uniform?

Main<Metavariables>::checkpoints_dir_prefix_pad() const {
const std::string checkpoints_dir = "Checkpoints";
const std::string prefix = "Checkpoint";
constexpr size_t pad = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave it at 6. Characters are cheap.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea they're cheap but become hard to read/count at some point. Try this: just glancing at 0000 you immediately know that it's 4 zeros. But glancing at 00000 you don't, you have to actually count. So I can make it 4 characters if you think we need that many, but anything beyond that I'd like to avoid. (also keep in mind that the checkpoint/segment names are actually not set in stone, I want to make sure that none of the code really relies on a hard-coded directory structure but just takes as input the files that it needs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but why do you want to count the digits? I don't see a reason that should be a design goal.

If you don't want to discuss the counter format in this PR then don't change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think readable directory names are a design goal. No need for excessive digits in my opinion. More important for segments than for checkpoints, so if you prefer I can keep the checkpoint format unchanged in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't actually compared them, so I tried creating a list of CheckpointNNN and CheckpointNNNNNN names from 0 to 19 (I got tired of typing at that point). I actually found the six-digit version to be significantly easier to read, because the string of zeros acts as a separator between "Checkpoint" and the number. It makes it easier to visually locate the first nonzero digit. Presumably the same would hold for other prefixes.

[Continuing from a thread below, but consolidating because the topic is the same]

Three digits give us over 3 years in daily segments. That seems enough for the vast majority of simulations. As mentioned above the number of digits is not set in stone, so if we want to do a 5-year simulation or do hourly segments for some reason or something like that we can always increase the number of digits for that case.

You assume that segments will always be based on a timer. This has not always been true for SpEC, and while it is a goal for SpECTRE I don't think we should assume that we will always achieve it immediately. SpEC segments could be extremely short (minutes, maybe less) because they were based on some event in the simulation that forced a new segment.

Anyway, @markscheel, you're probably the most experienced SpEC-runner here. Do you want to weigh in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean with the zero-padding visually separating the number from the prefix. How about 4 digits for both segments and checkpoints then? Still not excessive, hence ok to type without having to pay too much attention to the number of keystrokes (which I find kind of hard for 6 digits), but surely enough and visually pleasing.

```
With SpEC we never needed more than 26*26=676 segments (AA-ZZ), so three
digits for segments is enough.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I vaguely think I've seen three-character SpEC segment IDs.

Copy link
Member Author

@nilsvu nilsvu May 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen those, but that doesn't say much.

Three digits give us over 3 years in daily segments. That seems enough for the vast majority of simulations. As mentioned above the number of digits is not set in stone, so if we want to do a 5-year simulation or do hourly segments for some reason or something like that we can always increase the number of digits for that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed the design comment in the other thread to try to keep the discussion more organized.

The point was trying to make here wasn't about your proposed design, but that I don't think "With SpEC we never needed more than 26*26=676 segments" is true, so you shouldn't claim it no matter what we decide for SpECTRE.

if not force:
raise OSError(f"File already exists at '{path}'. Delete it or "
"retry with 'force' ('--force' / '-f')." +
(("\n" + error_hint) if error_hint else ""))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional] If we were confident that deleting the file was the correct thing to do, we would just do it here, so telling the user to do so may be giving bad instructions. I'd just leave it at "File already exists. Retry with --force to overwrite.".

dest = (dest_dir / src_file.name).resolve()
if dest.exists() and not force:
raise OSError(f"File already exists at '{dest}'. Delete it or "
"retry with 'force' ('--force' / '-f').")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

Comment on lines 563 to 522
matched_submit_msg = re.match(r"Submitted batch job (\d+)",
submit_process.stdout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it give a useful return status? That should be more robust than text parsing.

@wthrowe
Copy link
Member

wthrowe commented May 18, 2023

Forgot to give my overall comment: I'm very happy with how the submit scripts are coming out with the templates. This fixes the various problems with inconsistencies and duplication that we currently have, so this resolves the initial issue in #2987, although that spawned a lot of extra discussion that may not be considered resolved.

@nilsvu
Copy link
Member Author

nilsvu commented May 19, 2023

Pushed a fixup @wthrowe

Copy link
Contributor

@knelli2 knelli2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple things. You can ignore the formatting one for now, but the phase change comment is because it's not behaving correctly. I have CheckpointAndExitAfterWallclock in my input file when trying to schedule just a stand alone run dir, and it doesn't error.

Comment on lines 162 to 190
- All arguments to this function, including all additional '**kwargs'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These indents are rendered weird.

- All arguments to this function, including all additional '--params'.
      For example, the additional '--params' can include parameters
      controlling resolution in the input file.     - 'executable_name': Just
      the name of the executable (basename of the       'executable' path).

Same for the rest below.

@nilsvu
Copy link
Member Author

nilsvu commented Jun 3, 2023

Pushed another fixup for resubmissions ✌️

@nilsvu nilsvu force-pushed the py_schedule branch 3 times, most recently from 0dfc692 to 050b818 Compare June 4, 2023 21:33
@knelli2
Copy link
Contributor

knelli2 commented Jun 6, 2023

Ok you can squash now. I imagine there will be bugs we'll need to iron out, but should be ok for now

@nilsvu nilsvu force-pushed the py_schedule branch 2 times, most recently from bd3d278 to f4c2d4f Compare June 6, 2023 00:32
@nilsvu
Copy link
Member Author

nilsvu commented Jun 6, 2023

Squashed it all down. @wthrowe there were so many fixups that I lost track of what you've seen and what you haven't. I can probably dig up the git reflog if you prefer to see a single squashed-together fixup on top of your review above.

@wthrowe
Copy link
Member

wthrowe commented Jun 14, 2023

The test is failing.

To work with placeholders in input files and submit scripts.
@nilsvu
Copy link
Member Author

nilsvu commented Jun 14, 2023

Rebased!

Copy link
Member

@wthrowe wthrowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looked at the documentation, since I'm not good with python.

incrementing integers and continue the run from the previous segment. For
example, the following is a typical 'segments_dir':
\b
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably some python thing that I don't understand, but everything I can find on the internet says this is adding a backspace character. Why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The \b escape marker prevents click from rewrapping the following paragraph: https://click.palletsprojects.com/en/8.1.x/documentation/#preventing-rewrapping

copy_executable: Copy the executable to the run or segments directory.
By default (when set to 'None'):
- If '--run-dir' / '-o' is set, don't copy.
- If '-segments-dir' / '-O' is set, copy to segments directory to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in number of dashes. Again 4 lines down.

Returns: The 'subprocess.CompletedProcess' representing either the process
that scheduled the run, or the process that ran the executable if
'scheduler' is 'None'. Returns 'None' if no run was scheduled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if multiple runs are scheduled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Returns 'None' if no or multiple runs were scheduled.

"Copy the executable to the run or segments directory. "
"(1) When no flag is specified: "
"If '--run-dir' / '-o' is set, don't copy. "
"If '-segments-dir' / '-O' is set, copy to segments "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dashes again.

@nilsvu
Copy link
Member Author

nilsvu commented Jun 16, 2023

Squashed in those fixes and split the changes into a few more commits

@nilsvu nilsvu requested a review from knelli2 June 16, 2023 21:50
Copy link
Contributor

@knelli2 knelli2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Excited to start using this

@nilsvu
Copy link
Member Author

nilsvu commented Jun 21, 2023

@wthrowe I think this is ready to go

@wthrowe wthrowe merged commit 4736675 into sxs-collaboration:develop Jun 21, 2023
@nilsvu nilsvu deleted the py_schedule branch June 27, 2023 17:20
@nilsvu nilsvu added the new feature Adds a new feature that's worth highlighting in release notes label Jul 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new feature Adds a new feature that's worth highlighting in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Behavior of submission scripts is not standardized

3 participants