Add scheduling CLI #4996

nilsvu · 2023-05-12T01:47:57Z

Proposed changes

Tool for scheduling runs, submitting jobs, setting up directories for segments and checkpoints, restarting from checkpoints, and resubmitting jobs. It also allows scheduling ranges of runs, e.g. for convergence tests.

Try it like this:

spectre schedule INPUT_FILE -o ./Run

To run a convergence test put something like InitialGridPoints: {{ num_points }} in your input file and try this:

spectre schedule INPUT_FILE -p num_points=6...8 -o './Run_p{{ num_points - 1 }}'

Closes #2987.

Upgrade instructions

Install Jinja2 in your Python environment (it's probably already installed because other packages need it).

Code review checklist

- The code is documented and the documentation renders correctly. Run make doc to generate the documentation locally into BUILD_DIR/docs/html. Then open index.html.
- The code follows the stylistic and code quality guidelines listed in the code review guide.
- The PR lists upgrade instructions and is labeled bugfix or new feature if appropriate.

Further comments

nilsvu · 2023-05-15T18:55:37Z

Notes from a discussion with @wthrowe @knelli2: We decided to look at template inheritance https://jinja.palletsprojects.com/en/3.1.x/templates/#template-inheritance to solve the issue with generating submit scripts.

wthrowe · 2023-05-15T19:07:54Z

Following up on discussion elsewhere: The restart option can be added without duplicating the rest of the command with ${SPECTRE_CHECKPOINT:+ +restart "${SPECTRE_CHECKPOINT}"}. (The :+ expansion gives the stuff to its right, but only if the variable on the left is defined and nonempty.)
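To see what this expansion does, one option is to run it through bash from Python and print the resulting arguments one per line. This is only an illustrative sketch: the executable name ./Executable and the checkpoint paths are placeholders that do not come from the PR, and it assumes bash is available on the PATH.

```python
import os
import subprocess

# Print each resulting argument on its own line so word splitting is visible.
# './Executable' is a placeholder, not a real SpECTRE executable.
SCRIPT = r'''printf '%s\n' ./Executable ${SPECTRE_CHECKPOINT:+ +restart "${SPECTRE_CHECKPOINT}"}'''


def expanded_args(checkpoint=None):
    """Return the argument list bash builds for a given SPECTRE_CHECKPOINT.

    Passing None leaves the variable unset in the child environment.
    """
    env = dict(os.environ)
    env.pop("SPECTRE_CHECKPOINT", None)
    if checkpoint is not None:
        env["SPECTRE_CHECKPOINT"] = checkpoint
    result = subprocess.run(
        ["bash", "-c", SCRIPT],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(expanded_args())  # variable unset: no restart option
    print(expanded_args(""))  # variable empty: no restart option
    print(expanded_args("Checkpoints/Checkpoint000"))  # restart option added
```

Because the inner "${SPECTRE_CHECKPOINT}" is quoted, a checkpoint path that contains spaces should still arrive as a single argument.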

nilsvu · 2023-05-16T00:26:59Z

I implemented the submit script generation. Take a look when you get the chance.

knelli2

Haven't tested this out yet, but did a first pass of review. I'll probably have more things as I test it out.

knelli2 · 2023-05-16T21:58:52Z

src/Parallel/Main.hpp

+  // Return the dir name for the Charm++ checkpoints as well as the prefix for
+  // checkpoint names and their padding. This is a "detail" function so that
+  // these pieces can be defined in one place only.
+  std::tuple<std::string, std::string, size_t> checkpoints_dir_prefix_pad()


Once you make this a full-fledged PR, can you factor these changes into another commit? This is just changing our checkpoint directory structure.

knelli2 · 2023-05-16T22:03:16Z

support/SubmitScripts/SubmitBase.sh

+
+/usr/bin/modulecmd bash list


I think this is specific to Wheeler.

Any reason why this isn't module list?

Because for some reason, module list didn't work on the unlimited queue on Wheeler

knelli2 · 2023-05-16T22:04:31Z

support/SubmitScripts/SubmitBase.sh

+    --queue ${SLURM_JOB_PARTITION} \
+    --time-limit $(squeue -j ${SLURM_JOB_ID} -h --Format TimeLimit) \


Is there a slurm env variable that will give you the current time limit?

Unfortunately no :(

knelli2 · 2023-05-16T22:15:57Z

support/Python/Schedule.py

+# Distributed under the MIT License.
+# See LICENSE.txt for details.


Just a general comment: Could we break this file up into smaller files for the various classes/functions, and maybe even into separate commits? This is a lot for one commit and it's some dense code to go through.

knelli2 · 2023-05-16T22:28:09Z

support/Python/Schedule.py

+            raise ValueError("Specify an 'executable' ('--executable' / '-e') "
+                             "or list one in the input file.") from err


"in the input file metadata as 'Executable:'"

knelli2 · 2023-05-17T17:47:56Z

support/Python/Schedule.py

+        matched_submit_msg = re.match(r"Submitted batch job (\d+)",
+                                      submit_process.stdout)


Is this the guaranteed message on all clusters?

Haven't seen Slurm output anything else, but we can change this if we run into issues

Does it give a useful return status? That should be more robust than text parsing.
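One way to combine both suggestions is to check the exit status first and only then parse the text for the job ID, since the exit status alone cannot carry the ID. The sketch below is not the code in this PR: the function name job_id_from_submission is hypothetical, and it assumes the submit command exits with a nonzero status on failure.

```python
import re
import subprocess


def job_id_from_submission(process):
    """Extract the job ID from a finished submit command.

    'process' is a subprocess.CompletedProcess with captured text output.
    """
    # Rely on the exit status to detect failure (assumption: nonzero = failed).
    if process.returncode != 0:
        raise RuntimeError(
            f"Submission failed with exit code {process.returncode}: "
            f"{process.stderr}"
        )
    # The text is parsed only to recover the job ID after success is known.
    match = re.match(r"Submitted batch job (\d+)", process.stdout)
    if match is None:
        raise ValueError(f"Unexpected submit output: {process.stdout!r}")
    return int(match.group(1))


if __name__ == "__main__":
    # Fake result for illustration; no scheduler is called here.
    fake = subprocess.CompletedProcess(
        args=["sbatch", "Submit.sh"],
        returncode=0,
        stdout="Submitted batch job 12345\n",
        stderr="",
    )
    print(job_id_from_submission(fake))
```

With this split, an unexpected message on a particular cluster surfaces as a clear parsing error instead of being mistaken for a failed submission.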

knelli2 · 2023-05-17T17:50:26Z

support/Python/Schedule.py

+    - Exclusive range: "0..3" or "0..<3" (the latter is clearer, but "<" is a
+      special character in the shell)
+    - Inclusive range: "0...3"


Is this standard notation? In bash {0..5} is inclusive.

I got this from Swift where 1..<3 is a half-open (exclusive) range and 1...3 is a closed (inclusive) range. Ranges are often a source of confusion and ambiguity, and I think the Swift syntax is quite clear.

If this follows Swift syntax, maybe add links to Swift in the relevant places so people can go look at that as well.
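For readers unfamiliar with the Swift-style notation, the three forms from the docstring can be parsed roughly like this. This is a minimal sketch, not the parser in Schedule.py: the function name parse_range and the integer-only restriction are assumptions made for illustration.

```python
import re


def parse_range(text):
    """Parse "0..3" or "0..<3" (exclusive) or "0...3" (inclusive)."""
    # Try the longer operators first so "..." is not read as ".." plus ".".
    match = re.fullmatch(r"(-?\d+)(\.\.\.|\.\.<|\.\.)(-?\d+)", text)
    if match is None:
        raise ValueError(f"Not a range: {text!r}")
    start = int(match.group(1))
    operator = match.group(2)
    stop = int(match.group(3))
    if operator == "...":
        stop += 1  # the closed range includes its upper bound
    return range(start, stop)


if __name__ == "__main__":
    for text in ("0..3", "0..<3", "0...3", "6...8"):
        print(text, list(parse_range(text)))
```

Under this reading, the 6...8 in the convergence-test example from the PR description covers the values 6, 7 and 8.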

knelli2 · 2023-05-17T17:55:18Z

support/Python/Schedule.py

+
+default_submit_script_template = Path(__file__).parent / 'Submit.sh'


Where is this? I don't know what the .parent method of a Path does, but this looks like it's looking for a file named Submit.sh in the support/ directory? Oh also, wouldn't this behavior change if you're in PY_DEV_MODE or not because of linking?

Yep, it's next to the Schedule.py file in support/. That's where CMake puts the machine-specific submit script template. It works with PY_DEV_MODE because the .parent is purely lexical (doesn't resolve symlinks). I'm open to better ideas for how to copy around submit scripts!

Ok this seems fine then
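The difference between the lexical .parent and a resolved path can be checked with a small experiment. The directory layout below is hypothetical (it is not the actual SpECTRE build layout): a symlinked Schedule.py sits next to Submit.sh, while the real file lives elsewhere. It assumes a platform where creating symlinks is permitted.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp).resolve()
    # Hypothetical layout: 'source' holds the real file, 'installed' holds
    # only a symlink to it plus the submit script template.
    source_dir = tmp / "source"
    installed_dir = tmp / "installed"
    source_dir.mkdir()
    installed_dir.mkdir()
    (source_dir / "Schedule.py").touch()
    (installed_dir / "Submit.sh").touch()
    linked_file = installed_dir / "Schedule.py"
    linked_file.symlink_to(source_dir / "Schedule.py")

    # .parent only strips the last path component; it does not follow links.
    lexical = linked_file.parent / "Submit.sh"
    # .resolve() follows the symlink first, so the parent is 'source'.
    resolved = linked_file.resolve().parent / "Submit.sh"

    lexical_found = lexical.exists()
    resolved_found = resolved.exists()
    print("lexical parent finds Submit.sh:", lexical_found)
    print("resolved parent finds Submit.sh:", resolved_found)
```

In this layout only the lexical lookup finds the template, which is the behavior the reply above relies on.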

knelli2 · 2023-05-17T17:56:50Z

support/Python/Schedule.py

+          "(see main help text for possible placeholders)."))
+@click.option('--queue', help=("Name of the queue."))


I forget, is the plan to get these defaults (queue, time limit, nodes/cores) from the Machine file? Or are we getting rid of that?

At the moment I got rid of any use of the machine files in this script. Instead, there's only the submit script template (one per cluster), which sets these defaults. As it stands now we could delete the machine files.

Ok I'll test it out and see how it works

knelli2 · 2023-05-17T17:57:52Z

support/Python/Schedule.py

+@click.option('--num-procs',
+              '-j',


Maybe -c? I know -j is make syntax, but -c is srun syntax. Not sure which is better.

I don't know either. I don't think there's a consensus among other tools. I got used to -j because of make. We could allow both, if that helps.

Yeah do both

nilsvu · 2023-05-18T05:45:54Z

Thanks @knelli2, I implemented your suggestions and squashed because I moved commits / files / functions around. Sorry that that makes the smaller changes harder to find!

wthrowe

Why does Wheeler have special instructions in the initial PR comment? Is that still relevant, or have things become more uniform?

wthrowe · 2023-05-18T21:51:41Z

src/Parallel/Main.hpp

+Main<Metavariables>::checkpoints_dir_prefix_pad() const {
+  const std::string checkpoints_dir = "Checkpoints";
+  const std::string prefix = "Checkpoint";
+  constexpr size_t pad = 3;


Leave it at 6. Characters are cheap.

Yea they're cheap but become hard to read/count at some point. Try this: just glancing at 0000 you immediately know that it's 4 zeros. But glancing at 00000 you don't, you have to actually count. So I can make it 4 characters if you think we need that many, but anything beyond that I'd like to avoid. (also keep in mind that the checkpoint/segment names are actually not set in stone, I want to make sure that none of the code really relies on a hard-coded directory structure but just takes as input the files that it needs).

OK, but why do you want to count the digits? I don't see a reason that should be a design goal.

If you don't want to discuss the counter format in this PR then don't change it.

I think readable directory names are a design goal. No need for excessive digits in my opinion. More important for segments than for checkpoints, so if you prefer I can keep the checkpoint format unchanged in this PR.

I hadn't actually compared them, so I tried creating a list of CheckpointNNN and CheckpointNNNNNN names from 0 to 19 (I got tired of typing at that point). I actually found the six-digit version to be significantly easier to read, because the string of zeros acts as a separator between "Checkpoint" and the number. It makes it easier to visually locate the first nonzero digit. Presumably the same would hold for other prefixes.

[Continuing from a thread below, but consolidating because the topic is the same]

Three digits give us 1000 daily segments, which is almost 3 years. That seems enough for the vast majority of simulations. As mentioned above, the number of digits is not set in stone, so if we want to do a 5-year simulation, or do hourly segments for some reason, or something like that, we can always increase the number of digits for that case.

You assume that segments will always be based on a timer. This has not always been true for SpEC, and while it is a goal for SpECTRE I don't think we should assume that we will always achieve it immediately. SpEC segments could be extremely short (minutes, maybe less) because they were based on some event in the simulation that forced a new segment.

Anyway, @markscheel, you're probably the most experienced SpEC-runner here. Do you want to weigh in?

I see what you mean with the zero-padding visually separating the number from the prefix. How about 4 digits for both segments and checkpoints then? Still not excessive, hence ok to type without having to pay too much attention to the number of keystrokes (which I find kind of hard for 6 digits), but surely enough and visually pleasing.
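To compare the padding widths discussed in this thread side by side, a short script can generate the names and count how many indices each width holds. This is only a sketch: numbered_name and capacity are hypothetical helpers, and the CheckpointNNN form without a separator follows the names used in the comments above.

```python
def numbered_name(prefix, index, pad):
    """Hypothetical helper: prefix followed by a zero-padded index."""
    return f"{prefix}{index:0{pad}d}"


def capacity(pad):
    """Number of distinct indices (starting at 0) that fit in 'pad' digits."""
    return 10**pad


if __name__ == "__main__":
    for pad in (3, 4, 6):
        names = [numbered_name("Checkpoint", i, pad) for i in (0, 7, 19)]
        print(pad, capacity(pad), names)
```

As long as the index fits in the chosen width, sorting the names alphabetically also sorts them numerically. An index that needs more digits still formats (the field simply widens), but it breaks that ordering.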

wthrowe · 2023-05-18T22:01:23Z

support/Python/DirectoryStructure.py

+    ```
+
+    With SpEC we never needed more than 26*26=676 segments (AA-ZZ), so three
+    digits for segments is enough.


Is this true? I vaguely think I've seen three-character SpEC segment IDs.

I haven't seen those, but that doesn't say much.

Three digits give us 1000 daily segments, which is almost 3 years. That seems enough for the vast majority of simulations. As mentioned above, the number of digits is not set in stone, so if we want to do a 5-year simulation, or do hourly segments for some reason, or something like that, we can always increase the number of digits for that case.

I've addressed the design comment in the other thread to try to keep the discussion more organized.

The point I was trying to make here wasn't about your proposed design, but that I don't think "With SpEC we never needed more than 26*26=676 segments" is true, so you shouldn't claim it no matter what we decide for SpECTRE.

wthrowe · 2023-05-18T22:06:46Z

support/Python/Schedule.py

+        if not force:
+            raise OSError(f"File already exists at '{path}'. Delete it or "
+                          "retry with 'force' ('--force' / '-f')." +
+                          (("\n" + error_hint) if error_hint else ""))


[optional] If we were confident that deleting the file was the correct thing to do, we would just do it here, so telling the user to do so may be giving bad instructions. I'd just leave it at "File already exists. Retry with --force to overwrite.".

wthrowe · 2023-05-18T22:07:39Z

support/Python/Schedule.py

+    dest = (dest_dir / src_file.name).resolve()
+    if dest.exists() and not force:
+        raise OSError(f"File already exists at '{dest}'. Delete it or "
+                      "retry with 'force' ('--force' / '-f').")


Same comment as above.

wthrowe · 2023-05-18T22:16:39Z

support/Python/Schedule.py

+        matched_submit_msg = re.match(r"Submitted batch job (\d+)",
+                                      submit_process.stdout)


Does it give a useful return status? That should be more robust than text parsing.

wthrowe · 2023-05-18T22:36:54Z

Forgot to give my overall comment: I'm very happy with how the submit scripts are coming out with the templates. This fixes the various problems with inconsistencies and duplication that we currently have, so this resolves the initial issue in #2987, although that spawned a lot of extra discussion that may not be considered resolved.

nilsvu · 2023-05-19T18:04:09Z

Pushed a fixup @wthrowe

knelli2

A couple of things. You can ignore the formatting one for now, but the phase change comment is because it's not behaving correctly. I have CheckpointAndExitAfterWallclock in my input file when trying to schedule just a standalone run dir, and it doesn't error.

knelli2 · 2023-05-18T19:14:24Z

support/Python/Schedule.py

+
+        - All arguments to this function, including all additional '**kwargs'.


These indents are rendered weird.

- All arguments to this function, including all additional '--params'. For example, the additional '--params' can include parameters controlling resolution in the input file. - 'executable_name': Just the name of the executable (basename of the 'executable' path).

Same for the rest below.

src/Visualization/Python/ReadInputFile.py

nilsvu · 2023-06-03T09:07:39Z

Pushed another fixup for resubmissions ✌️

knelli2 · 2023-06-06T00:21:58Z

Ok you can squash now. I imagine there will be bugs we'll need to iron out, but should be ok for now

nilsvu · 2023-06-06T00:34:31Z

Squashed it all down. @wthrowe there were so many fixups that I lost track of what you've seen and what you haven't. I can probably dig up the git reflog if you prefer to see a single squashed-together fixup on top of your review above.

wthrowe · 2023-06-14T18:52:03Z

The test is failing.

nilsvu · 2023-06-14T22:48:25Z

Rebased!

wthrowe

Mostly looked at the documentation, since I'm not good with python.

wthrowe · 2023-06-14T21:50:16Z

support/Python/Schedule.py

+    incrementing integers and continue the run from the previous segment. For
+    example, the following is a typical 'segments_dir':
+
+    \b


This is probably some python thing that I don't understand, but everything I can find on the internet says this is adding a backspace character. Why?

The \b escape marker prevents click from rewrapping the following paragraph: https://click.palletsprojects.com/en/8.1.x/documentation/#preventing-rewrapping

wthrowe · 2023-06-14T21:57:53Z

support/Python/Schedule.py

+      copy_executable: Copy the executable to the run or segments directory.
+        By default (when set to 'None'):
+          - If '--run-dir' / '-o' is set, don't copy.
+          - If '-segments-dir' / '-O' is set, copy to segments directory to


Typo in number of dashes. Again 4 lines down.

wthrowe · 2023-06-14T22:01:22Z

support/Python/Schedule.py

+
+    Returns: The 'subprocess.CompletedProcess' representing either the process
+      that scheduled the run, or the process that ran the executable if
+      'scheduler' is 'None'. Returns 'None' if no run was scheduled.


What if multiple runs are scheduled?

Changed to "Returns 'None' if no or multiple runs were scheduled."

wthrowe · 2023-06-15T17:41:14Z

support/Python/Schedule.py

+        "Copy the executable to the run or segments directory. "
+        "(1) When no flag is specified: "
+        "If '--run-dir' / '-o' is set, don't copy. "
+        "If '-segments-dir' / '-O' is set, copy to segments "


Dashes again.

nilsvu · 2023-06-16T03:55:51Z

Squashed in those fixes and split the changes into a few more commits

knelli2

Looks good! Excited to start using this

nilsvu · 2023-06-21T20:09:31Z

@wthrowe I think this is ready to go

nilsvu force-pushed the py_schedule branch 8 times, most recently from 9d579fb to 7d6c0cb Compare May 15, 2023 17:03

nilsvu force-pushed the py_schedule branch from 7d6c0cb to 789963e Compare May 16, 2023 00:20

nilsvu force-pushed the py_schedule branch 3 times, most recently from c13fa78 to 7b56453 Compare May 16, 2023 17:04

nilsvu marked this pull request as ready for review May 16, 2023 23:32

knelli2 requested changes May 17, 2023

nilsvu force-pushed the py_schedule branch from 7b56453 to 4be68fc Compare May 18, 2023 05:44

nilsvu force-pushed the py_schedule branch 3 times, most recently from 48e678b to 3f5d656 Compare May 18, 2023 21:45

wthrowe requested changes May 18, 2023

nilsvu mentioned this pull request May 19, 2023

Behavior of submission scripts is not standardized #2987

Closed

nilsvu force-pushed the py_schedule branch from 957646a to 256f6e6 Compare May 19, 2023 17:40

nilsvu force-pushed the py_schedule branch from d654c8c to 8b8e3a7 Compare May 20, 2023 00:50

knelli2 requested changes May 20, 2023

nilsvu force-pushed the py_schedule branch from 8b8e3a7 to b901fcc Compare May 20, 2023 04:13

nilsvu force-pushed the py_schedule branch 3 times, most recently from 0dfc692 to 050b818 Compare June 4, 2023 21:33

nilsvu force-pushed the py_schedule branch 2 times, most recently from bd3d278 to f4c2d4f Compare June 6, 2023 00:32

nilsvu mentioned this pull request Jun 6, 2023

Represent segments/checkpoints in Python, use in status CLI #5078

Merged

3 tasks

nilsvu force-pushed the py_schedule branch from f4c2d4f to 0c840b4 Compare June 6, 2023 17:45

nilsvu mentioned this pull request Jun 6, 2023

Write checkpoints to Checkpoints/ directory #5079

Merged

3 tasks

Add Jinja2 to Py requirements

72e7139

To work with placeholders in input files and submit scripts.

nilsvu force-pushed the py_schedule branch from 0c840b4 to be6efac Compare June 14, 2023 20:26

wthrowe requested changes Jun 15, 2023

nilsvu added 4 commits June 15, 2023 20:51

Add scheduling script to CLI

fc754d4

Add resubmit command to CLI

c3c3d5d

Generate Wheeler and CaltechHPC submit scripts

bc42273

Status CLI: get input file from submit context

c96b5e9

nilsvu force-pushed the py_schedule branch from be6efac to c96b5e9 Compare June 16, 2023 03:54

wthrowe approved these changes Jun 16, 2023

nilsvu requested a review from knelli2 June 16, 2023 21:50

knelli2 approved these changes Jun 17, 2023

wthrowe merged commit 4736675 into sxs-collaboration:develop Jun 21, 2023

nilsvu deleted the py_schedule branch June 27, 2023 17:20

AlexCarpenter46 mentioned this pull request Jul 12, 2023

Frontera Submit Script Update #3792

Closed

3 tasks

nilsvu added the new feature Adds a new feature that's worth highlighting in release notes label Jul 28, 2023
