add script desi_eval_prod_jobs to summarize production job info by sbailey · Pull Request #14 · desihub/desida

sbailey · 2024-10-11T23:37:34Z

This PR adds a script desi_eval_prod_jobs which will query SLURM for all jobs in a production (including failed, canceled, etc.) so that they can be saved to an output file for later analysis, even after SLURM has forgotten about those jobids (something like 6 months later at NERSC).

Example outputs in /global/cfs/cdirs/desi/users/sjbailey/dev/qinfo generated with:

desi_eval_prod_jobs -s kibo -o kibo-jobs.csv --summary kibo-summary.csv

Currently this keeps columns

JOBID,JOBNAME,PARTITION,CONSTRAINTS,NNODES,SUBMIT,ELIGIBLE,START,END,ELAPSED,STATE,EXITCODE,JOBDESC,NODE_HOURS,GPU

Most of those are taken directly from sacct, but the following are derived for convenience:

JOBDESC: production job description, e.g. "ccdcalib", "tilenight", "cumulative", "zpix"
NODE_HOURS: ELAPSED hh:mm:ss string parsed into hours and multiplied by NNODES
GPU: 1/0 for whether this was run on GPU nodes or not, based on parsing CONSTRAINTS

@akremin do you think this is the right set of columns to keep? CONSTRAINTS and PARTITION might be extraneous. Are there other columns we should keep?

desi_eval_prod_jobs also prints a summary of states per jobdesc; that was saved separately into kibo-summary.csv, but it can also be re-generated by re-reading the full jobs table without having to re-query slurm. This was useful for debugging, and also may be useful if we want to update the summary info in a later iteration:

[login29 qinfo] desi_eval_prod_jobs -i kibo-jobs.csv
  JOBDESC   CPUGPU NODE_HOURS PERCENT COMPLETED TIMEOUT FAILED CANCELLED NODE_FAIL
----------- ------ ---------- ------- --------- ------- ------ --------- ---------
    linkcal    cpu        0.1     0.0        17       0      1         5         0
nightlybias    cpu        0.2     0.0         3       0      0         0         0
   ccdcalib    cpu      120.1     1.1       768       5     10         6         0
        arc    cpu     1758.0    15.4      3834       7     69        90         0
   psfnight    cpu       15.4     0.1       767       4      0        80         0
       flat    gpu     1527.3    13.4      9803      17      8      1064         0
nightlyflat    gpu       30.2     0.3       762      10      1       225         0
  tilenight    gpu     4076.1    35.8     17124      23     11      5711         1
 cumulative    gpu     2658.8    23.3     14179       7    168      4838         1
       zpix    gpu     1210.5    10.6      5698       1     11        10         0

akremin

Thanks this looks nice. I added one comment where we can reduce the amount of hardcoded paths in the script. Another is an optional optimization. And a third to split out the argparser for future testing and maintainability.

akremin · 2024-10-11T23:59:15Z

+    """
+
+    proddir = desispec.io.specprod_root(specprod=specprod, readonly=True)
+    procfiles = sorted(glob.glob(f'{proddir}/processing_tables/processing_table_{specprod}-????????.csv'))


Unfortunately desispec.workflow.proctable.read_minimal_full_proctab_cols() doesn't currently read ALL_QIDs but would otherwise be able to serve as a more robust form of this function.

I won't request a cross-module change here, but we could consider adding ALL_QIDs to that cache in the future such that you could use it here. In the meantime, I would suggest we replace your hardcoded logic here for the following which is used in the function I mentioned above:

proddir = findfile('proctable', night='99999999', readonly=True) procfiles = sorted(glob.glob(proddir.replace('99999999', '202?????')))

This has less baked-in assumptions and works in all cases except when 99999999 is in your path.

sbailey · 2024-10-14T18:01:42Z

Thanks @akremin. I have added updates for all 3 suggestions.

weaverba137 · 2024-10-14T18:07:23Z

desida is a Python package, so my preference would be to create a file in py/desida and then have bin/desi_eval_prod_jobs be a minimialist sys.exit(main()) script. You could also add unit tests easier that way.

sbailey · 2024-10-14T18:57:07Z

I agree with moving scripts into py/desida/scripts when they need to be called from other pipelining code or have pieces to be re-used, and to enable unit tests, but I'm not sure it is worth it for scripts like this that will be run approximately once a year for major prods.

In particular for unit tests I don't think it is worth it to mock up fake slurm return calls etc. since the most likely failure points that we could/should test would be things like a slurm update changing its formatting and breaking parsing, or an update to production formats breaking parsing. That could only be tested at NERSC using a mini-prod built upon recent nights of daily... I'd rather spend time expanding unit test coverage of desispec itself.

I admit that I'm partially motivated with wanting to move on with a contribution of a "minimal viable product" script that works and provides useful functionality, and only add additional features if/when they are needed. @akremin can you be the tie breaker here for long term maintenance considerations of whether it is worth restructuring and adding unit tests now?

akremin · 2024-10-14T19:03:25Z

Given that the package is currently uniform in its proper usage of minimal scripts and code in py/desida, I will agree with Ben that we should make that adjustment. I understand it is a rarely used script and we want to get this merged and tagged, but ideally it would be 5 minutes of work to move the code and add a calling script.

I do not think we need to add unit tests. It is good enough that we put it in a form where we could add them later on.

My requests have been resolved. I will wait until you respond to Ben's request before testing so that I'm running on the final version.

sbailey · 2024-10-14T20:03:44Z

Thanks. I moved the functions to py/desida/prodjobs.py and make the script a lightweight wrapper.

akremin · 2024-10-14T20:40:54Z

@sbailey To test this I ran it on Jura and ran into a crash that I traced back to the healpix qid derivation:

Traceback (most recent call last):
  File "/global/homes/k/kremin/desi/python/desida/bin/desi_eval_prod_jobs", line 10, in <module>
    sys.exit(main())
  File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 185, in main
    qinfo = load_qinfo(args.specprod)
  File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 97, in load_qinfo
    jobdesc_qinfo = queue_info_from_qids(qids, columns=columns)
  File "/global/homes/k/kremin/desi/python/desispec/py/desispec/workflow/queue.py", line 287, in queue_info_from_qids
    qids = np.atleast_1d(qids).astype(int)
ValueError: invalid literal for int() with base 10: 'groupspectra'

This happens for jobdesc = 'zpix', which has multiple entries title 'groupspectra'. Those were likely custom jobs due to memory issues in Jura that we were better about in Kibo.

I don't expect this to be an issue in Kibo or future prods, so you can leave it if you wish. But it may help to at a check at line 55 to verify that qid = os.path.splitext(os.path.basename(filename))[0].split('-')[-1] is an int. Otherwise don't include it in the qid list.

sbailey · 2024-10-14T21:59:41Z

Good catch. I had only tested on Kibo. I added that non-int "qid" check and print a warning message but don't crash.

akremin · 2024-10-14T22:23:25Z

Thanks for the update. try-except seemed like overkill vs if qid.isnumeric(). However, instead of changing the logic, I've moved the qid parsing into the try-except as well so that the code is also robust to issues there. The file parsing logic seems like it would always be correct, but this gives more protection given that we're doing a try anyway.

sbailey · 2024-10-14T22:37:37Z

I didn't know about str.isnumeric but indeed I probably would have used it if I had known about it. I can't think of any string that would cause the filename parsing to crash, but putting it into the try/except is fine.

For the record: I was surprised that "1.1".isnumeric() and "-11".isnumeric() are both False. i.e. it is testing regex "[0-9]+", not "can this string be converted into a number". That happens to be what we want for this case, but beware of other uses.

I will make one post-facto update to main, to use try / except ValueError. Bare excepts can catch way more than intended, masking other bugs (e.g. if we had forgotten to import os). Note that os.path.splitext and str.split('-') will always have [0] and [-1] elements, even if there isn't an extension or '-' in the filename, so we shouldn't have IndexErrors or other parsing problems there. If we do, I want to know about it.

sbailey · 2024-10-14T22:45:03Z

Last (?) note: I ran this for Kibo with

module load desida/main
cd $CFS/desi/users/desi/productions/kibo
desi_eval_prod_jobs -s kibo -o kibo-jobs.csv --summary kibo-job-summary.csv

i.e. this is cached in the productions/kibo directory (formerly labeled_proc_run_files/kibo), not spectro/redux/kibo itself.

add script desi_eval_prod_jobs to summarize production job info

a8160a8

sbailey requested a review from akremin October 11, 2024 23:37

akremin reviewed Oct 12, 2024

View reviewed changes

sbailey added 2 commits October 14, 2024 10:37

less hardcoded assumption about proctable paths and names

42e385e

optimizations; separate parse function

79c488b

refactor script to lightweight wrapper of functions

a91dd78

Robust to non-standard zpix logfile names

c972995

try-except for zpix file parse

db77e78

akremin merged commit 33f219f into main Oct 14, 2024

akremin deleted the eval_prod_jobs branch October 14, 2024 22:24

Conversation

sbailey commented Oct 11, 2024

Uh oh!

akremin left a comment

Choose a reason for hiding this comment

Uh oh!

akremin Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sbailey commented Oct 14, 2024

Uh oh!

weaverba137 commented Oct 14, 2024

Uh oh!

sbailey commented Oct 14, 2024

Uh oh!

akremin commented Oct 14, 2024

Uh oh!

sbailey commented Oct 14, 2024

Uh oh!

akremin commented Oct 14, 2024

Uh oh!

sbailey commented Oct 14, 2024

Uh oh!

akremin commented Oct 14, 2024

Uh oh!

sbailey commented Oct 14, 2024

Uh oh!

sbailey commented Oct 14, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants