Skip to content

add script desi_eval_prod_jobs to summarize production job info#14

Merged
akremin merged 6 commits into
mainfrom
eval_prod_jobs
Oct 14, 2024
Merged

add script desi_eval_prod_jobs to summarize production job info#14
akremin merged 6 commits into
mainfrom
eval_prod_jobs

Conversation

@sbailey

@sbailey sbailey commented Oct 11, 2024

Copy link
Copy Markdown
Contributor

This PR adds a script desi_eval_prod_jobs which will query SLURM for all jobs in a production (including failed, canceled, etc.) so that they can be saved to an output file for later analysis, even after SLURM has forgotten about those jobids (something like 6 months later at NERSC).

Example outputs in /global/cfs/cdirs/desi/users/sjbailey/dev/qinfo generated with:

desi_eval_prod_jobs -s kibo -o kibo-jobs.csv --summary kibo-summary.csv

Currently this keeps columns

JOBID,JOBNAME,PARTITION,CONSTRAINTS,NNODES,SUBMIT,ELIGIBLE,START,END,ELAPSED,STATE,EXITCODE,JOBDESC,NODE_HOURS,GPU

Most of those are taken directly from sacct, but the following are derived for convenience:

  • JOBDESC: production job description, e.g. "ccdcalib", "tilenight", "cumulative", "zpix"
  • NODE_HOURS: ELAPSED hh:mm:ss string parsed into hours and multiplied by NNODES
  • GPU: 1/0 for whether this was run on GPU nodes or not, based on parsing CONSTRAINTS

@akremin do you think this is the right set of columns to keep? CONSTRAINTS and PARTITION might be extraneous. Are there other columns we should keep?

desi_eval_prod_jobs also prints a summary of states per jobdesc; that was saved separately into kibo-summary.csv, but it can also be re-generated by re-reading the full jobs table without having to re-query slurm. This was useful for debugging, and also may be useful if we want to update the summary info in a later iteration:

[login29 qinfo] desi_eval_prod_jobs -i kibo-jobs.csv
  JOBDESC   CPUGPU NODE_HOURS PERCENT COMPLETED TIMEOUT FAILED CANCELLED NODE_FAIL
----------- ------ ---------- ------- --------- ------- ------ --------- ---------
    linkcal    cpu        0.1     0.0        17       0      1         5         0
nightlybias    cpu        0.2     0.0         3       0      0         0         0
   ccdcalib    cpu      120.1     1.1       768       5     10         6         0
        arc    cpu     1758.0    15.4      3834       7     69        90         0
   psfnight    cpu       15.4     0.1       767       4      0        80         0
       flat    gpu     1527.3    13.4      9803      17      8      1064         0
nightlyflat    gpu       30.2     0.3       762      10      1       225         0
  tilenight    gpu     4076.1    35.8     17124      23     11      5711         1
 cumulative    gpu     2658.8    23.3     14179       7    168      4838         1
       zpix    gpu     1210.5    10.6      5698       1     11        10         0

@sbailey sbailey requested a review from akremin October 11, 2024 23:37

@akremin akremin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this looks nice. I added one comment where we can reduce the amount of hardcoded paths in the script. Another is an optional optimization. And a third to split out the argparser for future testing and maintainability.

Comment thread bin/desi_eval_prod_jobs Outdated
"""

proddir = desispec.io.specprod_root(specprod=specprod, readonly=True)
procfiles = sorted(glob.glob(f'{proddir}/processing_tables/processing_table_{specprod}-????????.csv'))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately desispec.workflow.proctable.read_minimal_full_proctab_cols() doesn't currently read ALL_QIDs but would otherwise be able to serve as a more robust form of this function.

I won't request a cross-module change here, but we could consider adding ALL_QIDs to that cache in the future such that you could use it here. In the meantime, I would suggest we replace your hardcoded logic here for the following which is used in the function I mentioned above:

        proddir = findfile('proctable', night='99999999', readonly=True)
        procfiles = sorted(glob.glob(proddir.replace('99999999', '202?????')))

This has less baked-in assumptions and works in all cases except when 99999999 is in your path.

Comment thread bin/desi_eval_prod_jobs Outdated
Comment thread bin/desi_eval_prod_jobs Outdated
@sbailey

sbailey commented Oct 14, 2024

Copy link
Copy Markdown
Contributor Author

Thanks @akremin. I have added updates for all 3 suggestions.

@weaverba137

Copy link
Copy Markdown
Member

desida is a Python package, so my preference would be to create a file in py/desida and then have bin/desi_eval_prod_jobs be a minimialist sys.exit(main()) script. You could also add unit tests easier that way.

@sbailey

sbailey commented Oct 14, 2024

Copy link
Copy Markdown
Contributor Author

I agree with moving scripts into py/desida/scripts when they need to be called from other pipelining code or have pieces to be re-used, and to enable unit tests, but I'm not sure it is worth it for scripts like this that will be run approximately once a year for major prods.

In particular for unit tests I don't think it is worth it to mock up fake slurm return calls etc. since the most likely failure points that we could/should test would be things like a slurm update changing its formatting and breaking parsing, or an update to production formats breaking parsing. That could only be tested at NERSC using a mini-prod built upon recent nights of daily... I'd rather spend time expanding unit test coverage of desispec itself.

I admit that I'm partially motivated with wanting to move on with a contribution of a "minimal viable product" script that works and provides useful functionality, and only add additional features if/when they are needed. @akremin can you be the tie breaker here for long term maintenance considerations of whether it is worth restructuring and adding unit tests now?

@akremin

akremin commented Oct 14, 2024

Copy link
Copy Markdown
Member

Given that the package is currently uniform in its proper usage of minimal scripts and code in py/desida, I will agree with Ben that we should make that adjustment. I understand it is a rarely used script and we want to get this merged and tagged, but ideally it would be 5 minutes of work to move the code and add a calling script.

I do not think we need to add unit tests. It is good enough that we put it in a form where we could add them later on.

My requests have been resolved. I will wait until you respond to Ben's request before testing so that I'm running on the final version.

@sbailey

sbailey commented Oct 14, 2024

Copy link
Copy Markdown
Contributor Author

Thanks. I moved the functions to py/desida/prodjobs.py and make the script a lightweight wrapper.

@akremin

akremin commented Oct 14, 2024

Copy link
Copy Markdown
Member

@sbailey To test this I ran it on Jura and ran into a crash that I traced back to the healpix qid derivation:

Traceback (most recent call last):
  File "/global/homes/k/kremin/desi/python/desida/bin/desi_eval_prod_jobs", line 10, in <module>
    sys.exit(main())
  File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 185, in main
    qinfo = load_qinfo(args.specprod)
  File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 97, in load_qinfo
    jobdesc_qinfo = queue_info_from_qids(qids, columns=columns)
  File "/global/homes/k/kremin/desi/python/desispec/py/desispec/workflow/queue.py", line 287, in queue_info_from_qids
    qids = np.atleast_1d(qids).astype(int)
ValueError: invalid literal for int() with base 10: 'groupspectra'

This happens for jobdesc = 'zpix', which has multiple entries title 'groupspectra'. Those were likely custom jobs due to memory issues in Jura that we were better about in Kibo.

I don't expect this to be an issue in Kibo or future prods, so you can leave it if you wish. But it may help to at a check at line 55 to verify that qid = os.path.splitext(os.path.basename(filename))[0].split('-')[-1] is an int. Otherwise don't include it in the qid list.

@sbailey

sbailey commented Oct 14, 2024

Copy link
Copy Markdown
Contributor Author

Good catch. I had only tested on Kibo. I added that non-int "qid" check and print a warning message but don't crash.

@akremin

akremin commented Oct 14, 2024

Copy link
Copy Markdown
Member

Thanks for the update. try-except seemed like overkill vs if qid.isnumeric(). However, instead of changing the logic, I've moved the qid parsing into the try-except as well so that the code is also robust to issues there. The file parsing logic seems like it would always be correct, but this gives more protection given that we're doing a try anyway.

@akremin akremin merged commit 33f219f into main Oct 14, 2024
@akremin akremin deleted the eval_prod_jobs branch October 14, 2024 22:24
@sbailey

sbailey commented Oct 14, 2024

Copy link
Copy Markdown
Contributor Author

I didn't know about str.isnumeric but indeed I probably would have used it if I had known about it. I can't think of any string that would cause the filename parsing to crash, but putting it into the try/except is fine.

For the record: I was surprised that "1.1".isnumeric() and "-11".isnumeric() are both False. i.e. it is testing regex "[0-9]+", not "can this string be converted into a number". That happens to be what we want for this case, but beware of other uses.

I will make one post-facto update to main, to use try / except ValueError. Bare excepts can catch way more than intended, masking other bugs (e.g. if we had forgotten to import os). Note that os.path.splitext and str.split('-') will always have [0] and [-1] elements, even if there isn't an extension or '-' in the filename, so we shouldn't have IndexErrors or other parsing problems there. If we do, I want to know about it.

@sbailey

sbailey commented Oct 14, 2024

Copy link
Copy Markdown
Contributor Author

Last (?) note: I ran this for Kibo with

module load desida/main
cd $CFS/desi/users/desi/productions/kibo
desi_eval_prod_jobs -s kibo -o kibo-jobs.csv --summary kibo-job-summary.csv

i.e. this is cached in the productions/kibo directory (formerly labeled_proc_run_files/kibo), not spectro/redux/kibo itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants