add script desi_eval_prod_jobs to summarize production job info#14
Conversation
akremin
left a comment
There was a problem hiding this comment.
Thanks this looks nice. I added one comment where we can reduce the amount of hardcoded paths in the script. Another is an optional optimization. And a third to split out the argparser for future testing and maintainability.
| """ | ||
|
|
||
| proddir = desispec.io.specprod_root(specprod=specprod, readonly=True) | ||
| procfiles = sorted(glob.glob(f'{proddir}/processing_tables/processing_table_{specprod}-????????.csv')) |
There was a problem hiding this comment.
Unfortunately desispec.workflow.proctable.read_minimal_full_proctab_cols() doesn't currently read ALL_QIDs but would otherwise be able to serve as a more robust form of this function.
I won't request a cross-module change here, but we could consider adding ALL_QIDs to that cache in the future such that you could use it here. In the meantime, I would suggest we replace your hardcoded logic here for the following which is used in the function I mentioned above:
proddir = findfile('proctable', night='99999999', readonly=True)
procfiles = sorted(glob.glob(proddir.replace('99999999', '202?????')))
This has less baked-in assumptions and works in all cases except when 99999999 is in your path.
|
Thanks @akremin. I have added updates for all 3 suggestions. |
|
desida is a Python package, so my preference would be to create a file in |
|
I agree with moving scripts into py/desida/scripts when they need to be called from other pipelining code or have pieces to be re-used, and to enable unit tests, but I'm not sure it is worth it for scripts like this that will be run approximately once a year for major prods. In particular for unit tests I don't think it is worth it to mock up fake slurm return calls etc. since the most likely failure points that we could/should test would be things like a slurm update changing its formatting and breaking parsing, or an update to production formats breaking parsing. That could only be tested at NERSC using a mini-prod built upon recent nights of daily... I'd rather spend time expanding unit test coverage of desispec itself. I admit that I'm partially motivated with wanting to move on with a contribution of a "minimal viable product" script that works and provides useful functionality, and only add additional features if/when they are needed. @akremin can you be the tie breaker here for long term maintenance considerations of whether it is worth restructuring and adding unit tests now? |
|
Given that the package is currently uniform in its proper usage of minimal scripts and code in py/desida, I will agree with Ben that we should make that adjustment. I understand it is a rarely used script and we want to get this merged and tagged, but ideally it would be 5 minutes of work to move the code and add a calling script. I do not think we need to add unit tests. It is good enough that we put it in a form where we could add them later on. My requests have been resolved. I will wait until you respond to Ben's request before testing so that I'm running on the final version. |
|
Thanks. I moved the functions to py/desida/prodjobs.py and make the script a lightweight wrapper. |
|
@sbailey To test this I ran it on Jura and ran into a crash that I traced back to the healpix qid derivation: This happens for jobdesc = 'zpix', which has multiple entries title 'groupspectra'. Those were likely custom jobs due to memory issues in Jura that we were better about in Kibo. I don't expect this to be an issue in Kibo or future prods, so you can leave it if you wish. But it may help to at a check at line 55 to verify that |
|
Good catch. I had only tested on Kibo. I added that non-int "qid" check and print a warning message but don't crash. |
|
Thanks for the update. |
|
I didn't know about For the record: I was surprised that I will make one post-facto update to main, to use try / except ValueError. Bare excepts can catch way more than intended, masking other bugs (e.g. if we had forgotten to import os). Note that |
|
Last (?) note: I ran this for Kibo with i.e. this is cached in the productions/kibo directory (formerly labeled_proc_run_files/kibo), not spectro/redux/kibo itself. |
This PR adds a script
desi_eval_prod_jobswhich will query SLURM for all jobs in a production (including failed, canceled, etc.) so that they can be saved to an output file for later analysis, even after SLURM has forgotten about those jobids (something like 6 months later at NERSC).Example outputs in /global/cfs/cdirs/desi/users/sjbailey/dev/qinfo generated with:
Currently this keeps columns
Most of those are taken directly from sacct, but the following are derived for convenience:
JOBDESC: production job description, e.g. "ccdcalib", "tilenight", "cumulative", "zpix"NODE_HOURS: ELAPSED hh:mm:ss string parsed into hours and multiplied by NNODESGPU: 1/0 for whether this was run on GPU nodes or not, based on parsing CONSTRAINTS@akremin do you think this is the right set of columns to keep? CONSTRAINTS and PARTITION might be extraneous. Are there other columns we should keep?
desi_eval_prod_jobsalso prints a summary of states per jobdesc; that was saved separately into kibo-summary.csv, but it can also be re-generated by re-reading the full jobs table without having to re-query slurm. This was useful for debugging, and also may be useful if we want to update the summary info in a later iteration: