Error when resubmitting queue failures #2413

Open
akremin opened this issue Nov 19, 2024 · 0 comments

akremin commented Nov 19, 2024

The queue failure resubmission script appears to have a subtle bug in its cross-night dependency tracking. It surfaced in daily during cleanup activities. My hypothesis is that the tile was also problematic on another night and was rerun there, which gave the job a new internal ID that differs from the one recorded in this processing table. That is a known shortcoming of the cross-night system.

To do: look into this further, then either fix the bug if there is one or, if the cause is the shortcoming described above, think about ways to mitigate it.

INFO:processing.py:1273:recursive_submit_failed: Identified row 241114049 as needing resubmission.
INFO:processing.py:1274:recursive_submit_failed: 241114049: Expid(s): [262809]  Job: cumulative
INFO:processing.py:1288:recursive_submit_failed: Internal ID: 241017044 not in id_to_row_map. This is expected since it's from another day. 
INFO:proctable.py:647:read_minimal_full_proctab_cols: Loading the following processing tables for full processing table cache from directory: /global/cfs/cdirs/desi/spectro/redux/daily/processing_tables, filenames: ['processing_table_daily-20241017.csv']
INFO:proctable.py:680:read_minimal_full_proctab_cols: Caching processing table rows for full cache
INFO:proctable.py:712:update_full_ptab_cache: Replacing all current entries in processing table cache for nights [20241017]
INFO:queue.py:495:update_from_queue: qtable not provided, querying Slurm using ptab's LATEST_QID set
INFO:queue.py:502:update_from_queue: Querying Slurm for 0 QIDs from table of length 41.
INFO:queue.py:507:update_from_queue: No QIDs left to query. Returning the original table.
INFO:proctable.py:712:update_full_ptab_cache: Replacing all current entries in processing table cache for nights [20241017]
INFO:processing.py:1288:recursive_submit_failed: Internal ID: 241110049 not in id_to_row_map. This is expected since it's from another day. 
INFO:proctable.py:647:read_minimal_full_proctab_cols: Loading the following processing tables for full processing table cache from directory: /global/cfs/cdirs/desi/spectro/redux/daily/processing_tables, filenames: ['processing_table_daily-20241110.csv']
INFO:proctable.py:680:read_minimal_full_proctab_cols: Caching processing table rows for full cache
INFO:proctable.py:712:update_full_ptab_cache: Replacing all current entries in processing table cache for nights [20241110]
INFO:queue.py:495:update_from_queue: qtable not provided, querying Slurm using ptab's LATEST_QID set
INFO:queue.py:502:update_from_queue: Querying Slurm for 13 QIDs from table of length 94.
INFO:queue.py:331:queue_info_from_qids: Querying Slurm with the following: sacct -X --parsable2 --delimiter=, --format=jobid,jobname,partition,submit,eligible,start,end,elapsed,state,exitcode -j 33078756,33078758,33078759,33078760,33078761,33078762,33078763,33078764,33078765,33078767,33078768,33078770,33078771
INFO:queue.py:511:update_from_queue: Slurm returned information on 13 jobs out of 94 jobs in the ptab. Updating those now.
Traceback (most recent call last):
  File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/desispec/main/bin/desi_proc_night", line 141, in <module>
    proc_night(**args.__dict__)
  File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/desispec/main/py/desispec/scripts/proc_night.py", line 376, in proc_night
    ptable, nsubmits = update_and_recursively_submit(ptable,
  File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/desispec/main/py/desispec/workflow/processing.py", line 1224, in update_and_recursively_submit
    proc_table, submits = recursive_submit_failed(rown, proc_table, submits,
  File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/desispec/main/py/desispec/workflow/processing.py", line 1300, in recursive_submit_failed
    entry = reftab[reftab['INTID'] == idep][0]
  File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/table/table.py", line 2064, in __getitem__
    return self.Row(self, item)
  File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/astropy/table/row.py", line 41, in __init__
    raise IndexError(
IndexError: index 0 out of range for table with length 0
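
One possible mitigation, sketched below under stated assumptions, is to guard the lookup that fails at `entry = reftab[reftab['INTID'] == idep][0]` so that a missing internal ID does not raise, and to optionally fall back to the newest row for the same tile and job type. This is not the desispec implementation: `find_dependency_row` is a hypothetical helper, the table is modeled as a plain list of dicts rather than an astropy Table, the `TILEID` and `JOBDESC` column names are assumed, and the example rows are made up.

```python
# Hypothetical sketch: a guarded cross-night dependency lookup.
# The table is a list of dict rows standing in for an astropy Table.
# Only INTID appears in the traceback; TILEID and JOBDESC are assumed columns.

def find_dependency_row(reftab, idep, tileid=None, jobdesc=None):
    """Return the row whose INTID equals idep, or a fallback, or None.

    If no row has the requested INTID and both tileid and jobdesc are
    given, return the row with the largest INTID among rows matching
    that tile and job type. Return None when nothing matches.
    """
    matches = [row for row in reftab if row["INTID"] == idep]
    if matches:
        return matches[0]

    if tileid is not None and jobdesc is not None:
        candidates = [
            row for row in reftab
            if row["TILEID"] == tileid and row["JOBDESC"] == jobdesc
        ]
        if candidates:
            # Assumes a rerun is assigned a larger INTID than the original job.
            return max(candidates, key=lambda row: row["INTID"])

    return None


# Made-up rows: the recorded dependency ID 241110049 is absent from the
# other night's table, as in the traceback above.
reftab = [
    {"INTID": 241110050, "TILEID": 1234, "JOBDESC": "tilenight"},
    {"INTID": 241110060, "TILEID": 1234, "JOBDESC": "tilenight"},
]

entry = find_dependency_row(reftab, 241110049)
if entry is None:
    print("Dependency 241110049 not found; needs a fallback or a skip.")
```

Returning `None` leaves the decision to the caller, which can log the stale ID and skip, or retry with the tile and job type. Whether the fallback is safe depends on whether the rerun job really supersedes the recorded dependency, which this sketch does not verify.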