Cori data causing cross-night dependency error in daily processing? #2331

Open
rongpu opened this issue Aug 19, 2024 · 5 comments

rongpu (Contributor) commented Aug 19, 2024

Tile 40062 failed during processing. The tile has been observed on 4 different nights (20220418, 20230620, 20230806, 20240818). @sbailey's hypothesis is that the 2022 night was processed on Cori, so the cross-night dependency check (#2321) fails because that processing does not have a QID.
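
A minimal sketch of how one might check this hypothesis by hand, assuming the field layout visible in the processing-table dump posted later in this thread (TILEID in field 3, NIGHT in field 4, job type in field 12, Slurm QID in field 13, STATUS in field 15, counting from 1); the table directory and the helper are illustrative, not actual desispec code:

# Illustrative only: list every processing-table row for one tile and ask Slurm
# whether it still knows the recorded QID.  Field positions are inferred from
# the table dump later in this thread and may not match the real schema.
import glob
import subprocess

TILEID = "40062"
# Assumed location, taken from the directory shown in the dump below.
TABLE_DIR = "/global/cfs/cdirs/desi/spectro/redux/daily/run/scripts"

def sacct_knows(qid):
    """True if `sacct -j QID` returns at least one allocation line."""
    out = subprocess.run(["sacct", "-j", str(qid), "-X", "--noheader"],
                         capture_output=True, text=True)
    return bool(out.stdout.strip())

for path in sorted(glob.glob(f"{TABLE_DIR}/processing_table_daily-20*.csv")):
    with open(path) as fx:
        for line in fx:
            fields = line.rstrip("\n").split(",")
            if len(fields) > 14 and fields[2] == TILEID:
                night, jobdesc, qid, status = fields[3], fields[11], fields[12], fields[14]
                tag = "known to sacct" if sacct_knows(qid) else "NOT known to sacct"
                print(f"{night} {jobdesc:>11} QID={qid} STATUS={status} {tag}")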

rongpu (Contributor, Author) commented Aug 20, 2024

4 backup tiles from 20240818 are affected by this issue: 42083, 40062, 42946, 42918

abrodze (Member) commented Aug 20, 2024

Adding tile 40689 from 20240819 to this list

abrodze (Member) commented Aug 20, 2024

Tile 41853 failed on 20240819 for potentially related reasons. I purged and reran the data for that tile from 20211212 (which had failed on standard stars) to see whether that would fix the 20240819 failure; that attempt also failed.

akremin (Member) commented Aug 23, 2024

If the Slurm jobid isn't found, the code just doesn't update the STATUS and goes by whatever STATUS was recorded in the processing table. So if these jobs had failed in the past it would cause an issue, but if they were successful it shouldn't (unless there is a bug).

Another possibility is that Cori jobids are now overlapping with Perlmutter jobids, and Slurm is returning the status of a Perlmutter job with the same ID, which may have failed.

I will get to the bottom of this and report back.
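
A rough sketch of that fallback, assuming the status check amounts to "ask sacct, and keep the table STATUS if sacct returns nothing"; the function name is illustrative, not the actual desispec call:

import subprocess

def effective_status(qid, table_status):
    """Illustrative stand-in for the STATUS update: query sacct for this QID and,
    if Slurm no longer knows the job (e.g. a Cori-era QID), fall back to whatever
    STATUS the processing table already recorded.  Note that if the QID has been
    recycled by an unrelated Perlmutter job, this would return that job's state."""
    out = subprocess.run(
        ["sacct", "-j", str(qid), "-X", "--noheader", "--format=State"],
        capture_output=True, text=True)
    lines = out.stdout.strip().splitlines()
    return lines[0].strip() if lines else table_status

# e.g. effective_status(10462343, "SUBMITTED") stays "SUBMITTED" if sacct has
# forgotten job 10462343, which is the situation shown in the dump below.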

@abrodze We should not be purging old processed data unless a new data quality issue has been identified on that old night. It may be that purging would have solved the issue the new code created for that tile, but it wouldn't address the underlying problem for any of the other backup tiles, and it makes the daily dataset even more inhomogeneous.

akremin (Member) commented Aug 24, 2024

An odd twist in this story: I've been digging into tile 40062, and it turns out the first exposure was only processed through sky subtraction, so it is not an issue for redshift dependencies. I have verified that the processing table for the latest night reflects this -- it is only trying to set up dependencies on the two nights in 2023 and on the tilenight job on 20240818 itself. That is all good.

The bad news is that, from what I can tell, the 2023 nights were processed on Perlmutter, but Slurm no longer remembers them in sacct. Dumping some outputs below; I'll dig into this more on Monday and see if there is anything we can do here.

kremin@perlmutter:login25:/global/cfs/cdirs/desi/spectro/redux/daily/run/scripts> grep ",40062," processing_table_daily-202*
processing_table_daily-20220418.csv:130802|,science,40062,20220418,,skysub,low_sn|,a0123456789,0,220418125,,prestdstar,57981119,1650364155,COMPLETED,,220418020|,57967875|,57981119|
processing_table_daily-20230620.csv:186106|,science,40062,20230620,,all,|,a0123456789,0,230620061,,tilenight,10462343,1687358678,SUBMITTED,,230620020|,10447592|,10450042|10462343|
processing_table_daily-20230620.csv:186106|,science,40062,20230620,,all,|,a0123456789,0,230620062,,cumulative,10462372,1687358705,SUBMITTED,,230620061|,10462343|,10450043|10462372|
processing_table_daily-20230806.csv:188993|,science,40062,20230806,,all,|,a0123456789,0,230806022,,tilenight,13503127,1691454184,PENDING,,230806020|,13503119|,13503127|
processing_table_daily-20230806.csv:188993|,science,40062,20230806,,all,|,a0123456789,0,230806023,,cumulative,13503148,1691454214,PENDING,,230806022|,13503127|,13503148|
processing_table_daily-20230806.csv.20230821_13h47m:188993|,science,40062,20230806,,all,|,a0123456789,0,230806022,,tilenight,13503127,1691454184,PENDING,,230806020|,13503119|,13503127|
processing_table_daily-20230806.csv.20230821_13h47m:188993|,science,40062,20230806,,all,|,a0123456789,0,230806023,,cumulative,13503148,1691454214,PENDING,,230806022|,13503127|,13503148|
processing_table_daily-20240818.csv:249032|,science,40062,20240818,,all,|,a0123456789,0,240818030,,tilenight,29563922,1724091743,COMPLETED,,240818019|,29563743|,29563922|
processing_table_daily-20240818.csv:249032|,science,40062,20240818,,all,|,a0123456789,0,240818031,,cumulative,1,-99,UNSUBMITTED,,240818030|230620061|230806022|,10462343|13503127|29563922|,|
kremin@perlmutter:login25:/global/cfs/cdirs/desi/spectro/redux/daily/run/scripts> sacct -j 10462343,13503127,29563922
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
29563922     tilenight+ realtime_+     desi_g        128  COMPLETED      0:0 
29563922.ba+      batch                desi_g        128  COMPLETED      0:0 
29563922.ex+     extern                desi_g        128  COMPLETED      0:0 
29563922.0   desi_mps_+                desi_g        128  COMPLETED      0:0 
kremin@perlmutter:login25:/global/cfs/cdirs/desi/spectro/redux/daily/run/scripts> sacct -j 10462343
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 

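For finding other affected tiles without checking them one by one, a rough sketch with the same caveats and inferred field positions as above: scan the daily processing tables for tilenight/cumulative rows whose recorded STATUS is neither COMPLETED nor UNSUBMITTED and whose QID sacct no longer recognizes.

# Illustrative scan for other rows in the same situation: a stale table STATUS
# plus a QID that sacct no longer knows about.  Field positions follow the dump
# above and may not match the real processing-table schema.
import glob
import subprocess

TABLE_DIR = "/global/cfs/cdirs/desi/spectro/redux/daily/run/scripts"  # assumed location

def sacct_state(qid):
    out = subprocess.run(["sacct", "-j", str(qid), "-X", "--noheader", "--format=State"],
                         capture_output=True, text=True)
    lines = out.stdout.strip().splitlines()
    return lines[0].strip() if lines else None

for path in sorted(glob.glob(f"{TABLE_DIR}/processing_table_daily-20*.csv")):
    with open(path) as fx:
        for line in fx:
            fields = line.rstrip("\n").split(",")
            if len(fields) > 14 and fields[11] in ("tilenight", "cumulative"):
                tileid, night, qid, status = fields[2], fields[3], fields[12], fields[14]
                if (status not in ("COMPLETED", "UNSUBMITTED")
                        and qid.isdigit() and sacct_state(qid) is None):
                    print(f"tile {tileid} night {night} {fields[11]} QID={qid} "
                          f"table STATUS={status} -- unknown to sacct")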