[bug] Uncaught exceptions lead to unclear job failures #107

Open
abought opened this issue Jun 23, 2023 · 3 comments
Labels
bug Something isn't working

Comments

abought (Collaborator) commented Jun 23, 2023

Summary

Certain kinds of QC errors are not handled in the code, and lead to mysterious job failures that a user cannot diagnose or fix. This is one of our most frequent helpdesk inquiries.

Actual behavior

Non-admin users cannot see the full job logs tab, so they cannot inspect the stack trace to read the error description. The information they are actually shown is rather opaque.
[Screenshot, 2023-06-23 3:29:57 PM]

[Screenshot, 2023-06-23 3:43:05 PM]

Common scenarios

  • htsjdk does not support VCF4.3, and files in this format fail to parse.

Task 'Calculating QC Statistics' failed.
Exception:java.lang.IllegalArgumentException: Writing VCF version VCF4_3 is not implemented
at htsjdk.variant.variantcontext.writer.VCFWriter.rejectVCFV43Headers(VCFWriter.java:275)

  • It appears that certain VCF fields are required. This isn't captured in the data preparation docs, and some users have triggered an error they cannot see.

Task 'Calculating QC Statistics' failed.
Exception:java.io.IOException: /mnt/jobs/job-20230623-145718-031/input/files/split.chr1.vcf.gz: Line 7812: No GT field found in FORMAT column.
at genepi.io.text.AbstractLineReader.next(AbstractLineReader.java:46)

  • In the newest Minimac 4.1.x series, Minimac stops the job if too many allele swaps are detected. QC does not check for this, and the error appears only in Minimac's stdout, which is not captured by the admin- or user-level job logs.

Imputing chrREDACTED:x-y ...
Loading target haplotypes ...
Loading target haplotypes took 0 seconds
Loading reference haplotypes ...
Loading reference haplotypes took 22 seconds
Typed sites to imputed sites ratio: 0 (0/redacted)
Error: not enough target variants are available to impute this chunk. The --min-ratio, --chunk, or --region options may need to be altered.
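The first two scenarios above could in principle be caught before the job reaches the parser. The following is a minimal sketch in Python (chosen for brevity; the stack traces suggest the server itself is Java), not the project's actual QC code. The function names `open_vcf` and `preflight_problems` and the message wording are hypothetical. It relies only on two facts from the VCF format: the version is declared on a `##fileformat=` header line, and FORMAT is the ninth tab-separated column of each data line.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic bytes (bgzip output is gzip-compatible)
        return gzip.open(path, "rt")
    return open(path, "rt")


def preflight_problems(lines):
    """Return user-readable problems found in an iterable of VCF lines.

    Hypothetical helper: checks only the declared version and the GT key.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##fileformat="):
            version = line.split("=", 1)[1]
            if version == "VCFv4.3":
                problems.append(
                    "File declares VCFv4.3, which the parser used by this "
                    "pipeline does not support."
                )
        elif not line or line.startswith("#"):
            continue  # other header lines and blank lines are ignored here
        else:
            fields = line.split("\t")
            # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samples...
            if len(fields) < 9 or "GT" not in fields[8].split(":"):
                problems.append(
                    f"Line {number}: no GT field found in FORMAT column."
                )
                break  # one offending line is enough to report
    return problems
```

A caller would run something like `with open_vcf(path) as handle: problems = preflight_problems(handle)` and show any returned strings in the user-visible part of the job report. The Minimac case is different in kind, since the message only exists in the tool's stdout.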

Expected behavior

  • Document required VCF fields such as GT in the data preparation docs. (if appropriate)
  • Handle the two exception cases noted above, and provide helpful messages that will appear in the part of the job report visible to regular users
  • Provide a fallback message for any other unhandled error types, indicating that a user should reach out to the helpdesk.
  • Consider adding some sort of logging event for unhandled error cases that stop the QC flow, so that developers can identify future edge cases that might be confusing.
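As a rough illustration of the handling, fallback, and logging items above, the sketch below maps raw error text to a user-facing message. It is written in Python for brevity and is not the project's code: `KNOWN_ERRORS`, `FALLBACK`, `user_message`, the logger name, and all message wording are hypothetical placeholders. The patterns are copied from the log excerpts quoted in this issue.

```python
import logging
import re

logger = logging.getLogger("qc")  # hypothetical logger name

# (pattern, user-facing message) pairs. Patterns come from the log excerpts
# in this issue; the user-facing wording is placeholder text.
KNOWN_ERRORS = [
    (
        re.compile(r"Writing VCF version VCF4_3 is not implemented"),
        "Your file uses VCF version 4.3, which is not supported.",
    ),
    (
        re.compile(r"No GT field found in FORMAT column"),
        "Your file is missing the GT (genotype) field in the FORMAT column.",
    ),
    (
        re.compile(r"not enough target variants are available to impute this chunk"),
        "Not enough of your variants were available to impute at least one chunk.",
    ),
]

FALLBACK = (
    "The job failed for an unexpected reason. "
    "Please contact the helpdesk and include your job ID."
)


def user_message(raw_error):
    """Translate raw error text into something a regular user can act on."""
    for pattern, message in KNOWN_ERRORS:
        if pattern.search(raw_error):
            return message
    # Unknown failure: record it so developers can spot new edge cases.
    logger.warning("Unhandled QC failure: %s", raw_error)
    return FALLBACK
```

For example, `user_message("9")` falls through to `FALLBACK` and leaves a warning in the developer-facing log, while the raw text of either exception above resolves to its specific message.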

abought changed the title from "Uncaught exceptions lead to unclear job failures" to "[bug] Uncaught exceptions lead to unclear job failures" on Jul 13, 2023

abought (Collaborator, Author) commented Jul 17, 2023

One more for the list. I have no idea what is causing this one, since failed files are deleted from the server. The entire error message is quite literally just "9". This error is less common, but very confusing.

User is shown:

Input Validation
9 (see Help).

Admin logs are not much more helpful:

23/07/17 15:39:28 Input Validation [ERROR]
23/07/17 15:39:28 Executing onFailure...

Note that the message links to a page that does not seem to exist in TIS.
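The cause of the bare "9" is not established in this thread. One possibility, offered purely as a guess, is that only the exception's message string reaches the report while the exception type is dropped. The Python sketch below shows how that can produce an equally uninformative message, and how including the type keeps it interpretable. The `describe` helper is hypothetical and the lookup failure is contrived for illustration.

```python
def describe(exc):
    """Format an exception with its type so a terse message stays readable."""
    return f"{type(exc).__name__}: {exc}"


try:
    {}[9]  # contrived failure: look up a key that does not exist
except KeyError as exc:
    message_only = str(exc)    # just the key, with no context
    with_type = describe(exc)  # type name plus message
```

Here `message_only` is `"9"` and `with_type` is `"KeyError: 9"`. If something similar is happening in the validation step, reporting the exception class alongside the message in the admin logs would at least narrow down where to look.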

abought (Collaborator, Author) commented Mar 16, 2024

A user recently submitted haploid data, which caused Eagle to fail during the phasing step. There was no user-facing or system-facing error; we were only able to diagnose this by digging into the raw Hadoop logs. (I believe it was in chr22.)

The user suggests a future improvement of checking for this during QC, and/or providing error information from the phasing step.

The eagle error was: ERROR: target genotypes contain haploid sample
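A QC-time check along the lines the user suggests might look like the sketch below. It is written in Python for brevity and is an approximation only: it assumes a haploid call is one whose GT value contains neither `/` nor `|`, which may not match Eagle's exact criterion. The function name `haploid_samples` is hypothetical.

```python
def haploid_samples(sample_names, data_line):
    """Return names of samples whose GT on this VCF data line looks haploid.

    Assumption: a GT with no '/' or '|' separator (e.g. "0", "1", ".")
    is treated as haploid. Eagle's own test may differ.
    """
    fields = data_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")  # FORMAT is the ninth column
    if "GT" not in format_keys:
        return []  # a missing GT field is a separate problem
    gt_index = format_keys.index("GT")
    haploid = []
    for name, sample in zip(sample_names, fields[9:]):
        parts = sample.split(":")
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        if "/" not in genotype and "|" not in genotype:
            haploid.append(name)
    return haploid
```

For instance, a line with sample columns `0|1`, `1`, `0/0` for samples `S1`, `S2`, `S3` would flag only `S2`. Reporting the first offending sample and line during QC would give the user something concrete to fix before phasing starts.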

abought added the bug label on Apr 22, 2024

abought (Collaborator, Author) commented Sep 3, 2024

A user's job failed during the QC step with the following error in the private, admin-only logs:

Quitting from lines 5-16 (qc-report.Rmd)
Error in read.table(input, header = FALSE, sep = "\t") :
no lines available in input
Calls: ... withCallingHandlers -> withVisible -> eval -> eval -> read.table
