v0.3.1 breaks pipelines #96

Open
matthdsm opened this issue Dec 9, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@matthdsm
Collaborator

matthdsm commented Dec 9, 2024

nf-nomad: v0.3.1
nextflow: 24.10.1

Dec-09 10:01:58.288 [main] DEBUG nextflow.Session - Session await > all processes finished
Dec-09 10:01:58.335 [Task monitor] DEBUG n.nomad.executor.NomadService - Task nf-b28b726027e5a21c98978aee74e53c0b-SAMTOOLS_INDEX_CFD240379 , state=null
Dec-09 10:01:58.348 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] Check jobState: jobName=nf-b28b726027e5a21c98978aee74e53c0b-SAMTOOLS_INDEX_CFD240379 currentState=null newState=unknown
Dec-09 10:01:58.353 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfCompleted task.name=SAMTOOLS_INDEX (CFD2403794); state=unknown completed=true
Dec-09 10:01:58.364 [Task monitor] WARN  n.nomad.executor.NomadTaskHandler - [NOMAD] Cannot read exit status for task: `SAMTOOLS_INDEX (CFD2403794)` | /scratch/b2/8b726027e5a21c98978aee74e53c0b/.exitcode
Dec-09 10:01:58.370 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] purgeJob with jobId=nf-b28b726027e5a21c98978aee74e53c0b-SAMTOOLS_INDEX_CFD240379

It seems like something goes wrong with the job polling and/or the reading of the `.exitcode` files.
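For context on the warning in the log: Nextflow's task wrapper records the command's exit status in a `.exitcode` file inside the task work directory, and the handler reads that file once it considers the job finished. Below is a minimal Python sketch of a defensive read. It is not the nf-nomad code (the plugin is written in Groovy); `read_exit_status` is a hypothetical helper, and it assumes the file holds a single integer.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_exit_status(work_dir: Path) -> Optional[int]:
    """Return the integer stored in <work_dir>/.exitcode, or None if unreadable."""
    exit_file = work_dir / ".exitcode"
    try:
        text = exit_file.read_text().strip()
    except OSError:
        # File is missing or not readable (e.g. the task never started).
        return None
    try:
        return int(text)
    except ValueError:
        # File exists but is empty or only partially written.
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        print(read_exit_status(work_dir))   # None: no .exitcode yet
        (work_dir / ".exitcode").write_text("0\n")
        print(read_exit_status(work_dir))   # 0
```

A `None` result here corresponds to the "Cannot read exit status" situation: the job is treated as completed, but no exit file exists in the work directory.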

@abhi18av
Member

abhi18av commented Dec 10, 2024

Mmm, this is curious 🤔

I checked #91 (comment) with a FusionFS-based setup, and my guess is that this only happens with file-system-based work directories.

@matthdsm, just for good measure, could you please run the nf-core/demo pipeline and share the full log here?

abhi18av added the bug label Dec 10, 2024
@abhi18av
Member

Okay, I resurrected the tower-nf branch from #91 and saw a mixture of successes and failures:

Monitor the execution with Seqera Platform using this URL: https://cloud.seqera.io/orgs/ABHINAVSHARMA-ORG/workspaces/nf-nomad-dev/watch/149NDFi6SHfeLB
executor >  nomad (fusion enabled) (7)
[52/bbf564] process > NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE)     [100%] 3 of 3 ✔
[96/201cff] process > NFCORE_DEMO:DEMO:SEQTK_TRIM (SAMPLE1_PE) [100%] 3 of 3 ✔
[c5/389881] process > NFCORE_DEMO:DEMO:MULTIQC                 [100%] 1 of 1, failed: 1 ✘
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/demo] Pipeline completed with errors-
WARN: [NOMAD] Cannot read exit status for task: `NFCORE_DEMO:DEMO:MULTIQC` | /fusionfs/integration-test/work/c5/3898817404eb2edcca42616c3b4776/.exitcode
ERROR ~ Error executing process > 'NFCORE_DEMO:DEMO:MULTIQC'

Caused by:
  nextflow.exception.ProcessUnrecoverableException


Command executed:

  multiqc \
      --force \
       \
      --config multiqc_config.yml \
       \
       \
      .
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_DEMO:DEMO:MULTIQC":
      multiqc: $( multiqc --version | sed -e "s/multiqc, version //g" )
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Work dir:
  s3://fusionfs/integration-test/work/c5/3898817404eb2edcca42616c3b4776

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

(base) abhi@macbookpro19 validation % 


The logs show something similar:


Dec-10 16:15:48.493 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 7; name: NFCORE_DEMO:DEMO:MULTIQC; status: COMPLETED; exit: -; error: nextflow.exception.ProcessUnrecoverableException; workDir: s3://fusionfs/integration-test/work/c5/3898817404eb2edcca42616c3b4776]
Dec-10 16:15:48.497 [TaskFinalizer-7] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_DEMO:DEMO:MULTIQC; work-dir=s3://fusionfs/integration-test/work/c5/3898817404eb2edcca42616c3b4776
  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_DEMO:DEMO:MULTIQC` failed
Dec-10 16:15:48.599 [TaskFinalizer-7] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /var/folders/fs/j6rkx5910yj7ls6mcpq_jcr80000gn/T/temp-s3-10611038858962295570/.command.out
Dec-10 16:15:48.678 [TaskFinalizer-7] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /var/folders/fs/j6rkx5910yj7ls6mcpq_jcr80000gn/T/temp-s3-824482206444636747/.command.err
Dec-10 16:15:48.681 [TaskFinalizer-7] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_DEMO:DEMO:MULTIQC'

Caused by:
  nextflow.exception.ProcessUnrecoverableException



@abhi18av
Member

abhi18av commented Dec 10, 2024

@matthdsm , could you please try out https://github.com/nextflow-io/nf-nomad/releases/tag/0.3.2-edge1 and let us know if this is working as expected?

@matthdsm
Collaborator Author

Hi Abhi,

Here's the log with v0.3.1
nf-demo.log

@matthdsm
Collaborator Author

And the edge version didn't fix it...
nf-demo-nomad0.3.2-edge1.log

@abhi18av
Member

Thanks @matthdsm. From the logs, I think the status check isn't quite right, since state=null shouldn't occur. I have tried to address this in #98.
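One way to read that concern, sketched in Python: a missing state is treated as "not known yet" and polled again, instead of being counted as completed. This is an illustration only. It is not what #98 does and not the plugin's code. `Phase` and `classify` are hypothetical names; `pending`, `dead` and `unknown` are the state strings seen in the logs in this thread, and `running` is assumed from Nomad's job status values.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    WAITING = "waiting"    # keep polling
    RUNNING = "running"
    FINISHED = "finished"  # safe to look for .exitcode


def classify(state: Optional[str]) -> Phase:
    """Map a raw state string (or None) to a polling decision."""
    if state is None or state == "unknown":
        # No usable answer from the scheduler yet: not a completion signal.
        return Phase.WAITING
    if state == "pending":
        return Phase.WAITING
    if state == "running":
        return Phase.RUNNING
    if state == "dead":
        return Phase.FINISHED
    # Unrecognised values are handled conservatively (an assumption).
    return Phase.WAITING


if __name__ == "__main__":
    for raw in (None, "unknown", "pending", "running", "dead"):
        print(raw, "->", classify(raw).name)
```

A real handler would presumably also need a bound on how long a job may stay in an unknown state, so that a job which never registers does not block the run indefinitely.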

Dec-10 19:06:57.073 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfRunning task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE3_SE); state=pending
Dec-10 19:06:57.151 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] determineClientNode: jobName:nf-c6ccad64405cdf14121842209c1c91e1-NFCORE_DEMO_DEMO_FASTQC_; clientName:compute-87hv7j2
Dec-10 19:06:57.158 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfCompleted task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE3_SE); state=pending
Dec-10 19:06:57.189 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] getTaskStatus nf-038c0c093855c52c0929c9070c38017f-NFCORE_DEMO_DEMO_SEQTK_T , state=null
Dec-10 19:06:57.199 [TaskFinalizer-1] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE)' -- Cause: java.nio.file.NoSuchFileException: /scratch/34/835ee334986da9230b3b079924328a/.command.out
Dec-10 19:06:57.200 [TaskFinalizer-1] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE)' -- Cause: java.nio.file.NoSuchFileException: /scratch/34/835ee334986da9230b3b079924328a/.command.err
Dec-10 19:06:57.201 [TaskFinalizer-1] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE)' -- Cause: java.nio.file.NoSuchFileException: /scratch/34/835ee334986da9230b3b079924328a/.command.log
Dec-10 19:06:57.205 [TaskFinalizer-1] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE)'

Caused by:
  nextflow.exception.ProcessUnrecoverableException


Command executed:

  printf "%s %s\n" sample1_R1.fastq.gz SAMPLE1_PE_1.gz sample1_R2.fastq.gz SAMPLE1_PE_2.gz | while read old_name new_name; do
      [ -f "${new_name}" ] || ln -s $old_name $new_name
  done
  
  fastqc \
      --quiet \
      --threads 6 \
      --memory 6144 \
      SAMPLE1_PE_1.gz SAMPLE1_PE_2.gz
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_DEMO:DEMO:FASTQC":
      fastqc: $( fastqc --version | sed '/FastQC v/!d; s/.*v//' )
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /scratch/34/835ee334986da9230b3b079924328a

Container:
  quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
D

@jagedn
Collaborator

jagedn commented Dec 12, 2024

I can see that SAMPLE1_PE is "dead" as soon as it is created:

Dec-10 19:06:57.042 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfRunning task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE); state=dead
Dec-10 19:06:57.042 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfCompleted task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE); state=dead
Dec-10 19:06:57.043 [Task monitor] WARN  n.nomad.executor.NomadTaskHandler - [NOMAD] Cannot read exit status for task: `NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE)` | /scratch/34/835ee334986da9230b3b079924328a/.exitcode

Any clue or idea why?
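To check how often this happens across a full log, the first state reported for each task can be pulled out of the `checkIfRunning` / `checkIfCompleted` DEBUG lines. Below is a rough Python sketch, assuming the line format shown in this thread (both the `task=` and the older `task.name=` variants). `first_seen_states` and `dead_on_first_sight` are hypothetical helper names. The sample lines are copied from the excerpts in this thread, so "first seen" only means first within those excerpts, not within the full run.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches e.g. "... [NOMAD] checkIfRunning task=NAME; state=dead"
LINE_RE = re.compile(
    r"^(\S+ \S+) .*\[NOMAD\] (?:checkIfRunning|checkIfCompleted) "
    r"task(?:\.name)?=(.+?); state=(\w+)"
)

SAMPLE_LOG = """
Dec-10 19:06:57.073 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfRunning task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE3_SE); state=pending
Dec-10 19:06:57.158 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfCompleted task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE3_SE); state=pending
Dec-10 19:06:57.042 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfRunning task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE); state=dead
Dec-10 19:06:57.042 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] checkIfCompleted task=NFCORE_DEMO:DEMO:FASTQC (SAMPLE1_PE); state=dead
"""


def first_seen_states(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Return {task name: (timestamp, state)} for the first matching line per task."""
    seen: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        timestamp, task, state = match.groups()
        seen.setdefault(task, (timestamp, state))  # keep only the first hit
    return seen


def dead_on_first_sight(seen: Dict[str, Tuple[str, str]]) -> List[str]:
    """List tasks whose first observed state was already 'dead'."""
    return [task for task, (_, state) in seen.items() if state == "dead"]


if __name__ == "__main__":
    states = first_seen_states(SAMPLE_LOG.splitlines())
    for task in dead_on_first_sight(states):
        print(task, states[task])
```

Running it against the complete `.nextflow.log` instead of `SAMPLE_LOG` would show whether every failing task is reported as dead on its first poll or only some of them.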


3 participants