WorkflowError - missing file #73

Open
bchesnut opened this issue Sep 14, 2024 · 5 comments

@bchesnut

I am getting the following error while running the test script for CADD 1.7:

$ ./CADD.sh -a -g GRCh38 -o ~/tmp/cadd/output_inclAnno_GRCh37.tsv.gz ./test/input.vcf
CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charité - Universitätsmedizin Berlin 2013-2023. All rights reserved.
Running snakemake pipeline:
snakemake /tmp/tmp.Zmoud0EEpB/input.tsv.gz --use-conda --conda-prefix /data/analysis/src/CADD-scripts-1.7/envs/conda --cores 1
--configfile /data/analysis/src/CADD-scripts-1.7/config/config_GRCh38_v1.7.yml --snakefile /data/analysis/src/CADD-scripts-1.7/Snakefile -q
host: vlp-dmpianal06.dhe.duke.edu
WorkflowError in rule join in file /data/analysis/src/CADD-scripts-1.7/Snakefile, line 303:
Failed to open input file: /tmp/tmp.Zmoud0EEpB/input.anno.novel.vcf. Has it been deleted by another process? (rule join, line 612, /data/analysis/src/CADD-scripts-1.7/Snakefile)

Verbose output using -p is attached: cadd-output.txt

I'm running Red Hat Enterprise Linux 9 and Miniforge3 conda with Snakemake 8.20.3.

Thank you in advance for suggestions.

@visze
Collaborator

visze commented Sep 16, 2024

Which CADD-scripts version are you using? If it is 1.7, please use only Snakemake 7.x. With CADD-scripts 1.7.1, your Snakemake version should be fine.

(From your command I am pretty sure you are using 1.7 and not 1.7.1, so please switch to the latest CADD-scripts version. Simply updating the repository should be enough; no other data is needed.)

Do you run Snakemake locally or in a cluster environment? Some environments have trouble running it in /tmp.

Alternatively, you could skip the CADD.sh script to avoid the /tmp directory.

In that case you have to modify this command yourself:

snakemake /tmp/tmp.Zmoud0EEpB/input.tsv.gz --use-conda --conda-prefix /data/analysis/src/CADD-scripts-1.7/envs/conda --cores 1
--configfile /data/analysis/src/CADD-scripts-1.7/config/config_GRCh38_v1.7.yml --snakefile /data/analysis/src/CADD-scripts-1.7/Snakefile -q
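
A minimal sketch of that manual route, under two assumptions that the thread does not confirm: the flags stay exactly as in the v1.7 command above, and Snakemake expects the input VCF as `input.vcf` in the same directory as the requested `input.tsv.gz`. `CADD`, `WORK_BASE`, `make_workdir`, and `build_command` are hypothetical names. The script only prints the command so that it can be reviewed before running.

```shell
# Sketch: build the CADD-scripts v1.7 Snakemake call for a work directory
# outside /tmp. All names here are hypothetical; adjust paths to your setup.

CADD="${CADD:-/data/analysis/src/CADD-scripts-1.7}"   # repository path as seen in the log
WORK_BASE="${WORK_BASE:-$HOME/caddtmp}"               # assumed scratch location outside /tmp

# Create a private directory under the given base and print its path.
make_workdir() {
    mkdir -p "$1" && mktemp -d "$1/tmp.XXXXXXXXXX"
}

# Print the snakemake call for a given work directory (same flags as above).
build_command() {
    printf 'snakemake %s/input.tsv.gz --use-conda --conda-prefix %s/envs/conda --cores 1 --configfile %s/config/config_GRCh38_v1.7.yml --snakefile %s/Snakefile -q\n' \
        "$1" "$CADD" "$CADD" "$CADD"
}

# Preview only: nothing is created or executed here.
build_command "$WORK_BASE/tmp.EXAMPLE"
```

For a real run, one would create the directory with `workdir=$(make_workdir "$WORK_BASE")`, copy the VCF to `$workdir/input.vcf` (an assumption, see above), and then execute the output of `build_command "$workdir"`.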

@bchesnut
Author

@visze Thank you for the comments. I am running CADD 1.7 and following the README.md directions per https://github.com/kircherlab/CADD-scripts, which specify using Snakemake version 8.

I am running the CADD.sh script. I tried setting TMPDIR=~/caddtmp to avoid using /tmp, but I get a similar missing-file error.

I tried CADD 1.7.1 with different/worse results. README.md mentions using apptainer/singularity, but gives no specifics.

@visze
Collaborator

visze commented Sep 16, 2024

I am very sure you are using CADD-scripts v1.7.

For CADD-scripts v1.7.1 your command should look like:

CADD-scripts/CADD.sh

Lines 148 to 151 in 77df69a

command="snakemake $TMP_OUTFILE \
--sdm conda $SIGNULARITYARGS --conda-prefix $CADD/envs/conda \
--cores $CORES --configfile $CONFIG \
--snakefile $CADD/Snakefile $VERBOSE"

But yours looks like the v1.7 one:

CADD-scripts/CADD.sh

Lines 126 to 127 in 203ee3b

snakemake $TMP_OUTFILE --use-conda --conda-prefix $CADD/envs/conda --cores $CORES \
--configfile $CONFIG --snakefile $CADD/Snakefile $VERBOSE

CADD-scripts v1.7 requires Snakemake 7.x, which is stated in its README:

CADD-scripts/README.md

Lines 56 to 59 in 203ee3b

- snakemake 7.X (installed via mamba)
```bash
mamba install -c conda-forge -c bioconda 'snakemake=7'
```

You are referring to the README of the latest release (CADD-scripts v1.7.1), which is correct: there, Snakemake 8.x should be used. Apptainer only works with CADD-scripts v1.7.1, where it is the default in CADD.sh. To disable it, use the -m option.

Can you please show me the output behind "I tried CADD 1.7.1 with different/worse results"?
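
The pairing described in this comment can be sketched as a small check. It relies only on what the thread states: CADD-scripts v1.7 goes with Snakemake 7.x and prints a command containing `--use-conda`, while v1.7.1 goes with Snakemake 8.x and prints `--sdm conda`. The function names are hypothetical and not part of CADD-scripts.

```shell
# Sketch: check whether a Snakemake version matches a CADD-scripts version.
# All function names are hypothetical helpers.

# Map a CADD-scripts version to the Snakemake major version named in this thread.
required_snakemake_major() {
    case "$1" in
        1.7)   echo 7 ;;
        1.7.1) echo 8 ;;
        *)     echo unknown ;;
    esac
}

# Guess the CADD-scripts version from the command line that CADD.sh prints.
guess_cadd_scripts_version() {
    case "$1" in
        *"--sdm conda"*) echo 1.7.1 ;;
        *"--use-conda"*) echo 1.7 ;;
        *)               echo unknown ;;
    esac
}

# Compare the expected major version with the installed one, e.g. "8.20.3" -> "8".
check_pairing() {
    want=$(required_snakemake_major "$1")
    have="${2%%.*}"
    if [ "$want" = "$have" ]; then
        echo "ok"
    else
        echo "mismatch: CADD-scripts $1 expects Snakemake $want.x, found $2"
    fi
}

# The combination from the original report.
check_pairing 1.7 8.20.3
# prints: mismatch: CADD-scripts 1.7 expects Snakemake 7.x, found 8.20.3
```

In practice the second argument could come from `snakemake --version`.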

@bchesnut
Author

Installed CADD 1.7.1 in /data/workspace/bchesnut/CADD-scripts-1.7.1

Linked /data/workspace/bchesnut/CADD-scripts-1.7.1/data to data location:

$ cd /data/workspace/bchesnut/CADD-scripts-1.7.1
$ rm -rf data
$ ln -s /dmpi/analysis/analysis_data/CADD data

Set some environment variables:

$ export TMPDIR=/data/workspace/bchesnut/tmp
$ export APPTAINER_CACHEDIR=/data/workspace/bchesnut/apptainer

Ran ./install.sh

Ran ./CADD.sh -p -a -g GRCh38 -o ./output_inclAnno_GRCh38.tsv.gz ./test/input.vcf

$ ./CADD.sh -p -a -g GRCh38 -o ./output_inclAnno_GRCh38.tsv.gz ./test/input.vcf
CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charite - Universitatsmedizin Berlin 2013-2024. All rights reserved.
Running snakemake pipeline:
snakemake /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.tsv.gz --sdm conda apptainer --apptainer-prefix /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/apptainer --singularity-args "--bind /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS " --conda-prefix /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/conda --cores 1 --configfile /data/workspace/bchesnut/CADD-scripts-1.7.1/config/config_GRCh38_v1.7.yml --snakefile /data/workspace/bchesnut/CADD-scripts-1.7.1/Snakefile -p
Assuming unrestricted shared filesystem usage.
host: vlp-dmpianal06.dhe.duke.edu
Building DAG of jobs...
Pulling singularity image docker://visze/cadd-scripts-v1_7:0.1.0.
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job         count
--------  -------
join            1
prepare         1
prescore        1
total           3

Select jobs to execute...
Execute 1 jobs...

[Mon Sep 16 13:38:24 2024]
localrule prepare:
    input: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.vcf
    output: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf
    log: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepare.log
    jobid: 2
    reason: Missing output files: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf
    wildcards: file=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input
    resources: tmpdir=/data/workspace/bchesnut/tmp


        cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.vcf         | python /data/workspace/bchesnut/CADD-scripts-1.7.1/src/scripts/VCF2vepVCF.py         | sort -k1,1 -k2,2n -k4,4 -k5,5         | uniq > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf 2> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepare.log

Activating singularity image /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/apptainer/cbbe741652f49b1cd0ee6ebf25427cc2.simg
Activating conda environment: ../../../../conda-envs/a4fcaaffb623ea8aef412c66280bd623
[Mon Sep 16 13:38:28 2024]
Finished job 2.
1 of 3 steps (33%) done
Select jobs to execute...
Execute 1 jobs...

[Mon Sep 16 13:38:29 2024]
localcheckpoint prescore:
    input: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf, /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno
    output: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf, /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
    log: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log
    jobid: 1
    reason: Missing output files: <TBD>; Input files updated by another job: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf
    wildcards: file=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input
    resources: tmpdir=/data/workspace/bchesnut/tmp
DAG of jobs will be updated after completion.


        # Prescoring
        echo '## Prescored variant file' > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv 2> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
        PRESCORED_FILES=`find -L /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno -maxdepth 1 -type f -name \*.tsv.gz | wc -l`
        cp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new
        if [ ${PRESCORED_FILES} -gt 0 ];
        then
            for PRESCORED in $(ls /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno/*.tsv.gz)
            do
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new                 | python /data/workspace/bchesnut/CADD-scripts-1.7.1/src/scripts/extract_scored.py --header                     -p $PRESCORED --found_out=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp                 > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp 2>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp >> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
                mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new &> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
            done;
            rm /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log
        fi
        mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log

Activating singularity image /data/workspace/bchesnut/CADD-scripts-1.7.1/envs/apptainer/cbbe741652f49b1cd0ee6ebf25427cc2.simg
Activating conda environment: ../../../../conda-envs/a4fcaaffb623ea8aef412c66280bd623
find: ‘/data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno’: No such file or directory
[Mon Sep 16 13:38:29 2024]
Error in rule prescore:
    jobid: 1
    input: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf, /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno
    output: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf, /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
    log: /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log (check log file(s) for error details)
    conda-env: /conda-envs/a4fcaaffb623ea8aef412c66280bd623
    shell:

        # Prescoring
        echo '## Prescored variant file' > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv 2> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
        PRESCORED_FILES=`find -L /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno -maxdepth 1 -type f -name \*.tsv.gz | wc -l`
        cp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new
        if [ ${PRESCORED_FILES} -gt 0 ];
        then
            for PRESCORED in $(ls /data/workspace/bchesnut/CADD-scripts-1.7.1/data/prescored/GRCh38_v1.7/incl_anno/*.tsv.gz)
            do
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new                 | python /data/workspace/bchesnut/CADD-scripts-1.7.1/src/scripts/extract_scored.py --header                     -p $PRESCORED --found_out=/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp                 > /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp 2>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
                cat /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp >> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
                mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.tmp /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new &> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log;
            done;
            rm /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv.tmp &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log
        fi
        mv /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prepared.vcf.new /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.novel.vcf &>> /data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.prescore.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job prescore since they might be corrupted:
/data/workspace/bchesnut/tmp/tmp.ZwC6gGaiGS/input.pre.tsv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-09-16T132855.935867.snakemake.log
WorkflowError:
At least one job did not complete successfully.

@visze
Collaborator

visze commented Sep 16, 2024

OK, I see two things. First, can you bgzip your input file to input.vcf.gz? I am not sure it changes anything, though.

Second, and this can be a troublemaker too: paths have to be set correctly for Apptainer images (the --bind option of the apptainer command). You have to bind quite a few of them explicitly, e.g. the tmp folder. Otherwise the tmp directory inside the Singularity image is used; it gets wiped, and the files are no longer there the next time the image is loaded.

I tested it on my end and it worked, but you never know on other systems...

So maybe my first recommendation is to use only mamba for now (the -m flag of the CADD.sh script).
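
One way to see which host paths a container would need is to resolve every path the run touches, including symlink targets: a link such as `data` that points outside the repository dangles inside the container unless its target is bound as well. The following is a minimal sketch, assuming GNU `readlink -f` is available; `bind_list` is a hypothetical helper, and whether a missing bind is the actual cause of the `find: ... No such file or directory` error above is not confirmed in this thread.

```shell
# Sketch: collect the real host paths that a containerized run would need to see.
# bind_list is a hypothetical helper, not part of CADD-scripts.

# Print a comma-separated list of resolved, existing, de-duplicated paths.
bind_list() {
    out=""
    for p in "$@"; do
        # Resolve symlinks; skip paths that cannot be resolved or do not exist.
        real=$(readlink -f "$p") || continue
        [ -e "$real" ] || continue
        # Skip entries that are already in the list.
        case ",$out," in
            *",$real,"*) continue ;;
        esac
        out="${out:+$out,}$real"
    done
    echo "$out"
}

# Demo with the current directory only. On a real run you would pass, for
# example, the CADD-scripts data directory, the temporary directory, and
# the output directory.
echo "--bind $(bind_list "$PWD")"
```

The command printed by CADD.sh earlier in this thread passes such a value to Snakemake through `--singularity-args "--bind ..."`; Apptainer accepts several bind paths separated by commas. How to hand extra binds to CADD.sh itself is not covered here.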
