--ref-fasta is missing or empty #200

Open
cegg opened this issue Oct 25, 2018 · 14 comments
@cegg

cegg commented Oct 25, 2018

Hello. I am on Mac OS X El Capitan 10.11.6. After installing the prerequisites (VEP and samtools) and vcf2maf, I got to the test step in the manual and hit this error:

$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf
ERROR: Provided --ref-fasta is missing or empty: /my_user_dir/vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

Also, the same error happens when the command is executed from a Docker container built from the cloned GitHub source.

@ckandoth
Collaborator

Does that file exist? /my_user_dir/vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

If not, then it was likely not fully downloaded during installation of VEP. If you prefer using your own reference FASTA, then point vcf2maf to it using argument --ref-fasta.
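
A quick way to answer that question from the shell is `test -s`, which succeeds only if the file exists and is larger than zero bytes. This is a minimal sketch: the helper name `check_ref_fasta` is made up for illustration and is not part of vcf2maf, and the path is the one from the error message above.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcf2maf): report whether a file
# exists and is non-empty, mirroring the "missing or empty" wording.
check_ref_fasta() {
    if [ -s "$1" ]; then
        echo "OK: $1"
    else
        echo "missing or empty: $1"
        return 1
    fi
}

# Path copied from the error message; adjust it to your own setup.
check_ref_fasta "/my_user_dir/vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz" || true
```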

@cegg
Author

cegg commented Oct 26, 2018

After installing VEP (https://www.ensembl.org/info/docs/tools/vep/script/vep_download.html) there is no /my_user_dir/vep directory. There is, however, /my_user_dir/ensemble-vep, which contains the file ensembl-vep/examples/homo_sapiens_GRCh37.vcf (along with other samples).

Trying to use the path to that file with --ref-fasta results in a slightly different error:

$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --ref-fasta= /my_user_dir/ensembl-vep/examples/homo_sapiens_GRCh37.vcf
[E::fai_build_core] Format error, unexpected "#" at line 1
[faidx] Could not load fai index of /my_user_dir/ensembl-vep/examples/homo_sapiens_GRCh37.vcf
ERROR: Provided --filter-vcf is missing or empty: /my_user_dir/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz

Probably because it is looking for a zip or something different altogether. The install procedure of VEP has probably evolved. It might also be related to the fact that I had to bypass Bio::DB::HTS during the VEP install (it does not work; here is the trick to get around it: https://www.biostars.org/p/182780/).
Unfortunately I do not have my own references. I will try to check with the people my .vcf files come from, or learn how to get them from VEP...
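
The `unexpected "#" at line 1` message is consistent with a VCF being passed where a FASTA is expected: FASTA records begin with a `>` header line, while VCF headers begin with `#`. As a rough sketch, the check below looks at the first byte of an uncompressed file. `looks_like_fasta` is a hypothetical helper name, and the test is only a sanity check, not a format validator.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first byte of an uncompressed
# file is ">", which is how a FASTA header line starts.
looks_like_fasta() {
    [ "$(head -c 1 "$1")" = ">" ]
}

# Example with the file tried above; prints a hint either way.
f="/my_user_dir/ensembl-vep/examples/homo_sapiens_GRCh37.vcf"
if [ -f "$f" ] && looks_like_fasta "$f"; then
    echo "$f starts like a FASTA file"
else
    echo "$f is missing or does not start with '>'"
fi
```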

@cegg
Author

cegg commented Oct 26, 2018

UPDATE: I downloaded the two files the script complained about from the Broad Institute website and placed them in the locations where the script looks for them:
/my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
/my_user_dir/.vep/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz

The error "--ref-fasta is missing" is gone now, but new ones popped up:

$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf
[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip
[faidx] Could not load fai index of /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
Could not load .tbi/.csi index of /my_user_dir/.vep/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz
WARNING: Couldn't retrieve bps around 1:11290178 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 1:15557977 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 1:146728217 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 2:220439700 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 3:52437426 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 3:52437701 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 3:52443788 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 3:178928219 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 3:178936091 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 5:1295228 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 5:1295250 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 5:56177848 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 5:112174757 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 5:112174757 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 5:112174757 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 5:112174757 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 6:41903782 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 7:116412043 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 13:28608242 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
WARNING: Couldn't retrieve bps around 17:7579312 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
STATUS: Running VEP and writing to: tests/test.vep.vcf
ERROR: Cannot find VEP script in path: /my_user_dir/vep
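
The first message in this log comes from the FASTA indexer, which can index BGZF (bgzip) compressed files but not plain gzip. Both formats use the `.gz` extension and the same magic bytes, so the file name alone does not tell them apart. The sketch below inspects the header instead: in the BGZF layout described in the SAM specification, a block starts with `1f 8b 08 04` and carries a `BC` extra subfield, which bgzip writes at byte offset 12. `is_bgzf` is a hypothetical helper name, and the check assumes `BC` is the first extra subfield.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file starts with a BGZF block header.
# Bytes 0-3 must be 1f 8b 08 04 (gzip magic, deflate, FEXTRA flag set) and
# bytes 12-13 must be "BC" (0x42 0x43), the BGZF extra-subfield ID.
is_bgzf() {
    header=$(od -An -tx1 -N14 "$1" | tr -d ' \n')
    case "$header" in
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}

# If a reference turns out to be plain gzip, one option is to recompress it:
#   gunzip ref.fa.gz && bgzip ref.fa && samtools faidx ref.fa.gz
# (ref.fa.gz is a placeholder name). Another is to pass the uncompressed
# .fa file to --ref-fasta.
```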

@ckandoth
Collaborator

The default locations of these files are based on the VEP installation instructions in the vcf2maf documentation - https://github.com/mskcc/vcf2maf#quick-start - if you used some other method to install VEP, then these files will be located elsewhere.

@cegg
Author

cegg commented Oct 26, 2018

Just executed these:

export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4`
curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL; tar -zxf mskcc-vcf2maf.tar.gz; cd mskcc-vcf2maf-*
perl vcf2maf.pl --man
perl maf2maf.pl --man

No errors, seems good.
Then I tried the test:
$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf
and got the same error:

[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip
[faidx] Could not load fai index of /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
.
.
.
WARNING: Couldn't retrieve bps around 17:7579312 from reference FASTA: /my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
STATUS: Running VEP and writing to: tests/test.vep.vcf
ERROR: Cannot find VEP script in path: /my_user_dir/vep

Yes, I installed VEP following the Ensembl instructions. I am on Mac OS X and cannot follow "this gist" https://gist.github.com/ckandoth/f265ea7c59a880e28b1e533a6e935697 since I do not have apt-get and possibly other Linux tools.

@ckandoth
Collaborator

ok great. If you followed that gist, then use --ref-fasta $VEP_DATA/homo_sapiens/86_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz where $VEP_DATA is whatever location you set that to during VEP installation.

@ckandoth
Collaborator

oh sorry, you said you can't follow that gist. You do not need to run those linux steps. Other folks have had success installing VEP on a Mac.

@ckandoth
Collaborator

You can try this gist for the more recent VEP v92 - https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97

@cegg
Author

cegg commented Oct 30, 2018

Hi again Cyriac,
That gist also mentions apt-get / yum commands that cannot be executed on a Mac, but that does not matter now. When I point to the .fa file explicitly with the command-line switch, the error about "--ref-fasta missing" goes away and two new ones pop up:
$ perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --ref-fasta /Users/ipozdnya/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
Here are the new errors:

Could not load .tbi/.csi index of /Users/my_user_dir/.vep/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz

STATUS: Running VEP and writing to: tests/test.vep.vcf

ERROR: Cannot find VEP script in path: /Users/my_user_dir/vep

The file /Users/my_user_dir/.vep/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz does exist, I checked. I also put the unzipped version of it in the same directory just in case, but that did not help. So, what do you think that .tbi/.csi error is trying to tell me?
A colleague suggested I have to do some formatting along these lines: http://seqanswers.com/forums/showthread.php?t=46131

Thanks

@ckandoth ckandoth self-assigned this Oct 30, 2018
@ckandoth
Collaborator

You do not need the apt-get/yum commands on your Mac. The documentation states those are for Debian/Ubuntu and RHEL/CentOS only.

The .tbi file for ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz is created by running tabix -p vcf on it as documented in the gist.
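
The `.tbi/.csi` message means no index file was found next to the VCF. As a small sketch (the function name `has_tabix_index` is hypothetical), the check below looks for either index and otherwise prints the `tabix -p vcf` command. Note that tabix needs bgzip-compressed input, so a file compressed with plain gzip has to be recompressed with bgzip first.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a non-empty .tbi or .csi index
# sits next to the given VCF.
has_tabix_index() {
    [ -s "$1.tbi" ] || [ -s "$1.csi" ]
}

vcf="/my_user_dir/.vep/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz"
if has_tabix_index "$vcf"; then
    echo "index found for $vcf"
else
    # Only print the command; run it yourself once tabix is installed.
    echo "no index found; try: tabix -p vcf $vcf"
fi
```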

By default, vcf2maf looks for VEP at ~/vep i.e. /Users/my_user_dir/vep. Since you followed different instructions to install VEP somewhere else, you need to point vcf2maf to it, using --vep-path /my_user_dir/ensemble-vep/ensembl-vep. If that does not work, please follow my gist for installing VEP. I cannot help you debug issues, if you deviate from the recommended instructions.
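
To make the overrides concrete, here is a hedged sketch of a wrapper that checks for the `vep` script before calling vcf2maf with explicit paths. It uses only options that appear in this thread (`--vep-path`, `--ref-fasta`, `--filter-vcf`). The function name `has_vep_script` and the directories are placeholders to replace with your own, and the check assumes the script in that directory is named `vep`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a file named "vep" exists in the directory.
has_vep_script() {
    [ -f "$1/vep" ]
}

# Placeholder locations taken from this thread; adjust to your install.
VEP_PATH="/my_user_dir/ensemble-vep/ensembl-vep"
REF_FASTA="/my_user_dir/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"
FILTER_VCF="/my_user_dir/.vep/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz"

if has_vep_script "$VEP_PATH"; then
    perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf \
        --vep-path "$VEP_PATH" --ref-fasta "$REF_FASTA" --filter-vcf "$FILTER_VCF"
else
    echo "no vep script in $VEP_PATH; pass the directory that contains it to --vep-path"
fi
```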

@cegg
Author

cegg commented Oct 31, 2018

Thanks, I think I got through all the VEP installation steps in https://gist.github.com/ckandoth/5390e3ae4ecf182fa92f6318cfa9fa97 successfully, apart from the final test. Just a couple of minor points; probably only #4 needs to be resolved:

  1. On Mac OS X, tar does not have the -i switch (--ignore-zeroes), so I ran the tar commands without it. For some reason I had to run them for each archive separately (samtools, bcftools and htstools). I am not sure whether that has something to do with the absence of the -i switch in my command, but tar -jxf did not do anything in the command with all 3.
  2. In the samtools section, the last command cd .. does not make sense, as the next one refers to a location still in the same directory: bin/liftOver, which will not be found if I do cd ..
  3. VEP install command complains about one test failing:
$ perl INSTALL.pl --AUTO a --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA --NO_HTSLIB
-------------------- EXCEPTION --------------------
MSG: ERROR: Cannot use format gff without Bio::DB::HTS::Tabix module installed

I tried to install Bio::DB::HTS::Tabix by cloning it and running perl INSTALL.pl but that gives:

lzma.h library header not found in /usr/include. Please install it and try again.
On Debian/Ubuntu systems you can do this with the command:
  apt-get install liblzma-dev

Since I am not sure if I need it, I tried to proceed to the last step (item 4).
  4. Test running VEP in offline mode on the provided sample VCFs. I think it does not work because there is another error in it, about something called SIFT:

$ ./vep --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file examples/homo_sapiens_GRCh37.vcf --output_file examples/homo_sapiens_GRCh37.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory

-------------------- EXCEPTION --------------------
**MSG: ERROR: SIFT not available**

STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::check_sift_polyphen /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:168
STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::new /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:121
STACK Bio::EnsEMBL::VEP::CacheDir::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:142
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:121
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:93
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:175
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:123
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime)    = Wed Oct 31 19:11:01 2018
Ensembl API version = 94

If it's worth it, I can move items 3 and 4 into separate issues, since they now have very little to do with the issue that started this thread.
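
On point 1: `tar -xf` reads a single archive, and any further arguments are treated as names of members to extract from it. If the three archives were passed as arguments to one `tar -jxf` call, that may be why nothing useful happened. A loop over the archives is a simple workaround. This is a sketch only: `extract_each` is a hypothetical helper, and it relies on `tar -xf` detecting the compression by itself, which GNU tar and bsdtar both do for regular files.

```shell
#!/bin/sh
# Hypothetical helper: extract every archive given as an argument,
# one tar invocation per archive, into the current directory.
extract_each() {
    for archive in "$@"; do
        tar -xf "$archive" || return 1
    done
}

# Example call (file names are placeholders):
# extract_each samtools-*.tar.bz2 bcftools-*.tar.bz2 htslib-*.tar.bz2
```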

@cegg
Author

cegg commented Nov 6, 2018

I installed Bio::DB::HTS using cpan, so the error about tabix is gone. However, it did not help with the test: same error about SIFT not being available. Sorry, but Google is not much help with that one.

$ ./vep --species homo_sapiens --assembly GRCh37 --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/$VER\_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --input_file examples/homo_sapiens_GRCh37.vcf --output_file examples/homo_sapiens_GRCh37.vep.vcf --polyphen b --af --af_1kg --af_esp --regulatory

-------------------- EXCEPTION --------------------
MSG: ERROR: SIFT not available

STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::check_sift_polyphen /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:168
STACK Bio::EnsEMBL::VEP::AnnotationSource::Cache::Transcript::new /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/Cache/Transcript.pm:121
STACK Bio::EnsEMBL::VEP::CacheDir::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:142
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:121
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:93
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:175
STACK Bio::EnsEMBL::VEP::Runner::init /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:123
STACK Bio::EnsEMBL::VEP::Runner::run /Users/ipozdnya/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel ./vep:224
Date (localtime)    = Tue Nov  6 17:20:19 2018
Ensembl API version = 94
---------------------------------------------------

@ckandoth
Collaborator

Hello. Are you still having the SIFT not available error? Let me know.

@cegg
Author

cegg commented Mar 18, 2019

Hi, thanks for checking. Yes, it's the same error, but I have not changed anything since then. Should I try to upgrade something?
