I want to run locally obtained data files through fidibus and get an empty *.prot.fa file. That does not seem right. What is the matter and how can I fix it?
Yes, you are right to suspect that something is amiss. The answer is most likely a mismatch between your input files in how the protein models are named. Let's illustrate with one of our own examples.
We are interested in the Daphnia pulex genome/annotation as deposited at JGI. We create a working directory and download the relevant files like this:
mkdir ~/DPWORK
cd ~/DPWORK
curl -O http://genome.jgi.doe.gov/Dappu1/download/Daphnia_pulex.fasta.gz
curl -O http://genome.jgi.doe.gov/Dappu1/download/FrozenGeneCatalog20110204.gff3.gz
curl -O http://genome.jgi.doe.gov/Dappu1/download/FrozenGeneCatalog20110204.proteins.fasta.gz
singularity pull --name aegean.simg shub://BrendelGroup/AEGeAn
and then run our fidibus analysis:
singularity exec -e -B ~/DPWORK aegean.simg fidibus \
--workdir=./ \
--numprocs=2 \
--local \
--label=Dpul \
--gdna=Daphnia_pulex.fasta.gz \
--gff3=FrozenGeneCatalog20110204.gff3.gz \
--prot=FrozenGeneCatalog20110204.proteins.fasta.gz \
download prep iloci breakdown stats
only to find that the resulting Dpul.prot.fa file is empty:
[]wc -l Dpul/*prot*
190120 Dpul/Dpul.all.prot.fa
30615 Dpul/Dpul.protein2ilocus.repr.tsv
30811 Dpul/Dpul.protein2ilocus.tsv
0 Dpul/Dpul.prot.fa
30614 Dpul/Dpul.protids.txt
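One way to check for this kind of problem before running fidibus is to count how many identifiers the two annotation files have in common. The following is only a rough sketch: the function names are made up for illustration, and it assumes that the first word of each FASTA header is what should correspond to a Name= value in column 9 of the GFF3 file.

```shell
# Print the first word of every FASTA header, without the leading ">".
fasta_ids() {
  gunzip -c "$1" | grep '^>' | sed -e 's/^>//' -e 's/[[:space:]].*$//' | sort -u
}

# Print the value of every Name= attribute found in column 9 of a GFF3 file.
gff3_names() {
  gunzip -c "$1" | awk -F'\t' '!/^#/ && NF >= 9 {
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
      sub(/^ +/, "", attrs[i])
      if (attrs[i] ~ /^Name=/) print substr(attrs[i], 6)
    }
  }' | sort -u
}

# Count the identifiers that occur in both files (hypothetical helper).
count_shared_ids() {
  tmpdir=$(mktemp -d)
  fasta_ids "$1" > "$tmpdir/fasta.ids"
  gff3_names "$2" > "$tmpdir/gff3.names"
  comm -12 "$tmpdir/fasta.ids" "$tmpdir/gff3.names" | wc -l | tr -d ' '
  rm -rf "$tmpdir"
}
```

For the files in this example, the call would be `count_shared_ids FrozenGeneCatalog20110204.proteins.fasta.gz FrozenGeneCatalog20110204.gff3.gz`. A count of zero, or one far below the number of protein sequences, points to a naming mismatch.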
It takes a bit of detective work, but in the end the problem is a mismatch between the protein names in the *.gff3 file and those in the *proteins.fasta file; compare the FASTA headers in *proteins.fasta with the Name=* tags in the GFF3 file. Even after getting rid of the jgi|Dappu1|| prefix in the FASTA headers, problems remain with names that contain "|" (never a good idea in Linux ...). We can fix all of those issues with
gunzip -c FrozenGeneCatalog20110204.gff3.gz > FrozenGeneCatalog20110204.gff3
sed -e "s/|/./g" FrozenGeneCatalog20110204.gff3 > FrozenGeneCatalog20110204FIXED.gff3
gzip FrozenGeneCatalog20110204FIXED.gff3
\rm FrozenGeneCatalog20110204.gff3
gunzip -c FrozenGeneCatalog20110204.proteins.fasta.gz > FrozenGeneCatalog20110204.proteins.fasta
sed -e "s/^>jgi|Dappu1|[^|]*|/>/" FrozenGeneCatalog20110204.proteins.fasta | sed -e "s/|/./g" > FrozenGeneCatalog20110204FIXED.proteins.fasta
gzip FrozenGeneCatalog20110204FIXED.proteins.fasta
\rm FrozenGeneCatalog20110204.proteins.fasta
and now
singularity exec -e -B ~/DPWORK aegean.simg fidibus \
--workdir=./ \
--numprocs=2 \
--local \
--label=Dpul \
--gdna=Daphnia_pulex.fasta.gz \
--gff3=FrozenGeneCatalog20110204FIXED.gff3.gz \
--prot=FrozenGeneCatalog20110204FIXED.proteins.fasta.gz \
download prep iloci breakdown stats
works correctly:
[]wc -l Dpul/*prot*
190120 Dpul/Dpul.all.prot.fa
30615 Dpul/Dpul.protein2ilocus.repr.tsv
30811 Dpul/Dpul.protein2ilocus.tsv
189210 Dpul/Dpul.prot.fa
30614 Dpul/Dpul.protids.txt
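To see what the two sed substitutions applied to the protein FASTA file do, it can help to run them on a single line. The sketch below wraps the same two expressions in a small function; the function name and the sample header are made up for illustration and are not taken from the JGI files.

```shell
# Apply the two substitutions used for the protein FASTA file to one line:
# first drop the leading "jgi|Dappu1|<field>|" prefix, then turn every
# remaining "|" into ".".
fix_header() {
  printf '%s\n' "$1" | sed -e "s/^>jgi|Dappu1|[^|]*|/>/" | sed -e "s/|/./g"
}

# Made-up header for illustration; prints ">made.up_name"
fix_header '>jgi|Dappu1|12345|made|up_name'
```

Lines without the prefix and without "|" characters, such as the sequence lines, pass through unchanged.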
Please direct all comments and suggestions to Volker Brendel at Indiana University.