stderr=FAILURE: bad fasta character #1466

Open
macmanes opened this issue Aug 20, 2024 · 7 comments

Comments

@macmanes

Hi all, I'm having many jobs fail with this type of error, which seems to indicate a poorly formatted fasta file. One issue is that I can't find the offending file, GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa. I'm not sure whether this is related to some upstream error that was not handled properly.

=========>
        [2024-08-20T08:17:36-0400] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2024-08-20T08:17:36-0400] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host node146.rcchpc.
        [2024-08-20T08:17:36-0400] [MainThread] [I] [toil.worker] Working on job 'run_lastz' kind-run_lastz/a/instance-f8hw5ee6 v10
        [2024-08-20T08:17:37-0400] [MainThread] [I] [toil.worker] Loaded body Job('run_lastz' kind-run_lastz/a/instance-f8hw5ee6 v10) from description 'run_lastz' kind-run_lastz/a/instance-f8hw5ee6 v10
        [2024-08-20T08:17:38-0400] [MainThread] [I] [toil.statsAndLogging] For distance 0.022243274 for genomes files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-6e5a082d6ee9443e87f430a098014ee4/5.fa, files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-18a0b457ba4041949eb618d520e4aebf/5.fa using --step=2 --ambiguous=iupac,100,100 --ydrop=3000 --notransition lastz parameters
        [2024-08-20T08:17:38-0400] [MainThread] [I] [cactus.shared.common] Running the command ['lastz', 'GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa[multiple][nameparse=darkspace]', 'sandy.combined.contigs.arrow.purged_5.fa[nameparse=darkspace]', '--format=paf:minimap2', '--step=2', '--ambiguous=iupac,100,100', '--ydrop=3000', '--notransition']
        [2024-08-20T08:17:38-0400] [MainThread] [I] [toil-rt] 2024-08-20 08:17:38.327880: Running the command: "lastz GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa[multiple][nameparse=darkspace] sandy.combined.contigs.arrow.purged_5.fa[nameparse=darkspace] --format=paf:minimap2 --step=2 --ambiguous=iupac,100,100 --ydrop=3000 --notransition"
        [2024-08-20T08:17:38-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
        [2024-08-20T08:17:38-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-6e5a082d6ee9443e87f430a098014ee4/5.fa' to path '/tmp/toilwf-fffa0892ca3c520588bda60e7418db98/93ae/job/tmphkrn84lv/GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa'
        [2024-08-20T08:17:38-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-18a0b457ba4041949eb618d520e4aebf/5.fa' to path '/tmp/toilwf-fffa0892ca3c520588bda60e7418db98/93ae/job/tmphkrn84lv/sandy.combined.contigs.arrow.purged_5.fa'
        [2024-08-20T08:17:38-0400] [MainThread] [C] [toil.worker] Worker crashed with traceback:
        Traceback (most recent call last):
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/worker.py", line 438, in workerScript
            job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/job.py", line 2984, in _runner
            returnValues = self._run(jobGraph=None, fileStore=fileStore)
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/job.py", line 2895, in _run
            return self.run(fileStore)
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/job.py", line 3158, in run
            rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/cactus/paf/local_alignment.py", line 67, in run_lastz
            segalign_messages = cactus_call(parameters=lastz_cmd, outfile=alignment_file, work_dir=work_dir, returnStdErr=gpu>0, gpus=gpu,
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/cactus/shared/common.py", line 910, in cactus_call
            raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
        RuntimeError: Command ['lastz', 'GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa[multiple][nameparse=darkspace]', 'sandy.combined.contigs.arrow.purged_5.fa[nameparse=darkspace]', '--format=paf:minimap2', '--step=2', '--ambiguous=iupac,100,100', '--ydrop=3000', '--notransition'] exited 1: stderr=FAILURE: bad fasta character in GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa, >id=GCF_011100685.1_UU_Cfam_GSD_1.0|NC_049228.1|81081596|0 (greater than sign ">")
        remove or replace non-ACGTN characters or consider using --ambiguous=iupac


        [2024-08-20T08:17:38-0400] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host node146.rcchpc
<=========

Any help here is greatly appreciated!

@macmanes
Author

Note that searching the original file for illicit characters does not return anything:

grep -v '^>' GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.fna.masked | grep -o '[^ATCGactgNn]'
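The same check can be sketched in Python, which has the advantage of reporting where each unexpected character sits. This is only an illustration: `find_bad_characters` is a hypothetical helper name, and the allowed set simply mirrors the character class in the grep above.

```python
from typing import Iterable, Iterator, Tuple

# Same alphabet as the grep character class: A, C, G, T, N in either case.
ALLOWED = set("ACGTNacgtn")


def find_bad_characters(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (line_number, column, character) for each character outside ALLOWED.

    Lines that start with '>' are treated as headers and skipped,
    just as `grep -v '^>'` does. Line numbers and columns are 1-based.
    """
    for line_number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            continue
        for column, char in enumerate(line.rstrip("\r\n"), start=1):
            if char not in ALLOWED:
                yield line_number, column, char
```

Passing an open file handle as `lines` streams the genome line by line, so the whole file does not need to fit in memory.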

@glennhickey
Collaborator

That's a new one. Given the error says (greater than sign ">"), perhaps it's an empty sequence name that's triggering it? Cactus does its own check for non-agctn characters well upstream of this, so it's likely something less simple.

You should be able to confirm by creating a work directory (I'm using ./work below) and rerunning the failing command with these flags added:

--restart --caching false --cleanWorkDir never --workDir ./work

Then use find ./work to locate the fasta files that are passed to lastz, and inspect them yourself.
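For anyone who prefers to script the search, here is a minimal Python sketch that lists candidate files under the work directory. `find_fasta_files` is a hypothetical helper, and the `.fa` suffix is an assumption based on the file names in the log above.

```python
import os
from typing import List


def find_fasta_files(work_dir: str, suffix: str = ".fa") -> List[str]:
    """Return the sorted paths of all files under work_dir whose name ends with suffix."""
    matches = []
    for root, _dirs, files in os.walk(work_dir):
        for name in files:
            if name.endswith(suffix):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

Calling `find_fasta_files("./work")` after the rerun should surface the chunked inputs, such as the `*_5.fa` files named in the failing lastz command, provided the work directory was kept.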

@macmanes
Author

Yup, so there are issues with some of the fasta files, but not empty headers.

The first lines of one of the offending fasta files (which I wrapped to make parsing easier) look good:

Line 1: >id=GCF_009873245.2_mBalMus1.pri.v3|NC_045787.1|171266408|90000000
Line 2: ctttaatccattttgagtttatttttgtgtgtggtgttaggaagtgttctaatttcattcttttacatgtagctgtccagttttcccagcaccacttatt
Line 3: gaagaggctgtcttttctccactgtatattcttgcctcctttgtcaaagataaggtgaccatatctgcgtgggtttatctctgggctttctatcctgttc

Somewhere in the middle of the file, the newlines around a second fasta entry's header are missing, so the header runs straight into the surrounding sequence:

Line 783808: attgacacgtggcactgaacccagagtggcaagtcttccccgtttcccagagaacccacaattccccgtcctatgtgaaatcccccaagttttaaatacc
Line 783810: GACATTATAGATACATTTGATAATTAAAAGGAATAGTACGTATTCCAGCTAGGAGGAGGAGCCCTCCTTTTCGACTGGTTTTAGTCGATTAAGAAGGTTG
Line 783811: TGGGGTTTTGTATGTATGTTAAGATGATACCAGTTTTTGTCTTCATCACGGCTCTGAGCTGTTCAGATAGCTTATTCATCTAAGGTGAG>id=GCF_009
Line 783812: 873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0acaaggagtagcccccactagccacaactagaggaagtccacatgcagcaat
Line 783813: gaagacacaacgcagccaaaaataaataatgaataaataaataagttaattaattaattaattaaaaaaataagagtagagtggaaattcaggaagttga

I certainly hope this is not a fatal flaw requiring a total restart, but I suspect it is. I'm wondering about the cause here. Any ideas?
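A quick way to find every place where this happens is to look for a ">" anywhere other than the first column of a line. The sketch below is one hedged way to do that; `find_embedded_headers` is a hypothetical helper name, not part of cactus or lastz.

```python
from typing import Iterable, Iterator, Tuple


def find_embedded_headers(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, column) for every '>' that is not the first character of its line.

    Line numbers and columns are 1-based. A legitimate header whose own text
    contains a second '>' would also be reported, so treat hits as candidates.
    """
    for line_number, line in enumerate(lines, start=1):
        # Start searching at index 1 so a normal header marker is ignored.
        index = line.find(">", 1)
        while index != -1:
            yield line_number, index + 1
            index = line.find(">", index + 1)
```

Run against the wrapped file shown above, this would be expected to flag line 783811, where the header begins partway through the sequence line.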

@macmanes
Author

@glennhickey any chance it's the pipes in the fasta headers that are causing this issue?

@glennhickey
Collaborator

Could be. When I try to change some names in the test data to look like yours, I get an error right away:

RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters. The offending header: 'id=simMouse_chr6|873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0' in 'simMouse_chr6'
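The rule quoted in that message can be approximated with a short Python check. This is a sketch of the rule as the error text states it, not cactus's actual implementation: `first_word_is_hub_safe` is a hypothetical name, and treating "alphanumeric" as ASCII letters and digits is an assumption.

```python
import re

# Alphanumeric characters plus '_', '-', ':' and '.', as listed in the error message.
VALID_FIRST_WORD = re.compile(r"[A-Za-z0-9_\-:.]+")


def first_word_is_hub_safe(header: str) -> bool:
    """Return True if the first whitespace-delimited word of a fasta header uses only accepted characters."""
    words = header.lstrip(">").split()
    return bool(words) and VALID_FIRST_WORD.fullmatch(words[0]) is not None
```

Under this check a header such as `>NC_045788.1 some description` passes, while any first word containing `|` or `=` fails.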

@macmanes
Author

It's funny that I don't get that error until much later. Does "sanitize_fasta" deal with |'s? They are sadly common in NCBI downloaded genomes.

Anyway, I went for a clean, full restart to see whether the error is reproducible or whether it was related to some transient read/write issue.

@glennhickey
Collaborator

That check is on by default because the UCSC Genome Browser doesn't (or didn't) support these characters in assembly hubs. If I disable the check by setting checkAssemblyHub="0" in the config, then my test runs through fine:

halStats em.hal --sequenceStats simMouse_chr6
SequenceName, Length, NumTopSegments, NumBottomSegments
873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0, 636262, 46692, 0
873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0, 850, 104, 0
873245.2_mBalMus1.pri.v4|NC_045788.1|144968589|0, 1250, 129, 0

But since the check is on by default, I don't know why it didn't complain for you. It's not in cactus_sanitize_fasta_headers but slightly upstream when running cactus.

In any case, I do not know what caused your error, and suggest double checking your input file. But if you are sure it's cactus causing the problem, please send me the input so I can try to reproduce.
