Suffers from bizarre bug in BioPython #8

rsharris · 2019-09-16T19:18:26Z

In rare cases, the fasta headers in the annotated output can lack one of the fields due to a seriously bizarre bug in BioPython's SeqIO.write() function.

This occurs if the sequence's length happens to be the same as the sequence's name. In this case the description DiscoverY generates, which starts with the length, is misinterpreted inside SeqIO.write() as already including the sequence name, and SeqIO.write() does you the 'favor' of removing that duplication.
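A minimal sketch of the behavior described above, in plain Python rather than BioPython's actual code. The function name fasta_title and the example field Y_specific are hypothetical. The assumption is that the writer compares the first whitespace-separated word of the description with the id and, when they match, treats the description as the complete header.

```python
def fasta_title(seq_id, description):
    """Stand-in for how a FASTA writer might build the header line.

    Assumption: if the description's first word equals the id, the
    description is taken to already contain the id and is used alone.
    """
    words = description.split(None, 1)
    if words and words[0] == seq_id:
        return description
    if description:
        return seq_id + " " + description
    return seq_id


# Ordinary contig name: the header keeps both the name and the length.
normal = fasta_title("contig7", "1500 Y_specific")

# Numeric contig name that equals the length: one field goes missing.
collided = fasta_title("1500", "1500 Y_specific")

print(normal)
print(collided)
```

Under that assumption the first header has three fields and the second has only two, which is the kind of short header that breaks a downstream table parser.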

This obviously can only happen if the contig names are numbers. Unfortunately for me, the output of whatever assembler created my contigs file does use numbers for names, and one of them happened to match the sequence length.

This is a problem because I was attempting to automatically convert the annotations into a table that I could process with other tools (e.g. R). But the table can't be parsed correctly when a header is missing a field, thanks to the favor BioPython has done.

The only useful workaround I can see is that users should be warned (in the README) that their sequence names shouldn't be numbers.

rsharris · 2019-09-17T13:15:27Z

There is a workaround for this, which DiscoverY (and anything else using BioPython to write fasta) should employ. record.id should be included at the start of the description, like this: description=record.id + " " + str(length) + " " ...

See biopython/biopython#2270 for some discussion.
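The workaround can be sketched roughly as follows. This is not DiscoverY's code: the helper build_description and the field Y_specific are hypothetical, and the only point is that the description always begins with the record id.

```python
def build_description(record_id, length, other_fields):
    """Prefix the description with the record id (hypothetical helper)."""
    return " ".join([record_id, str(length)] + list(other_fields))


# Worst case from the report: the contig name equals its length.
description = build_description("1500", 1500, ["Y_specific"])
print(description)

# The first word is now always the id, so a writer that checks for a
# leading id finds it and has no reason to drop the length field.
first_word, rest = description.split(None, 1)
print(first_word == "1500", rest)
```

With BioPython, a string built this way would be passed as the description of the SeqRecord whose id is record.id, so the length survives even when it matches the name.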
