Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusion/Documentation with IndexedFasta.getSequence #316

Open
IanSudbery opened this issue Mar 13, 2017 · 6 comments
Open

Confusion/Documentation with IndexedFasta.getSequence #316

IanSudbery opened this issue Mar 13, 2017 · 6 comments
Assignees

Comments

@IanSudbery
Copy link
Member

I have naively assuming that IndexedFasta.getSequence(contig, "-", start, end) would return the reverse complement of the sequence between contig:start..end. But this is only the case if the obscure mConverter attribute of the IndexedFasta object is set to a coordinate system that includes "both" in its name (e.g. fasta.setConverter(IndexedFasta.getConverter("zero-both-open")) ).

If the converter is not set, fetching fasta on the -ve strand actually returns the seqeuence contig:contig_length-end..contig_length-start.

This is so unobvious that the author of gff2fasta jumps though all sorts of hoops to get the correct sequence out of the fasta rather than use the coordinates converter.

This is a whole world of bugs just waiting to happen (I just ordered a whole load of primers that amplify absolutely nothing because they target completely the wrong sequences).

I recommend that the "zero-both-open" converter should be seq as default on the creation of a IndexedFasta object. But this will require making sure that all uses of IndexedFasta in the collection are compatible with this.

@AndreasHeger
Copy link
Member

Hi @IanSudbery , sorry about this. The reason for the default conversion is our comparative genomics past and follows exonerate output and UCSC practice for some formats such as the chain format, which uses reverse strand coordinates for the reverse strand.

Changing the default should be fine, but indeed will need some thought. I can put it on my todo list.

@IanSudbery
Copy link
Member Author

IanSudbery commented Mar 13, 2017 via email

@AndreasHeger AndreasHeger self-assigned this Mar 13, 2017
@IanSudbery
Copy link
Member Author

@AndreasHeger

Okay, there are still some weird things going on here.

GTF.py assumes that you have not set the converter and does some gymnastics with interval co-ordinates to get around that (It does these wrong BTW, the co-ordinates are off-by-one, so if you do GTF.toSequence on a transcript, the first base of the transcript is missing).

This means you can't open a fasta and use it both to convert your own co-ordinates to sequence and also use the same file in GTF.toSequence.

@AndreasHeger
Copy link
Member

Hi @IanSudbery , sorry about this. The reason for the default conversion is our comparative genomics past and follows exonerate output and UCSC practice for some formats such as the chain format, which uses reverse strand coordinates for the reverse strand.

Changing the default should be fine, but indeed will need some thought. I can put it on my todo list.

1 similar comment
@AndreasHeger
Copy link
Member

Hi @IanSudbery , sorry about this. The reason for the default conversion is our comparative genomics past and follows exonerate output and UCSC practice for some formats such as the chain format, which uses reverse strand coordinates for the reverse strand.

Changing the default should be fine, but indeed will need some thought. I can put it on my todo list.

@AndreasHeger
Copy link
Member

I will add some tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants