FASTA parser should generate unique ids #591

cjmyers · 2019-05-16T19:19:08Z

A FASTA file like the one below will generate three sequences with the same id (_1T38). It should instead generate unique ids.

>1T38:A|PDBID|CHAIN|SEQUENCE
MRGSHHHHHHGSMDKDCEMKRTTLDSPLGKLELSGCEQGLHEIKLLGKGTSAADAVEVPAPAAVLGGPEPLMQCTAWLNA
YFHQPEAIEEFPVPALHHPVFQQESFTRQVLWKLLKVVKFGEVISYQQLAALAGNPKAARAVGGAMRGNPVPILIPSHRV
VCSSGAVGNYSGGLAVKEWLLAHEGHRL
>1T38:B|PDBID|CHAIN|SEQUENCE
GCCATGGCTAGTA
>1T38:C|PDBID|CHAIN|SEQUENCE
TACTAGCCATGGC

jakebeal · 2019-05-16T20:03:09Z

Shouldn't the IDs be 1T38:A, 1T38:B, and 1T38:C?

cjmyers · 2019-05-17T14:27:33Z

Currently the header line is split at ":", with the left side becoming the displayId and the right side the description. I swear I saw this convention mentioned somewhere, but I can no longer find it. I did find this interesting blog post about the issue:

http://www.acgt.me/blog/2013/6/25/the-fasta-file-format-a-showcase-for-the-best-and-worst-of-b.html

Apparently, the whole line must be unique, but no subset of it is guaranteed to be unique. Probably the only solution is to hash the header line to come up with some sort of unique id.
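
To make the collision concrete, here is a minimal Python sketch, not the parser's actual code. It assumes the leading ">" has already been stripped from each header, and it ignores any further normalization of the displayId (such as the leading underscore in _1T38). The names `split_header` and `header_hash` are hypothetical, and SHA-256 is just one possible choice of hash.

```python
import hashlib

# The three header lines from the example file, without the leading ">".
headers = [
    "1T38:A|PDBID|CHAIN|SEQUENCE",
    "1T38:B|PDBID|CHAIN|SEQUENCE",
    "1T38:C|PDBID|CHAIN|SEQUENCE",
]


def split_header(header):
    """Split at the first ':'; the left part is the id, the right part the description."""
    left, _, right = header.partition(":")
    return left, right


def header_hash(header):
    """Hash the whole header line (hypothetical helper; SHA-256 is an assumption)."""
    return hashlib.sha256(header.encode("utf-8")).hexdigest()


# Splitting at ":" gives every record the same id.
split_ids = [split_header(h)[0] for h in headers]

# Hashing the full line gives one distinct value per distinct header.
hashed_ids = [header_hash(h) for h in headers]

print(split_ids)
print(len(set(hashed_ids)))
```

The hashed values are unique as long as the full header lines are, but they carry no trace of the original identifier.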

jakebeal · 2019-05-17T15:08:20Z

I think that making a hash is a good way to check if you're dealing with duplicates or not.

I would strongly suggest staying away from using hashes in names except when forced, however, as that will tend to break the relationship with UIDs used elsewhere. Searching for 1T38, for example, brings up what you'd want to find for these sequences in a whole bunch of databases, so it can't just be delegated to provenance.

My suggestion:

1. Record the most common formats (e.g., NCBI) and try parsing with them to see if we can be assured of having the right ID.
2. Check for duplicates using hashing.
3. If we don't understand the header and have a conflict, then as a fallback, differentiate with ID-hash[6 char hash]
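
The suggestion above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: the `PDB_HEADER` pattern is a guess based only on the three example headers, the function names are hypothetical, duplicates are detected with a Python set (which is hash-based), and the six-character suffix is taken from a SHA-256 digest of the full header line. For simplicity the sketch adds the suffix on any conflict, whether or not the header format was recognized.

```python
import hashlib
import re

# Hypothetical pattern for headers like "1T38:A|PDBID|CHAIN|SEQUENCE".
# A real parser would hold one pattern per known format (e.g., NCBI).
PDB_HEADER = re.compile(
    r"^(?P<entry>[0-9A-Za-z]{4}):(?P<chain>\w+)\|PDBID\|CHAIN\|SEQUENCE$"
)


def short_hash(header, length=6):
    """First `length` hex characters of the SHA-256 digest of the header line."""
    return hashlib.sha256(header.encode("utf-8")).hexdigest()[:length]


def assign_ids(headers):
    """Assign one id per header, adding a hash suffix only on conflict."""
    ids = []
    seen = set()  # ids already handed out
    for header in headers:
        match = PDB_HEADER.match(header)
        if match:
            # Known format: keep the chain so the id stays searchable.
            candidate = f"{match['entry']}:{match['chain']}"
        else:
            # Unknown format: fall back to the text before the first ":".
            candidate = header.partition(":")[0]
        if candidate in seen:
            # Conflict: differentiate as ID-hash. Two identical header
            # lines would still collide here.
            candidate = f"{candidate}-{short_hash(header)}"
        seen.add(candidate)
        ids.append(candidate)
    return ids


print(assign_ids([
    "1T38:A|PDBID|CHAIN|SEQUENCE",
    "1T38:B|PDBID|CHAIN|SEQUENCE",
    "1T38:C|PDBID|CHAIN|SEQUENCE",
]))
```

With this approach the example file keeps the readable ids 1T38:A, 1T38:B, and 1T38:C, and a hash only appears when two headers in an unrecognized format reduce to the same id.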

cjmyers self-assigned this Jul 11, 2019

cjmyers added the change label Jul 11, 2019

cjmyers added this to the SBOL 2.5 milestone Jul 11, 2019
