Currently the header line is split at ":", with the left side becoming the displayId and the right side becoming the description. I swear I saw this mentioned somewhere, but I now cannot find it. I did find this interesting blog post about this issue:
Apparently, the whole line must be unique, but any subset of it is not guaranteed to be unique. Probably the only solution is to take the header line and hash it to try to come up with some sort of unique id.
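As a rough illustration of the hashing idea, here is a minimal Python sketch. The function name `header_hash`, the choice of SHA-256, the 6-character truncation, and the two example headers are all assumptions made for illustration; none of them come from the converter's code.

```python
import hashlib


def header_hash(header: str, length: int = 6) -> str:
    """Return a short hex digest of a whole FASTA header line.

    Hypothetical helper: the leading '>' and surrounding whitespace are
    dropped so that the same header always hashes to the same value.
    """
    normalized = header.lstrip(">").strip()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


# Illustrative headers only; both share the same text before ":".
example_headers = [
    ">1T38:A|PDBID|CHAIN|SEQUENCE",
    ">1T38:B|PDBID|CHAIN|SEQUENCE",
]

for header in example_headers:
    prefix = header.lstrip(">").split(":", 1)[0]
    print(prefix, header_hash(header))
```

Both headers give the same prefix when split at ":", while the hash of the full line differs whenever the lines differ. A short truncated digest can still collide in principle, so this is a way to tell records apart, not a guarantee.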
I think that making a hash is a good way to check if you're dealing with duplicates or not.
I would strongly suggest staying away from using hashes in names except when forced, however, as that will tend to break the relationship with UIDs used elsewhere. Searching for 1T38, for example, brings up what you'd want to find for these sequences in a whole bunch of databases, so it can't just be delegated to provenance.
My suggestion:
1. Record the most common formats (e.g., NCBI) and try parsing with them to see if we can be assured of having the right ID.
2. Check for duplicates using hashing.
3. If we don't understand the header and have a conflict, then as a fallback, differentiate with ID-hash[6 char hash].
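The three steps above could be sketched roughly as follows. This is only one reading of the suggestion: the regular expressions in `KNOWN_FORMATS` are illustrative guesses rather than verified database specifications, the function names are hypothetical, and hashing the whole header line (rather than, say, the sequence) is an assumption.

```python
import hashlib
import re
from collections import Counter
from typing import List, Optional

# Step 1: illustrative patterns for common header layouts (assumptions, not
# verified against any database's documentation).
KNOWN_FORMATS = [
    # PDB-style, e.g. "1T38:A|PDBID|CHAIN|SEQUENCE"
    re.compile(r"^(?P<id>[0-9][A-Za-z0-9]{3}[:_][A-Za-z0-9]+)\|"),
    # accession.version followed by a description, e.g. "NM_001301717.2 ..."
    re.compile(r"^(?P<id>[A-Z]{2}_\d+\.\d+)(?:\s|$)"),
]


def parse_known_id(header: str) -> Optional[str]:
    """Return the ID if the header matches a recorded format, else None."""
    for pattern in KNOWN_FORMATS:
        match = pattern.match(header)
        if match:
            return match.group("id")
    return None


def short_hash(header: str, length: int = 6) -> str:
    """Step 2 helper: short hex digest of the whole header line."""
    return hashlib.sha256(header.encode("utf-8")).hexdigest()[:length]


def assign_ids(raw_headers: List[str]) -> List[str]:
    """Assign an ID per header, adding a hash suffix only as a fallback."""
    parsed = []
    for raw in raw_headers:
        header = raw.lstrip(">").strip()
        known = parse_known_id(header)
        # When the format is not understood, fall back to the text before ":".
        base = known if known is not None else header.split(":", 1)[0]
        parsed.append((header, base, known is not None))

    counts = Counter(base for _, base, _ in parsed)
    ids = []
    for header, base, understood in parsed:
        # Step 3: only unrecognized headers with a conflicting ID get a suffix.
        if not understood and counts[base] > 1:
            ids.append(f"{base}-{short_hash(header)}")
        else:
            ids.append(base)
    return ids


print(assign_ids([
    ">1T38:A|PDBID|CHAIN|SEQUENCE",   # recognized: ID taken from the pattern
    ">sample:first fragment",          # not recognized, conflicting prefix
    ">sample:second fragment",         # not recognized, conflicting prefix
]))
```

Recognized headers keep their database ID untouched, so searching for it elsewhere still works; the hash suffix appears only when nothing better is available. Two records with identical header lines would still receive identical IDs here and would need separate handling.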
A FASTA file like the one below will generate three sequences with the same id (_1T38). It should instead generate unique ids.