FASTA parser should generate unique ids #591

Open
cjmyers opened this issue May 16, 2019 · 3 comments


@cjmyers
Contributor

cjmyers commented May 16, 2019

A FASTA file like the one below will generate three sequences with the same id (_1T38). It should instead generate unique ids.

>1T38:A|PDBID|CHAIN|SEQUENCE
MRGSHHHHHHGSMDKDCEMKRTTLDSPLGKLELSGCEQGLHEIKLLGKGTSAADAVEVPAPAAVLGGPEPLMQCTAWLNA
YFHQPEAIEEFPVPALHHPVFQQESFTRQVLWKLLKVVKFGEVISYQQLAALAGNPKAARAVGGAMRGNPVPILIPSHRV
VCSSGAVGNYSGGLAVKEWLLAHEGHRL
>1T38:B|PDBID|CHAIN|SEQUENCE
GCCATGGCTAGTA
>1T38:C|PDBID|CHAIN|SEQUENCE
TACTAGCCATGGC

@jakebeal
Contributor

Shouldn't the IDs be 1T38:A, 1T38:B, and 1T38:C?

@cjmyers
Contributor Author

cjmyers commented May 17, 2019

Currently the header line is split at ":", with the left side becoming the displayId and the right side the description. I swear I saw this convention mentioned somewhere, but I cannot find it now. I did find this interesting blog post about the issue:

http://www.acgt.me/blog/2013/6/25/the-fasta-file-format-a-showcase-for-the-best-and-worst-of-b.html

Apparently, the whole header line must be unique, but no subset of it is guaranteed to be unique. Probably the only solution is to hash the header line to come up with some sort of unique id.
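
As a rough illustration of the two behaviours discussed in this comment, the Python sketch below first mimics the split at ":" and then derives an id from a hash of the whole header line. The function names, the `id_` prefix, the choice of SHA-256, and the 8-character truncation are all assumptions made for illustration; none of this is the parser's actual code.

```python
import hashlib

# Header lines from the example file in this issue (without the leading '>').
HEADERS = [
    "1T38:A|PDBID|CHAIN|SEQUENCE",
    "1T38:B|PDBID|CHAIN|SEQUENCE",
    "1T38:C|PDBID|CHAIN|SEQUENCE",
]


def split_at_colon(header):
    """Mimic the described behaviour: text before the first ':' is the id,
    the rest is the description."""
    display_id, _, description = header.partition(":")
    return display_id, description


def header_hash_id(header, length=8):
    """Hypothetical alternative: build the id from a hash of the full header.
    The 'id_' prefix and the truncation length are arbitrary choices."""
    digest = hashlib.sha256(header.encode("utf-8")).hexdigest()
    return "id_" + digest[:length]


colon_ids = [split_at_colon(h)[0] for h in HEADERS]
hash_ids = [header_hash_id(h) for h in HEADERS]

print(colon_ids)  # every header collapses to the same id, '1T38'
print(len(set(colon_ids)), "distinct id(s) from the colon split")
print(len(set(hash_ids)), "distinct id(s) from the header hashes")
```

The hash-based ids differ for each header, but they no longer contain the recognisable 1T38 identifier.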

@jakebeal
Contributor

I think that making a hash is a good way to check if you're dealing with duplicates or not.

I would strongly suggest staying away from using hashes in names except when forced, however, as that will tend to break the relationship with UIDs used elsewhere. Searching for 1T38, for example, brings up what you'd want to find for these sequences in a whole bunch of databases, so it can't just be delegated to provenance.

My suggestion:

  1. Record the most common formats (e.g., NCBI) and try parsing with them to see if we can be assured of having the right ID.
  2. Check for duplicates using hashing.
  3. If we don't understand the header and have a conflict, then as a fallback, differentiate with ID-hash[6 char hash]
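
The three steps above could be sketched roughly as follows in Python. This is only one possible reading of the suggestion: the regular expressions are illustrative and unvetted, the function names are hypothetical, `ID-hash[6 char hash]` is read as the ID followed by a hyphen and six hash characters, and raising an error on an exact duplicate header is an assumption. The thread also does not say what to do when a recognised header still collides; the sketch applies the same suffix in that case.

```python
import hashlib
import re

# Step 1: hypothetical patterns for common header formats (not a vetted list).
KNOWN_FORMATS = [
    # PDB-style, e.g. 1T38:A|PDBID|CHAIN|SEQUENCE -> id "1T38:A"
    re.compile(r"^(?P<id>[0-9][A-Za-z0-9]{3}:[A-Za-z0-9]+)\|PDBID\|CHAIN\|SEQUENCE$"),
    # Legacy NCBI-style, e.g. gi|<number>|<db>|<accession>|... -> id is the accession
    re.compile(r"^gi\|\d+\|[a-z]+\|(?P<id>[^|\s]+)\|"),
]


def parse_known(header):
    """Return the id if the header matches a known format, else None."""
    for pattern in KNOWN_FORMATS:
        match = pattern.match(header)
        if match:
            return match.group("id")
    return None


def full_hash(header):
    return hashlib.sha256(header.encode("utf-8")).hexdigest()


def assign_ids(headers):
    """Assign one id per header line (headers given without the leading '>')."""
    ids = []
    seen_ids = set()
    seen_header_hashes = set()
    for header in headers:
        # Step 2: detect exact duplicate headers by hashing the whole line.
        digest = full_hash(header)
        if digest in seen_header_hashes:
            raise ValueError(f"duplicate header line: {header!r}")
        seen_header_hashes.add(digest)

        candidate = parse_known(header)
        if candidate is None:
            # Unknown format: fall back to the split at ':' described earlier.
            candidate = header.partition(":")[0]

        # Step 3: on a conflict, append a 6-character hash of the header.
        if candidate in seen_ids:
            candidate = f"{candidate}-{digest[:6]}"
        seen_ids.add(candidate)
        ids.append(candidate)
    return ids


pdb_ids = assign_ids([
    "1T38:A|PDBID|CHAIN|SEQUENCE",
    "1T38:B|PDBID|CHAIN|SEQUENCE",
    "1T38:C|PDBID|CHAIN|SEQUENCE",
])
print(pdb_ids)  # ['1T38:A', '1T38:B', '1T38:C']

unknown_ids = assign_ids(["sample:first read", "sample:second read"])
print(unknown_ids)  # first stays 'sample', second gets a 'sample-' + hash suffix
```

Any sanitising needed to turn these strings into valid displayIds is left out of the sketch.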

cjmyers self-assigned this Jul 11, 2019
cjmyers added the change label Jul 11, 2019
cjmyers added this to the SBOL 2.5 milestone Jul 11, 2019
