`highquality_cluster30` - fragmented sequences split on undetermined aminoacid #53

valentynbez · 2024-03-27T15:48:14Z

Hello!
I tried using highquality_clust30 as a reference and ran into the following issue.
The database has around 200k repeated entries. They appear to be fragmented proteins, split at each undetermined amino acid (X).
(I removed the additional information from the headers; only unique MG IDs are stored in my FASTA files, for indexing with `samtools faidx`.)
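To make the "repeated entries" claim easy to check, here is a minimal sketch that counts how often each ID occurs in a lookup file. It assumes the layout shown in the examples in this issue (whitespace-separated columns, ID in the second column); the function name `find_repeated_ids` is hypothetical and not part of foldcomp.

```python
from collections import Counter


def find_repeated_ids(lookup_lines):
    """Return {id: count} for IDs that appear on more than one lookup line.

    Assumes whitespace-separated columns with the ID in the second column,
    as in the `grep` output shown in this issue.
    """
    counts = Counter()
    for line in lookup_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Lookup lines copied from the two examples in this issue.
example_lookup = [
    "32543322        MGYP003384474486        0",
    "32543327        MGYP003384474486        0",
    "32543390        MGYP003384474486        0",
    "32543528        MGYP003384474486        0",
    "32543587        MGYP003384474486        0",
    "31381065        MGYP003343806611        0",
    "31381071        MGYP003343806611        0",
]

print(find_repeated_ids(example_lookup))
```

Since an open file is an iterable of lines, the same function can be pointed at the full file, for example `find_repeated_ids(open("highquality_clust30.lookup"))`.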

Example 1

> grep "MGYP003384474486" highquality_clust30.lookup
32543322        MGYP003384474486        0
32543327        MGYP003384474486        0
32543390        MGYP003384474486        0
32543528        MGYP003384474486        0
32543587        MGYP003384474486        0
> zgrep -A 1 "MGYP003384474486" highquality_clust30.fasta.gz
>MGYP003384474486
MFSSKCNLCR
--
>MGYP003384474486
IDQER
--
>MGYP003384474486
KYNEVKIY
--
>MGYP003384474486
ETIIGIYDF
--
>MGYP003384474486
FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK

When I query the ESM API, I get:

{"sequence": "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFXFLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"}
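As a quick check of the "split on X" hypothesis, the sketch below splits the sequence returned by the ESM API at every `X` and compares the pieces with the fragments found in the FASTA file. The helper `split_on_x` is a hypothetical name used only for illustration; this is not a claim about how the database was actually built.

```python
def split_on_x(sequence):
    """Split a protein sequence at every undetermined residue (X).

    Empty pieces (from a leading, trailing, or doubled X) are dropped.
    """
    return [fragment for fragment in sequence.split("X") if fragment]


# Sequence returned by the ESM API for MGYP003384474486 (from this issue).
api_sequence = (
    "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFX"
    "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"
)

# Fragments stored under the same ID in highquality_clust30.fasta.gz.
fasta_fragments = [
    "MFSSKCNLCR",
    "IDQER",
    "KYNEVKIY",
    "ETIIGIYDF",
    "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK",
]

# True if splitting on X reproduces the database entries exactly.
print(split_on_x(api_sequence) == fasta_fragments)
```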

Example 2

> grep "MGYP003343806611" highquality_clust30.lookup                                                                                                                        
31381065        MGYP003343806611        0
31381071        MGYP003343806611        0
> zgrep -A 1 "MGYP003343806611" highquality_clust30.fasta.gz
>MGYP003343806611
MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA
--
>MGYP003343806611
YY

ESM API

{"sequence": "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWAXYYX"}
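A possible workaround is to group the fragments by ID and rejoin them with `X`. The sketch below does this under two assumptions that are not confirmed anywhere in this issue: the fragments appear in the file in their original order, and each split corresponds to exactly one `X`. The names `group_fragments` and `rejoin` are hypothetical. Example 2 also shows that the reconstruction is lossy, because a trailing `X` leaves no fragment behind.

```python
def group_fragments(fasta_lines):
    """Collect sequences by header ID, keeping the order of appearance."""
    groups = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line or line == "--":
            continue  # skip blank lines and grep's "--" group separator
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            groups.setdefault(current, []).append("")
        elif current is not None:
            groups[current][-1] += line  # supports multi-line records
    return groups


def rejoin(fragments, placeholder="X"):
    """Rejoin fragments, assuming one placeholder residue per split."""
    return placeholder.join(fragments)


# zgrep output for MGYP003343806611, copied from Example 2.
example_records = [
    ">MGYP003343806611",
    "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA",
    "--",
    ">MGYP003343806611",
    "YY",
]
api_sequence_2 = (
    "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWAXYYX"
)

rebuilt = rejoin(group_fragments(example_records)["MGYP003343806611"])
# The rebuilt sequence lacks the trailing X present in the API response.
print(rebuilt == api_sequence_2, rebuilt + "X" == api_sequence_2)
```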


- a hack using `sed` to correct headers in the database
- 5-minute alignment of 5.7k proteins against the database (tolerable)
- greatly increased coverage of structures
- silenced warnings in faidx
- database contains fragmented sequences steineggerlab/foldcomp#53
- fixes #80

valentynbez changed the title ~~highquality_cluster30 - fragmented sequences~~ highquality_cluster30 - fragmented sequences split on X aminoacid Mar 27, 2024

valentynbez changed the title ~~highquality_cluster30 - fragmented sequences split on X aminoacid~~ highquality_cluster30 - fragmented sequences split on undetermined aminoacid Mar 27, 2024

khb7840 added the help wanted Extra attention is needed label Apr 2, 2024

valentynbez mentioned this issue Jul 9, 2024

Errors in highquality_clust30 #56

Open
