Subsetting databases #39

patrickbryant1 · 2023-08-09T07:58:01Z

Hi,

Thank you for the great resource!

I am having trouble subsetting databases and decompressing subsets of the databases you provide here: https://foldcomp.steineggerlab.workers.dev

According to the instructions, I should be able to decompress a subset of a database given an "id_list.txt".

This is how I do it for e.g. A. thaliana:

head -n 1 data/a_thaliana.lookup
0 AF-A0A178UFC4-F1-model_v4.pdb 0

As I understand it, the ID here is "AF-A0A178UFC4-F1-model_v4".

Now, I write this into a file called id_list.txt, then I run the command:
foldcomp decompress --id-list id_list.txt data/a_thaliana

with the response:
Decompressing files in data/a_thaliana using 1 threads
Output directory: data/a_thaliana_pdb/
[Warning] AF-A0A178UFC4-F1-model_v4 not found in database.
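
For anyone reproducing this, an ID list like the one above can be generated from the lookup file instead of being typed by hand. The following is a minimal sketch, assuming the lookup file is whitespace-separated with the entry name in the second column (as the `head` output above suggests). `names_from_lookup` and its `strip_suffix` parameter are hypothetical helpers, not part of Foldcomp, and this thread does not establish whether the `.pdb` suffix should be kept or dropped.

```python
def names_from_lookup(lines, strip_suffix=".pdb"):
    """Return entry names taken from the second column of lookup lines.

    Assumption: each line looks like "<index> <name> <set>", separated by
    whitespace. Pass strip_suffix=None to keep names exactly as stored.
    """
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name = fields[1]
        if strip_suffix and name.endswith(strip_suffix):
            name = name[: -len(strip_suffix)]
        names.append(name)
    return names


def write_id_list(lookup_path, out_path, limit=None):
    """Write up to `limit` names from a lookup file, one per line."""
    with open(lookup_path) as fh:
        names = names_from_lookup(fh)
    with open(out_path, "w") as out:
        for name in names[:limit]:
            out.write(name + "\n")
```

Calling `write_id_list("data/a_thaliana.lookup", "id_list.txt", limit=1)` would reproduce the single-entry list used here.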

I have tried many different ways of naming the IDs based on what is in a_thaliana.lookup, but nothing seems to work. The same happens when using mmseqs to subset the database:
"""
createsubdb --subdb-mode 0 --id-mode 1 id_list.txt a_thaliana test_sel/output_foldcomp_db

MMseqs Version: ad6dfc66d7bbc4fd626fc19adf10ba587bc137c4
Subdb mode 0
Database ID mode 1
Verbosity 3

Could not find name AF-A0A178UFC4-F1-model_v4 in lookup
Time for merging to output_foldcomp_db: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 34ms
"""

Can you please explain what I am doing wrong and how to properly specify the IDs?

Best,

Patrick
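
A quick way to narrow down "not found" messages like the two above is to compare the requested IDs against the names stored in the lookup file. The sketch below is only a diagnostic aid under stated assumptions: `classify_ids` is a hypothetical helper, it assumes the names come from the second column of the lookup file, and it does not reproduce how Foldcomp or MMseqs2 actually match IDs.

```python
def classify_ids(requested, lookup_names, ext=".pdb"):
    """Sort requested IDs into exact matches, extension-only mismatches, and misses.

    requested:    IDs as written in id_list.txt
    lookup_names: names as stored in the lookup file (assumed second column)
    """
    known = set(lookup_names)
    report = {"exact": [], "ext_mismatch": [], "missing": []}
    for rid in requested:
        if rid in known:
            report["exact"].append(rid)
        elif rid + ext in known or (rid.endswith(ext) and rid[: -len(ext)] in known):
            # The ID differs from a stored name only by the file extension.
            report["ext_mismatch"].append(rid)
        else:
            report["missing"].append(rid)
    return report
```

If every requested ID lands in `exact` and the tool still reports it as missing, the problem is more likely in the tool or the database files than in the ID list.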

patrickbryant1 · 2023-08-09T08:19:47Z

I noticed that this seems to work with afdb_rep_v4. Perhaps something is missing from the reference genomes?

khb7840 · 2023-08-09T11:22:49Z

I'm sorry, there was a bug in assigning the mode for database reading. Thank you for reporting this. Please check whether it is solved in the latest version.

patrickbryant1 · 2023-08-09T12:23:09Z

Hi,
Great - thanks.
What do you mean by the latest version:

Of the database from https://foldcomp.steineggerlab.workers.dev
Of Foldcomp
Something else(?)

khb7840 · 2023-08-09T13:55:35Z

The latest version of Foldcomp. Subsetting 'a_thaliana' should work with Foldcomp built from the latest commit.

patrickbryant1 · 2023-08-09T14:14:51Z

Ok, great. Does this include the binaries you distribute or only the pip installation/git clone?
Do you know why mmseqs2 seems to fail on the same files? Is there something missing in the subsetting instructions there as well?

khb7840 · 2023-08-09T14:44:04Z

Please use git clone to get the latest update. The Python distribution has not been updated with the latest commit. For the mmseqs2 part, I'm not sure what happened. I'll check this with the mmseqs2 developers.

patrickbryant1 · 2023-08-09T14:58:37Z

Ok, thanks for the help!

khb7840 added a commit that referenced this issue Aug 9, 2023

Fix for Subsetting databases #39

e856340

github-actions bot pushed a commit that referenced this issue Aug 9, 2023

Fix for Subsetting databases #39

50fc718

khb7840 added the bug label Aug 9, 2023

valentynbez mentioned this issue Nov 7, 2023

Database extraction failed #42

Open
