Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature.pkl.gz file not found #11

Open
maxochondria opened this issue Sep 11, 2023 · 13 comments
Open

Feature.pkl.gz file not found #11

maxochondria opened this issue Sep 11, 2023 · 13 comments

Comments

@maxochondria
Copy link

maxochondria commented Sep 11, 2023

Thank you for developing such a promising software for the scientific community.
I tried running Alphalink2 with several datasets, including the rpoa-rpoc dataset used in the bioRxiv manuscript but keep encountering the error below related to the creation of the feature.pgl.gz files. The file is either not created when running a monomeric prediction or only one of the file is created when running a multimeric prediction. I tried several iterations with different file names, fasta header names, crosslink file names, etc (no underscore, no dash, only A B C D... as fasta headers, etc ) but none of these changes solved the problem. I know that this issue has been raised and closed before but the previous fix doesn't seem to fix the problem in this case. Is there a specific naming convention I should adhere to?

I0911 22:57:46.547931 140125146163008 utils.py:40] Finished hmmbuild query in 1.798 seconds
I0911 22:57:46.552916 140125146163008 hmmsearch.py:117] Launching sub-process ['/usr/conda/envs/alphalink/bin/hmmsearch', '--noali', '--cpu', '8', '--F1', '0.1', '--F2', '0.1', '--F3', '0.1', '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100', '-A', '/tmp/tmp1hjv21c0/output.sto', '/tmp/tmp1hjv21c0/query.hmm', '/home/anatale/alphafold2/alphafold_data//pdb_seqres/pdb_seqres.txt']
I0911 22:57:46.553389 140125146163008 utils.py:36] Started hmmsearch (pdb_seqres.txt) query
I0911 22:58:13.020413 140125146163008 utils.py:40] Finished hmmsearch (pdb_seqres.txt) query in 26.467 seconds
run_alphalink.2.sh: line 24: 29939 Killed $al_python $al_code_dir/unifold/homo_search.py --fasta_path=$fasta_path --max_template_date=$max_template_date --output_dir=$output_dir_base --uniref90_database_path=$database_dir/uniref90/uniref90.fasta --mgnify_database_path=$database_dir/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$database_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --uniclust30_database_path=$database_dir/uniclust30/uniclust30_2018_08/uniclust30_2018_08 --uniprot_database_path=$database_dir/uniprot/uniprot.fasta --pdb_seqres_database_path=$database_dir/pdb_seqres/pdb_seqres.txt --template_mmcif_dir=$database_dir/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$database_dir/pdb_mmcif/obsolete.dat --use_precomputed_msas=True
Starting prediction...
2023-09-11 22:59:26.162401: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
start to load params /home/anatale/alphafold2/alphafold_data/AlphaLink-Multimer_SDA_v3.pt
start to predict RpoaRpoc
Traceback (most recent call last):
File "/home/mlevasseur/AlphaLink2/inference.py", line 363, in
main(args)
File "/home/mlevasseur/AlphaLink2/inference.py", line 141, in main
batch = load_feature_for_one_target(
File "/home/mlevasseur/AlphaLink2/inference.py", line 74, in load_feature_for_one_target
batch, _ = load_and_process(
File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 292, in load_and_process
features, labels = load(**load_kwargs, mode=mode, is_monomer=is_monomer)
File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 194, in load
all_chain_features = [
File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 195, in
load_single_feature(s, monomer_feature_dir, mode, uniprot_msa_dir, is_monomer)
File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 33, in wrapper
return copy_lib.copy(cached_func(*args, **kwargs))
File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 78, in load_single_feature
monomer_feature = utils.load_pickle(
File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 33, in wrapper
return copy_lib.copy(cached_func(*args, **kwargs))
File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 67, in load_pickle
ret = load(path)
File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 64, in load
with open_fn(path, "rb") as f:
File "/usr/conda/envs/alphalink/lib/python3.10/gzip.py", line 58, in open
binary_file = GzipFile(filename, gz_mode, compresslevel)
File "/usr/conda/envs/alphalink/lib/python3.10/gzip.py", line 174, in init
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/mlevasseur/proteomics/XL/benchmark4/RpoaRpoc/B.feature.pkl.gz'

@lhatsk
Copy link
Collaborator

lhatsk commented Sep 12, 2023

Hi,
It's hard for me to tell what's going on. But it looks like the process that generates the features gets killed for some reason by your OS/ submission system: "run_alphalink.2.sh: line 24: 29939 Killed"

Do you maybe run out of memory?

@maxochondria
Copy link
Author

Thanks for such a prompt reply! That indeed seems to be the problem. We ran another test which maxed out our 32GB of memory and crashed. How much memory do you recommend?

@lhatsk
Copy link
Collaborator

lhatsk commented Sep 13, 2023

I honestly have no idea how much memory is generally required. I use a cluster and I'm fine with 80GB most of the time, for larger targets I increase it to 300GB. Maybe switching to the reduced_dbs setting in unifold/homo_search.py can limit the memory consumption. This setting requires the small_bfd database.

@Bhaddad93
Copy link

Hi All, firstly thank you for creating alphalink2!

Unfortunately, I too am having problems with feature generation, and I have ensured all naming-schemes are consistent. The issue appears to be that it skips feature generation for 'seq_A', but completes for 'seq_B' and 'seq_C'..

input Fasta (seq.fasta):

A
....
B
....
C
....

chains.txt:
'A B C'

yet, I get the error that: a) chains.txt isn't found in outdir/seq/, (chains.txt exists in one folder above). b) when I copy chains.txt to outdir/seq/ I get the error that it cannot find A.features.pkl.gz, and indeed it doesn't exist, though B.features.pkl.gz and C.features.pkl.gz exist. I cannot understand why, presumably somewhere in unifold/homo_search.py, it is skipping the first fasta sequence, and going to the second.

+ run_alphalink.sh /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq.fasta /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/xlinks.patti.A2B.pkl.gz /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/ /home/bhaddad/alphalink/AlphaLink2/../AlphaLink-Multimer_SDA_v2.pt /home/bhaddad/alphafold/DataBases 2020-05-01 2024-06-21 12:02:00.660510: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. /home/bhaddad/miniconda3/envs/alphalink/lib/python3.10/site-packages/Bio/Data/SCOPData.py:18: BiopythonDeprecationWarning: The 'Bio.Data.SCOPData' module will be deprecated in a future release of Biopython in favor of 'Bio.Data.PDBData. warnings.warn( I0621 12:02:29.087209 139711209805632 templates.py:945] Using precomputed obsolete pdbs /home/bhaddad/alphafold/DataBases/pdb_mmcif/obsolete.dat. I0621 12:02:29.132328 139711209805632 utils.py:74] Mapping multi-chains fasta with chain order: A B C I0621 12:02:29.137532 139711209805632 homo_search.py:161] searching homogeneous Sequences & structures for seq_B... I0621 12:02:29.143747 139711209805632 jackhmmer.py:140] Launching subprocess "/home/bhaddad/miniconda3/envs/alphalink/bin/jackhmmer -o /dev/null -A /tmp/tmp42vu50n8/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq_B.fasta /home/bhaddad/alphafold/DataBases/uniref90/uniref90.fasta" I0621 12:02:29.158784 139711209805632 utils.py:36] Started Jackhmmer (uniref90.fasta) query

@lhatsk
Copy link
Collaborator

lhatsk commented Jun 22, 2024

Hi,

Thanks for reporting this! There was a bug with chains.txt that should be fixed now. Please try again (ideally with a fresh output directory). For the rest, it's hard for me to tell what's going on. Chains should only be skipped if they are homomeric but here it would skip B instead of A. Can you share the FASTA file? What does the directory look like?

@Bhaddad93
Copy link

Thank you for such a quick response! I am currently performing a fresh install of AlphaLink2, and I will provide an update as soon as it is run!

Unfortunately the PI of the project would prefer I keep the sequences 'close to the vest' until publication (currently under review), but I can say with the utmost confidence that chains 'B' and 'C' are identical. What appears to be happening, is it skips 'A' for whatever reason, performs featurization for chain B, then simply copies the features of chain B for chain C (predictably).

Chains B/C are the longer of the two unique sequences, with 367 AA. (fake sequence added below)

seq.fasta:
1 >chainA
2 GCEIGFIPSPVKLENMRLQHEQRAKQHTPPYDVVPSMRPVVLFI
3 >chainB
4 PGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTF
5 >chainC
6 PGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTF

@lhatsk
Copy link
Collaborator

lhatsk commented Jun 22, 2024

Ok, no worries! Yes, the behaviour for B and C is expected. Strange about A, I will run a test.

@Bhaddad93
Copy link

Bhaddad93 commented Jun 22, 2024

I will allow the job to continue to see how far it gets, however the problem appears to persist. For what it is worth, I believe the problem lies in unifold/msa/utils.py get_chain_id_map, I believe that 'mapping' is not being properly built, and that after line 30, the following should be added mapping[unique[seq]].append(chain_id) within the 'if not' statment... then again, when I tried this modification wasn't solved... so maybe it's not the right place to look. I believe this, because when I ask "/unifold/homo_search.py" to print 'chain_mapping' it returns {'B': [C]}, thus it is the only key which is fed into generate_pkl_features.

`(alphalink) Galvani [bhaddad@galvani ~/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB]$ tail -f mpi-err.34171

  • mkdir /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0
  • cp /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/INPUTS/seq.fasta /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0
  • for i in 0 1
  • '[' '!' -d dir_1 ']'
  • mkdir /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_1
  • cp /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/INPUTS/seq.fasta /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_1
  • for i in 0 1
  • sleep 3m
  • CUDA_VISIBLE_DEVICES=0
  • run_alphalink.sh /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq.fasta /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/xlinks.patti.A2B.pkl.gz /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/ /home/bhaddad/alphalink/AlphaLink2/../AlphaLink-Multimer_SDA_v2.pt /home/bhaddad/alphafold/DataBases 2020-05-01
    2024-06-22 17:54:32.316250: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    /home/bhaddad/miniconda3/envs/alphalink/lib/python3.10/site-packages/Bio/Data/SCOPData.py:18: BiopythonDeprecationWarning: The 'Bio.Data.SCOPData' module will be deprecated in a future release of Biopython in favor of 'Bio.Data.PDBData.
    warnings.warn(
    I0622 17:54:57.158612 139869933049664 templates.py:945] Using precomputed obsolete pdbs /home/bhaddad/alphafold/DataBases/pdb_mmcif/obsolete.dat.
    I0622 17:54:57.204150 139869933049664 utils.py:74] Mapping multi-chains fasta with chain order: A B C
    I0622 17:54:57.209880 139869933049664 homo_search.py:161] searching homogeneous Sequences & structures for seq_B...
    I0622 17:54:57.215226 139869933049664 jackhmmer.py:140] Launching subprocess "/home/bhaddad/miniconda3/envs/alphalink/bin/jackhmmer -o /dev/null -A /tmp/tmpuhnb4ao0/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq_B.fasta /home/bhaddad/alphafold/DataBases/uniref90/uniref90.fasta"
    I0622 17:54:57.227148 139869933049664 utils.py:36] Started Jackhmmer (uniref90.fasta) query
    `

@lhatsk
Copy link
Collaborator

lhatsk commented Jun 22, 2024

Yeah, I know what the issue is. I pushed a fix. Sorry for the inconvenience! Hope this resolves everything...

@Bhaddad93
Copy link

Indeed it appears to be working! Thank you very much!

@Bhaddad93
Copy link

So a new problem has cropped up that seems perhaps unrelated. I thought to post it here, but also thought it may be appropriate as its own issue.

I am receiving an index-error

I0622 23:21:36.164898 139698480359232 utils.py:75] Mapping multi-chains fasta with chain order: A B C
I0622 23:21:36.170187 139698480359232 homo_search.py:161] searching homogeneous Sequences & structures for seq_A...
I0622 23:21:36.171731 139698480359232 homo_search.py:202] Final timings for seq_A: {}
I0622 23:21:36.174052 139698480359232 homo_search.py:161] searching homogeneous Sequences & structures for seq_B...
I0622 23:21:36.175636 139698480359232 homo_search.py:202] Final timings for seq_B: {}
2024-06-22 23:21:39.438556: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 363, in <module>
    main(args)
  File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 141, in main
    batch = load_feature_for_one_target(
  File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 74, in load_feature_for_one_target
    batch, _ = load_and_process(
  File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 292, in load_and_process
    features, labels = load(**load_kwargs, mode=mode, is_monomer=is_monomer)
  File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 217, in load
    xl = bin_crosslinks(tp,size)
  File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 178, in bin_crosslinks
    xl_[r1,r2,0] = xl_[r2,r1,0] = torch.bucketize(1-fdr, buckets)
IndexError: index 1326 is out of bounds for axis 0 with size 1021
$ ls
dir_0  dir_1  INPUTS  jobscript_colabfold.sh  mpi-err.34172  mpi-err.34173  mpi-out.34172  mpi-out.34173  xlinks.patti.A2B.pkl.gz

$ ls dir_0
A                 A.timings.json    B                 B.timings.json    C.feature.pkl.gz   chains.txt        seq_A.fasta  seq_C.fasta
A.feature.pkl.gz  A.uniprot.pkl.gz  B.feature.pkl.gz  B.uniprot.pkl.gz  chain_id_map.json  C.uniprot.pkl.gz  seq_B.fasta  seq.fasta

$ ls dir_0/A
bfd_uniclust_hits.a3m  mgnify_hits.sto  pdb_hits.sto  uniprot_hits.sto  uniref90_hits.sto
run_alphalink.sh $MYCWD/$DIRBASE$i/${seqList[$i]}	\
                                         $MYCWD/xlinks.patti.A2B.pkl.gz 	\
					 $MYCWD/$DIRBASE$i/ 			\
					 $AFL/../AlphaLink-Multimer_SDA_v2.pt	\
					 /home/bhaddad/alphafold/DataBases	\
					 2020-05-01 &

@lhatsk
Copy link
Collaborator

lhatsk commented Jun 23, 2024

Hm, that would indicate a problem with the crosslinking data. The error says that the (mapped) crosslinking data overshoots the (combined) target length by more than 300 amino acids.

How long are your chains? Is the crosslinking data well aligned with the FASTA sequences? Ie no tags etc.
chains.txt contains A B C, right?

Because I just noticed it: Your crosslinking pickle says A2B but your example above was A1B2, is the chain mapping off by chance?

@Bhaddad93
Copy link

Indeed, this is a case of user-error!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants