Feature.pkl.gz file not found #11
Comments
Hi, are you maybe running out of memory?
Thanks for such a prompt reply! That indeed seems to be the problem. We ran another test which maxed out our 32 GB of memory and crashed. How much memory do you recommend?
I honestly have no idea how much memory is generally required. I use a cluster and I'm fine with 80 GB most of the time; for larger targets I increase it to 300 GB. Maybe switching to the reduced_dbs setting in unifold/homo_search.py can limit the memory consumption. This setting requires the small_bfd database.
Hi all, firstly thank you for creating AlphaLink2! Unfortunately, I too am having problems with feature generation, and I have ensured all naming schemes are consistent. The issue appears to be that it skips feature generation for 'seq_A', but completes it for 'seq_B' and 'seq_C'. Input FASTA (seq.fasta):
chains.txt:
Yet I get the following errors: a) chains.txt isn't found in outdir/seq/ (chains.txt exists one folder above); b) when I copy chains.txt to outdir/seq/, it cannot find A.features.pkl.gz, and indeed that file doesn't exist, though B.features.pkl.gz and C.features.pkl.gz do. I cannot understand why; presumably somewhere in unifold/homo_search.py it is skipping the first FASTA sequence and going to the second.
Hi, thanks for reporting this! There was a bug with chains.txt that should be fixed now. Please try again (ideally with a fresh output directory). For the rest, it's hard for me to tell what's going on. Chains should only be skipped if they are homomeric, but in that case it would skip B instead of A. Can you share the FASTA file? What does the directory look like?
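To answer the "what does the directory look like" question quickly, a small check like the one below can list which per-chain feature files are missing. It is only a sketch: the function name `missing_feature_files` is made up for illustration, and the only thing taken from this thread is the `<chain>.features.pkl.gz` naming seen above (A.features.pkl.gz, B.features.pkl.gz, ...).

```python
from pathlib import Path


def missing_feature_files(out_dir, chain_ids):
    """Return the chain IDs that have no <chain>.features.pkl.gz in out_dir.

    Hypothetical helper, not part of AlphaLink2. It only checks that the
    files exist; it does not open or validate them.
    """
    out = Path(out_dir)
    return [
        chain
        for chain in chain_ids
        if not (out / f"{chain}.features.pkl.gz").is_file()
    ]
```

For the case described above, calling it with the output directory and `["A", "B", "C"]` would report `["A"]` if only the B and C files were written.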
Thank you for such a quick response! I am currently performing a fresh install of AlphaLink2, and I will provide an update as soon as it has run. Unfortunately, the PI of the project would prefer I keep the sequences 'close to the vest' until publication (currently under review), but I can say with the utmost confidence that chains 'B' and 'C' are identical. What appears to be happening is that it skips 'A' for whatever reason, performs featurization for chain B, then simply copies the features of chain B for chain C (predictably). Chains B/C are the longer of the two unique sequences, with 367 AA. (Fake sequence added below.) seq.fasta:
Ok, no worries! Yes, the behaviour for B and C is expected. Strange about A, I will run a test.
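The expected behaviour for identical chains can be illustrated with a short sketch: group FASTA records by sequence, so that only the first chain of each unique sequence needs featurization and later copies reuse it. This is not AlphaLink2's actual code; `parse_fasta` and `unique_chain_map` are hypothetical names, and the sequences in the usage note are placeholders.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the previous one, if any.
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:].strip(), []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def unique_chain_map(records):
    """Map each header to the first header that has the same sequence."""
    first_seen = {}
    mapping = {}
    for header, seq in records:
        first_seen.setdefault(seq, header)
        mapping[header] = first_seen[seq]
    return mapping
```

With three records where B and C share a sequence, the map sends A to A, B to B, and C to B. Under this logic the first chain always gets its own features, so a missing A file is not explained by deduplication.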
I will allow the job to continue to see how far it gets; however, the problem appears to persist. For what it is worth, I believe the problem lies in unifold/msa/utils.py:
(alphalink) Galvani [bhaddad@galvani ~/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB]$ tail -f mpi-err.34171
Yeah, I know what the issue is. I pushed a fix. Sorry for the inconvenience! Hope this resolves everything...
Indeed it appears to be working! Thank you very much!
So a new problem has cropped up that seems perhaps unrelated. I thought to post it here, but it may also be appropriate as its own issue. I am receiving an index error:
Hm, that would indicate a problem with the crosslinking data. The error says that the (mapped) crosslinking data overshoots the (combined) target length by more than 300 amino acids. How long are your chains? Is the crosslinking data well aligned with the FASTA sequences, i.e. no tags etc.? Because I just noticed it: your crosslinking pickle says A2B but your example above was A1B2. Is the chain mapping off by chance?
Indeed, this is a case of user error!
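A sanity check along these lines can catch misaligned crosslinks before running a prediction. The sketch below is an assumption-heavy illustration: it does not read AlphaLink2's crosslink pickle, and it assumes the links are already available as plain `(chain_i, pos_i, chain_j, pos_j)` tuples with 1-based residue positions. The function name `out_of_range_links` is hypothetical.

```python
def out_of_range_links(chain_lengths, links):
    """Return crosslinks that point at an unknown chain or outside a chain.

    chain_lengths: dict mapping chain ID to sequence length.
    links: iterable of (chain_i, pos_i, chain_j, pos_j) tuples with
           1-based positions (an assumed format, not AlphaLink2's own).
    """
    bad = []
    for chain_i, pos_i, chain_j, pos_j in links:
        for chain, pos in ((chain_i, pos_i), (chain_j, pos_j)):
            length = chain_lengths.get(chain)
            # Flag links whose chain is missing or whose residue index
            # falls outside 1..length for that chain.
            if length is None or not (1 <= pos <= length):
                bad.append((chain_i, pos_i, chain_j, pos_j))
                break
    return bad
```

An empty result means every link lands inside a known chain; a non-empty one points at leftover tags, an offset in numbering, or a chain mapping that does not match the FASTA file.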
Thank you for developing such promising software for the scientific community.
I tried running AlphaLink2 with several datasets, including the rpoa-rpoc dataset used in the bioRxiv manuscript, but keep encountering the error below related to the creation of the features.pkl.gz files. The file is either not created when running a monomeric prediction, or only one of the files is created when running a multimeric prediction. I tried several iterations with different file names, FASTA header names, crosslink file names, etc. (no underscore, no dash, only A B C D... as FASTA headers, etc.), but none of these changes solved the problem. I know that this issue has been raised and closed before, but the previous fix doesn't seem to solve the problem in this case. Is there a specific naming convention I should adhere to?