Update extract_antibody_fv.py #1

Closed
wants to merge 1 commit into from


@mhlee0903 mhlee0903 commented Aug 25, 2023

This PR proposes a fix for a bug caused by fixed string indexing of the lines in pdb_string: the indices are slid (offset) when a line's serial_number is wider than expected.

  1. Why
    The truncate_chain method assumes that serial_number, the second field of each line, has at most 5 digits, and it slices the lines of pdb_string at fixed string indices.
    If serial_number has 6 or more digits (100000 or greater), those fixed indices no longer line up with the fields and an error occurs, as in the following case.
  • Example line where the bug occurs
    'ATOM 100000 HA ASP V 50 -84.500 -6.184-184.148 1.00 55.62 H '

  • Error message

ValueError                                Traceback (most recent call last)
Cell In[30], line 7
      5 if not h_chain == 'NA':
      6     ab_chain += h_chain
----> 7     h_chain_string = truncate_chain(pdb_string, h_chain, 112, 'H')

Cell In[3], line 28, in truncate_chain(pdb_string, chain, limit, chain_id)
     25 if not is_atom:
     26     continue
---> 28 residx = int(line[22:26])
     29 is_target_chain = line[21] == chain
     30 is_fv = residx <= limit

ValueError: invalid literal for int() with base 10: 'V  5'

The truncate_chain method was designed to capture the residue number '  50' from line[22:26], but it captured 'V  5' instead, because the 6-digit serial_number shifts every later field one character to the right.

  2. What
    A slideIdx variable slides (offsets) the line indices, so the fix needs no significant changes to the code.

  3. How
    slideIdx is set to the number of digits by which serial_number exceeds 5.
    (slideIdx is 0 if serial_number has 5 or fewer digits.)
    Wherever the code slices a line by string index, slideIdx is added to the index currently in use.

  4. Results
    With this change, 5,361 Fv regions are extracted as PDB files.

  5. Additionally, to parse the 5,361 Fv regions extracted above into a JSON file...
    solvent/tools/preprocess_multimer_datasets.py has to be run, and for that
    the _parse_coordinates method in Biopython also needs to be revised in the same way, by sliding the indices.
    ( The method's link is https://github.com/biopython/biopython/blob/d416809344f1e345fbabbdaca4dd6dcf441e53bd/Bio/PDB/PDBParser.py#L168-L320 )
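The index sliding described under "How" could look roughly like the sketch below. It is not the code from this PR: the helper names `slide_idx` and `chain_and_residx`, the `ATOM_FMT` format string, and the sample coordinates are hypothetical, and the sketch assumes that the writer right-justified the serial number starting at index 6, so a 6-digit value simply overflows into the next column.

```python
SERIAL_START, SERIAL_END = 6, 11  # a standard ATOM record keeps the serial in line[6:11]


def slide_idx(line: str) -> int:
    """Return how many columns an over-wide serial number shifts the rest of the line."""
    i = SERIAL_START
    # Skip the left padding of a right-justified serial number.
    while i < SERIAL_END and i < len(line) and line[i] == " ":
        i += 1
    # Consume the digits, which may run past the standard end of the field.
    while i < len(line) and line[i].isdigit():
        i += 1
    return max(0, i - SERIAL_END)


def chain_and_residx(line: str) -> tuple[str, int]:
    """Read the chain ID and residue number with the fixed indices slid by slide_idx."""
    shift = slide_idx(line)
    return line[21 + shift], int(line[22 + shift:26 + shift])


# Hypothetical sample lines built with a fixed-column format (assumption, not PR data).
ATOM_FMT = "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f"
normal_line = ATOM_FMT % (99999, "HA", "ASP", "V", 50, -84.5, -6.184, -184.148)
wide_line = ATOM_FMT % (100000, "HA", "ASP", "V", 50, -84.5, -6.184, -184.148)

print(repr(wide_line[22:26]))        # the unshifted slice that int() rejects
print(chain_and_residx(normal_line))
print(chain_and_residx(wide_line))
```

For a line with a serial number of 5 digits or fewer the offset is 0, so the sketch reads the same columns as the original fixed indices.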

Sliding index for serial_number>=100000
@jmlee4967
Contributor

Thanks for the PR!

I think the change you suggest could serve as a more general implementation for parsing PDB files.
However, a 6-digit serial_number is not part of the common PDB format.

Based on the PDB line definition, we (and other libraries such as Biopython) assume that serial_number has at most 5 digits.
In other words, each field (residue name, residue sequence number, ...) has a fixed position in the line.
[Image: pdb_format]

We recommend adjusting your data (e.g., by renumbering the atoms) so that it works with the library.
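One way to do such a renumbering is sketched below. This is only an illustration under stated assumptions, not a tool from this repository: the function name `renumber_serials` and the sample lines are hypothetical, over-wide serial numbers are assumed to overflow to the right from index 6, only ATOM and HETATM records are rewritten, and any records that refer to atom serial numbers (such as CONECT) are left untouched and would need separate handling.

```python
def renumber_serials(pdb_string: str) -> str:
    """Renumber ATOM/HETATM serials from 1 so each fits the 5-column field line[6:11]."""
    out = []
    next_serial = 1
    for line in pdb_string.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            if next_serial > 99999:
                raise ValueError("more than 99,999 atoms: serials cannot fit 5 columns")
            # Find where the existing serial ends; it may run past index 11.
            end = 6
            while end < 11 and end < len(line) and line[end] == " ":
                end += 1
            while end < len(line) and line[end].isdigit():
                end += 1
            end = max(end, 11)
            # Re-emit the record name, a 5-wide serial, and the untouched remainder.
            line = f"{line[:6]}{next_serial:>5}{line[end:]}"
            next_serial += 1
        out.append(line)
    return "\n".join(out)


# Hypothetical input: the second line has a 6-digit serial and is shifted by one column.
ATOM_FMT = "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f"
shifted = "\n".join([
    ATOM_FMT % (99999, "N", "ASP", "V", 50, -85.0, -6.0, -184.0),
    ATOM_FMT % (100000, "HA", "ASP", "V", 50, -84.5, -6.184, -184.148),
])
fixed_lines = renumber_serials(shifted).splitlines()
print([int(l[22:26]) for l in fixed_lines])  # residue numbers read at the fixed indices
```

After renumbering, every line has the same length and the chain ID and residue number sit at their usual positions, so fixed-index parsers can read the file unchanged.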

@mhlee0903
Author

mhlee0903 commented Aug 30, 2023

I appreciate your kind and prompt response.

I encountered this bug while running the ABDF data preprocessing code that you shared.

As you pointed out, I concur that deviating from the usual PDB format could have a negative impact on the code's future adaptability. Therefore, rectifying the serial number appears to be a more suitable approach.
I intend to proceed with closing the pull request in light of your insight.

I appreciate the provision of high-quality code to facilitate research in AI for protein folding.
I eagerly anticipate the chance to contribute in the future.

@mhlee0903 mhlee0903 closed this Aug 30, 2023