Update extract_antibody_fv.py #1

mhlee0903 · 2023-08-25T10:48:11Z

I would like to propose a fix for a bug caused by fixed string indexing: the patch slides the indices used to slice each line of pdb_string.

Why
The truncate_chain method assumes that serial_number, the second field of each line, is at most 5 digits wide, and it slices pdb_string at fixed string indices.
However, if serial_number has 6 or more digits (100000 or greater), every later field shifts to the right and an error occurs, as in the following case.

Example line where the bug occurs
'ATOM  100000  HA  ASP V  50     -84.500  -6.184-184.148  1.00 55.62           H  '
Error message

ValueError                                Traceback (most recent call last)
Cell In[30], line 7
      5 if not h_chain == 'NA':
      6     ab_chain += h_chain
----> 7     h_chain_string = truncate_chain(pdb_string, h_chain, 112, 'H')

Cell In[3], line 28, in truncate_chain(pdb_string, chain, limit, chain_id)
     25 if not is_atom:
     26     continue
---> 28 residx = int(line[22:26])
     29 is_target_chain = line[21] == chain
     30 is_fv = residx <= limit

ValueError: invalid literal for int() with base 10: 'V  5'

The truncate_chain method was designed to capture '  50' from line[22:26], but it captured 'V  5' instead, because the 6-digit serial_number shifts the rest of the line one character to the right.
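The mismatch can be reproduced with a short sketch. The helper make_atom_line is hypothetical (it is not part of the repository) and builds only the leading columns of an ATOM record. It assumes the writer emits the serial number directly after the 6-character record name, so a 6-digit serial pushes every later column one position to the right.

```python
def make_atom_line(serial: int) -> str:
    """Hypothetical helper: build the leading columns of an ATOM record."""
    return (
        "ATOM  "            # record name, indices 0-5
        + f"{serial:>5d}"   # serial number, indices 6-10 (overflows at 6 digits)
        + " "               # blank column
        + " HA "            # atom name
        + " "               # alternate location indicator
        + "ASP"             # residue name
        + " "               # blank column
        + "V"               # chain ID, index 21 in a well-formed line
        + f"{50:>4d}"       # residue number, indices 22-25 in a well-formed line
    )


ok_line = make_atom_line(99999)
shifted_line = make_atom_line(100000)

print(repr(ok_line[22:26]))       # '  50'
print(repr(shifted_line[22:26]))  # 'V  5'

try:
    int(shifted_line[22:26])
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: 'V  5'
```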

What
A slideIdx variable shifts the line indices, so the fix requires no significant changes to the code.
How
slideIdx is set to the number of digits by which serial_number exceeds 5.
(slideIdx is 0 if serial_number has 5 or fewer digits.)
Wherever the code slices a line by string index, slideIdx is added to the existing index.
Results
With this change, 5,361 Fv regions are extracted as PDB files.
To then parse the 5,361 extracted Fv regions into a JSON file
by running solvent/tools/preprocess_multimer_datasets.py,
the _parse_coordinates method in Biopython must also be revised in the same way, by sliding the index.
(The method is at https://github.com/biopython/biopython/blob/d416809344f1e345fbabbdaca4dd6dcf441e53bd/Bio/PDB/PDBParser.py#L168-L320 )

Sliding index for serial_number>=100000

jmlee4967 · 2023-08-30T00:52:18Z

Thanks for the PR!

I think the approach in this PR could serve as a more general implementation for parsing PDB files.
However, a 6-digit serial_number is not part of the common PDB format.

Based on the PDB record definition, we (and other libraries such as Biopython) assume that serial_number is at most 5 digits wide.
In other words, each field (residue name, residue sequence number, ...) occupies a fixed position in the line.
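For reference, the fixed positions can be written as Python slices. The column ranges below follow the ATOM record of the wwPDB format (converted to 0-based, end-exclusive indices); the names ATOM_FIELDS and parse_atom_line are hypothetical and only illustrate the layout.

```python
# 0-based slices for the leading fields of an ATOM record.
ATOM_FIELDS = {
    "record":  slice(0, 6),    # columns 1-6
    "serial":  slice(6, 11),   # columns 7-11 (5 characters wide)
    "name":    slice(12, 16),  # columns 13-16
    "altLoc":  slice(16, 17),  # column 17
    "resName": slice(17, 20),  # columns 18-20
    "chainID": slice(21, 22),  # column 22
    "resSeq":  slice(22, 26),  # columns 23-26
    "iCode":   slice(26, 27),  # column 27
    "x":       slice(30, 38),  # columns 31-38
    "y":       slice(38, 46),  # columns 39-46
    "z":       slice(46, 54),  # columns 47-54
}


def parse_atom_line(line: str) -> dict[str, str]:
    """Cut a well-formed ATOM line into its fields by position."""
    return {name: line[cols].strip() for name, cols in ATOM_FIELDS.items()}


# A well-formed line assembled field by field (illustrative values).
line = (
    "ATOM  " + f"{1:>5d}" + " " + " HA " + " " + "ASP" + " " + "V"
    + f"{50:>4d}" + " " + "   "
    + f"{-84.5:8.3f}{-6.184:8.3f}{-184.148:8.3f}"
)
fields = parse_atom_line(line)
print(fields["chainID"], fields["resSeq"], fields["z"])  # V 50 -184.148
```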

We recommend adjusting your data (e.g., by renumbering the serial numbers) so that it works with the library.
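One possible way to do such a renumbering is sketched below; it is not a tool provided by this repository. The function renumber_serials and the helper _demo_line are hypothetical. The sketch keeps only ATOM/HETATM records of the requested chains, drops every other record type (TER, CONECT, and so on), and assumes that an over-wide serial number has shifted the rest of its line to the right.

```python
def renumber_serials(pdb_string: str, chains: set[str]) -> str:
    """Keep atoms of `chains` and renumber their serials from 1."""
    out = []
    serial = 0
    for line in pdb_string.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue  # other record types are dropped in this sketch
        # Columns by which an over-wide serial shifted the rest of the line.
        shift = max(0, len(line[6:].split()[0]) - 5)
        if line[21 + shift] not in chains:
            continue
        serial += 1
        if serial > 99999:
            raise ValueError("more than 99,999 atoms remain; serials cannot fit")
        # Rewrite the serial in 5 columns and restore the standard layout.
        out.append(line[:6] + f"{serial:>5d}" + line[11 + shift:])
    return "\n".join(out)


def _demo_line(serial: int, chain: str) -> str:
    # Hypothetical test data: leading columns of an ATOM record only.
    return "ATOM  " + f"{serial:>5d}" + " " + " HA " + " " + "ASP" + " " + chain + f"{50:>4d}"


pdb_string = "\n".join(
    [_demo_line(99999, "A"), _demo_line(100000, "V"), _demo_line(100001, "V")]
)
fixed = renumber_serials(pdb_string, chains={"V"})
print(fixed)
```

After this step, chain V occupies index 21 and the residue number occupies indices 22-25 again, so fixed-column readers can parse the output.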

mhlee0903 · 2023-08-30T03:28:44Z

I appreciate your kind and prompt response.

I encountered this bug while executing ABDF's data preprocessing code, which you shared in this context.

As you pointed out, I concur that deviating from the usual PDB format could have a negative impact on the code's future adaptability. Therefore, rectifying the serial number appears to be a more suitable approach.
I intend to proceed with closing the pull request in light of your insight.

I appreciate the provision of high-quality code to facilitate research in AI for protein folding.
I eagerly anticipate the chance to contribute in the future.

Update extract_antibody_fv.py

2045c83

Sliding index for serial_number>=100000

mhlee0903 closed this Aug 30, 2023
