Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDB loader throws errors - Note: fixed in release branch #3054

Open
butlerpd opened this issue Aug 17, 2024 · 6 comments
Open

PDB loader throws errors - Note: fixed in release branch #3054

butlerpd opened this issue Aug 17, 2024 · 6 comments
Labels
Critical High priority Discuss At The Call Issues to be discussed at the fortnightly call SasView 6.0.0 Required for 6.0.0 release

Comments

@butlerpd
Copy link
Member

Describe the bug
Loading a PDB in the GSC throws a bunch of read errors into the log file. It seems like the program may be expecting to find atoms named C, N, O, which are in the last column (though not even available in all PDB files) but in some cases at least it seems to be reading the identity column which has things like CA, CB, CG, H1 etc. (alpha carbon etc), and there are many variations of these included in the PDB. But it does not seem to be consistent. All PDB files read in seem to have a slew of

17:57:06 - ERROR: string index out of range. 

Others however also have a bunch of warning lines like:

17:57:06 - WARNING: Warning: set the sld of Cg to zero. 

In those cases there seems to be at least one traceback error after the read (when it now auto calculates the Rg) though that may be a function of the fact that both files tested had the hydrogen atoms included and in some case two. The second error is what showed up when only one error was generated, while this is the order in the second case. Other types of files may cause different errors/permutations:

17:57:06 - ERROR: Traceback (most recent call last): File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 560, in change_data_type File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 912, in update_gui File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 923, in update_Rg File "sas\sascalc\calculator\geni.py", line 494, in radius_of_gyration File "sas\sascalc\calculator\geni.py", line 557, in center_of_mass File "periodictable\core.py", line 270, in symbol ValueError: unknown element H1
17:57:06 - ERROR: Traceback (most recent call last): File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 812, in complete_loading File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 912, in update_gui File "sas\qtgui\Calculators\GenericScatteringCalculator.py", line 923, in update_Rg File "sas\sascalc\calculator\geni.py", line 494, in radius_of_gyration File "sas\sascalc\calculator\geni.py", line 557, in center_of_mass File "periodictable\core.py", line 270, in symbol ValueError: unknown element H1

Given the many WARNINGSaboutindex out of range` or setting the SLD of the various elements to zero then running anyway is rather concerning. In particular if the calculation is ignoring atoms and/or setting them their sld to zero would cause incorrect answers.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the Generic Scattering Calculator in the tools menu
  2. Load one of the attached PDB files (each produces a different variant as described above)
  3. Check the log explorer
  4. See above warnings and/or errors.

Expected behavior
The file should be read in correctly, OR throw an exception that warns the user that the PDB cannot be read (and do not load it)

SasView version (please complete the following information):

  • Version: 6.0.0b2

Operating system (please complete the following information):

  • OS: Windows 10

Additional context
Identified during NIST CNR summer school and thus being labelled 6.0.0. Whether the error existed in 5.0.6 has not yet been verified. However, if it is silently producing wrong answers I would argue fixing it now, if possible, would still be appropriate

FILES to REPRODUCE
NOTE: these should be renamed as *.pdb before using. Github does not allow that extension hence the change to .txt here

@butlerpd butlerpd added Critical High priority SasView 6.0.0 Required for 6.0.0 release labels Aug 17, 2024
@wpotrzebowski
Copy link
Contributor

#2968 seems to be related

@krzywon
Copy link
Contributor

krzywon commented Aug 19, 2024

The string index out of range errors are not new to v6.0.0, but the other errors are. If they are related, then we should fix them all for the release. Otherwise, the index error should wait for a point release.

@juliuskarliczek juliuskarliczek self-assigned this Aug 21, 2024
@lucas-wilkins
Copy link
Contributor

I fixed it before, just didn't get to the beta. @timsnow has confirmed.

@lucas-wilkins lucas-wilkins changed the title PDB loader throws errors PDB loader throws errors - Note: fixed in release branch Aug 21, 2024
@juliuskarliczek juliuskarliczek removed their assignment Aug 21, 2024
@butlerpd
Copy link
Member Author

Actually the fix only fixes the first error string index out of range and only for the first example file. However, the problem with the other files is that they do not conform (in a very subtle way) to the currently published PDB format v 3.3, which states that the element name is right justified. in the other two example files they are left justified.

Apparently, there is a lot of history of violent arguments about what the correct format should be (or what should be a valid PDB file). For our purposes, as long as we are "rolling our own" I suggest we stick to the recommended v3.3 standard. We should however do a lot better job of using the exception to not load the file and send a message to the log that the pdb does not match the v3.3 standard definition. This would remove the problem of later hanging when pressing compute while giving more useful (and less verbose) feedback to the user.

@lucas-wilkins lucas-wilkins reopened this Sep 5, 2024
@lucas-wilkins
Copy link
Contributor

@butlerpd Unfortunately, it's pretty complicated. I think whatever the solution, it's going to be hard to get it to work automatically, as the extent and manner of what is represented in a PDB can vary quite a lot. The most obvious thing being whether H is present or not. There are other things too though, like HOH entries, and the alpha, beta, gamma carbons etc.

@krzywon krzywon added the Discuss At The Call Issues to be discussed at the fortnightly call label Sep 19, 2024
@butlerpd
Copy link
Member Author

I agree. see #3055 (comment) Your previous fix @lucas-wilkins does make the reader read anything that conform properly to the current published standard.

However, as mentioned in that issue also (#3055 (comment)) I think we can easily prevent the hanging when reading "malformed" files.

For a later discussion would be I think: Do we just say we will only deal with files that strictly adhere to the standard and provide all the atoms of interest or do we try to use a python reader (e.g. the Bio.PDB module of the biopython.py package - BSD-3 license) to handle the reading more elegantly? Personally I think documenting ourselves as "strict" would be the way to go until someone who feels it is important to their science to be more forgiving decides to get involved and contribute?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Critical High priority Discuss At The Call Issues to be discussed at the fortnightly call SasView 6.0.0 Required for 6.0.0 release
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

5 participants