Fixed bug for writer initialized by Chem.SDWriter(...). (#9929)
Without the `writer.close()` call, the file written by `writer` is not
closed properly, so buffered output is never flushed to disk. As a
result, in our test the end of the file
`/workspace/data/MoleculeGPT/raw/molecules.sdf` is missing. This is what
the end of the file looks like:
```
472184
     RDKit          2D

  1  0  0  0  0  0  0  0  0  0999 V2000
    2.0000    0.0000    0.0000 Os  0  0  0  0  0 15  0  0  0  0  0  0
M  CHG  1   1   4
M  END
>  <PUBCHEM_COMPOUND_CID>  (4303)
472184

>  <PUBCHEM_COMPOUND_CANONICALIZED>  (4303)
1

>  <PUBCHEM_CACTVS_COMPLEXITY>  (4303)
0

>  <PUBCHEM_CACTVS_HBOND_ACCEPTOR>  (4303)
0

>  <PUBCHEM_CACTVS_HBOND_DONOR>  (4303)
0

>  <PUBCHEM_CACTVS_ROTATABLE_BOND>  (4303)
0

>  <PUBCHEM_CACTVS_SUBSKEYS>  (4303)
AAADcQAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==

>  <PUBCHEM_IUPAC_OPENEYE_NAME
```
The file stops in the middle of the `<PUBCHEM_IUPAC_OPENEYE_NAME` tag,
before the closing `>` character, so the record for the last molecule
(#4303) is incomplete and its remaining properties are never written. As
a result, we get a crash later when running the test:
```
Traceback (most recent call last):
  File "/workspace/examples/llm/molecule_gpt.py", line 187, in <module>
    train(
  File "/workspace/examples/llm/molecule_gpt.py", line 69, in train
    dataset = MoleculeGPTDataset(path)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/molecule_gpt_dataset.py", line 217, in __init__
    super().__init__(root, transform, pre_transform, pre_filter,
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/in_memory_dataset.py", line 81, in __init__
    super().__init__(root, transform, pre_transform, pre_filter, log,
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/dataset.py", line 115, in __init__
    self._process()
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/dataset.py", line 262, in _process
    self.process()
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/molecule_gpt_dataset.py", line 436, in process
    CAN_SMILES = mol.GetProp("PUBCHEM_OPENEYE_CAN_SMILES")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'PUBCHEM_OPENEYE_CAN_SMILES'
```
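A truncated file like the one above can be detected before parsing. The following is a minimal sketch, not part of this commit: it relies only on the SDF convention that every record ends with a `$$$$` delimiter line. The helper name `sdf_ends_cleanly` and the sample file contents are hypothetical.

```python
import os
import tempfile


def sdf_ends_cleanly(path: str) -> bool:
    """Return True if the last non-blank line is the SDF record delimiter."""
    last = ""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:
                last = stripped
    return last == "$$$$"


# Illustrative data only: one complete record tail and one truncated tail.
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.sdf")
    truncated = os.path.join(tmp, "truncated.sdf")

    with open(complete, "w", encoding="utf-8") as f:
        f.write("M  END\n>  <PUBCHEM_COMPOUND_CID>  (1)\n472184\n\n$$$$\n")
    with open(truncated, "w", encoding="utf-8") as f:
        f.write("M  END\n>  <PUBCHEM_IUPAC_OPENEYE_NAME")

    complete_ok = sdf_ends_cleanly(complete)
    truncated_ok = sdf_ends_cleanly(truncated)

print(complete_ok, truncated_ok)
```

A check like this only catches files cut off mid-record. It would not notice a file that is missing whole records but happens to end on a delimiter.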
drivanov authored Jan 9, 2025
1 parent 8a651b3 commit ef02854
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions torch_geometric/datasets/molecule_gpt_dataset.py
@@ -371,6 +371,7 @@ def extract_one_SDF_file(block_id: int) -> None:
writer.write(mol)
valid_mol_count += 1

+writer.close()
print(f"block id: {block_id}\nfound {valid_mol_count}\n\n")
sys.stdout.flush()
return
@@ -410,6 +411,7 @@ def extract_one_SDF_file(block_id: int) -> None:
print(f"block id: {block_id} with 0 valid SDF file")
continue

+writer.close()
print(f"In total: {len(found_CID_set)} molecules")

# Step 05. Convert to PyG data format
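
The change above closes each writer explicitly. As a hedged alternative sketch (not what this commit does), the close can be tied to a `with` block so that it also runs when a write raises. The class `BufferedRecordWriter` below is a hypothetical stand-in with the same `write`/`close` shape as the writer in the diff. It is not RDKit's `Chem.SDWriter`, and the record strings are placeholders.

```python
import os
import tempfile
from contextlib import closing


class BufferedRecordWriter:
    """Hypothetical stand-in for an SDF writer with write() and close()."""

    def __init__(self, path: str) -> None:
        # newline="\n" keeps the output identical across platforms.
        self._handle = open(path, "w", encoding="utf-8", newline="\n")

    def write(self, record: str) -> None:
        # Each record is followed by the SDF record delimiter.
        self._handle.write(record + "\n$$$$\n")

    def close(self) -> None:
        self._handle.close()


def write_records(path: str, records: list[str]) -> None:
    # closing() calls writer.close() on exit, even if write() raises.
    with closing(BufferedRecordWriter(path)) as writer:
        for record in records:
            writer.write(record)


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "molecules.sdf")
    write_records(out_path, ["mol-1", "mol-2"])
    with open(out_path, encoding="utf-8", newline="") as handle:
        content = handle.read()

print(content.endswith("$$$$\n"))
```

`contextlib.closing` works with any object that exposes `close()`, so the same wrapper should apply to a real writer. Whether the RDKit writer can be used directly as a context manager depends on the installed version and is worth checking against its documentation.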