Commit
Without the `writer.close()` statement, the file written by `writer` will not be closed properly. As a result, in our test the end of the file `/workspace/data/MoleculeGPT/raw/molecules.sdf` is missing. This is what it looks like:

```
472184
RDKit 2D

1 0 0 0 0 0 0 0 0 0999 V2000
2.0000 0.0000 0.0000 Os 0 0 0 0 0 15 0 0 0 0 0 0
M CHG 1 1 4
M END
> <PUBCHEM_COMPOUND_CID> (4303)
472184
> <PUBCHEM_COMPOUND_CANONICALIZED> (4303)
1
> <PUBCHEM_CACTVS_COMPLEXITY> (4303)
0
> <PUBCHEM_CACTVS_HBOND_ACCEPTOR> (4303)
0
> <PUBCHEM_CACTVS_HBOND_DONOR> (4303)
0
> <PUBCHEM_CACTVS_ROTATABLE_BOND> (4303)
0
> <PUBCHEM_CACTVS_SUBSKEYS> (4303)
AAADcQAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
> <PUBCHEM_IUPAC_OPENEYE_NAME
```

Even the `<PUBCHEM_IUPAC_OPENEYE_NAME` tag does not end correctly with the `>` character, and the last molecule (#4303) is missing. As a result, we get a crash later when running the test:

```
Traceback (most recent call last):
  File "/workspace/examples/llm/molecule_gpt.py", line 187, in <module>
    train(
  File "/workspace/examples/llm/molecule_gpt.py", line 69, in train
    dataset = MoleculeGPTDataset(path)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/molecule_gpt_dataset.py", line 217, in __init__
    super().__init__(root, transform, pre_transform, pre_filter,
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/in_memory_dataset.py", line 81, in __init__
    super().__init__(root, transform, pre_transform, pre_filter, log,
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/dataset.py", line 115, in __init__
    self._process()
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/dataset.py", line 262, in _process
    self.process()
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/datasets/molecule_gpt_dataset.py", line 436, in process
    CAN_SMILES = mol.GetProp("PUBCHEM_OPENEYE_CAN_SMILES")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'PUBCHEM_OPENEYE_CAN_SMILES'
```
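The truncation described above is consistent with ordinary output buffering: data that has been written but not yet flushed never reaches the disk if the writer is not closed. The sketch below illustrates this with a plain Python file object standing in for the real `writer`; the helper name `bytes_on_disk_before_and_after_close` and the sample records are hypothetical and are not taken from `molecule_gpt_dataset.py`.

```python
import os
import tempfile


def bytes_on_disk_before_and_after_close(records):
    """Write records through a buffered file and report the on-disk size
    just before and just after close() is called."""
    fd, path = tempfile.mkstemp(suffix=".sdf")
    os.close(fd)
    try:
        # Plain file object used as a stand-in for the real writer (assumption).
        writer = open(path, "w", newline="")
        try:
            for record in records:
                writer.write(record)
            # Small writes may still be held in the in-memory buffer here.
            size_before = os.path.getsize(path)
        finally:
            # close() flushes the buffer, even if a write above raised.
            writer.close()
        size_after = os.path.getsize(path)
        return size_before, size_after
    finally:
        os.remove(path)


# Hypothetical sample data shaped like the tail of an SDF record.
records = ["> <PUBCHEM_COMPOUND_CID>\n472184\n\n", "$$$$\n"]
before, after = bytes_on_disk_before_and_after_close(records)
print(f"on disk before close: {before} bytes, after close: {after} bytes")
```

Wrapping the writes in `try`/`finally` (or a `with` block, where the writer supports it) is one way to make sure the close call runs even when processing fails midway.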