[FEATURE]: split by atom presence/absence #167

kevingreenman · 2024-01-06T01:19:01Z

Related to #145, but I think different enough to be its own request. It could be useful in some cases to split molecules based on the presence/absence of an atom or atoms (e.g. put everything that has a F in the test set, put everything that has a halogen in the test set, etc.).

In this case, the user should probably be required to input the atom or atoms that they want excluded into the test set (whereas with a scaffold split and maybe with the functional group split suggested in #145, no user input would be necessary).

Alternatively, #145 could be done where the user inputs a set of functional groups that they want in the test set. In this case, this PR would just be a special case of that where the "functional group" is just a single atom.

JacksonBurns · 2024-01-08T15:58:12Z

Hi @kevingreenman thanks for the suggestion! I like this idea, and it seems pretty straightforward to implement (i.e. just scan a smiles string for a character, don't even need a RDKit molecule). How do you envision this sampler interacting with the split size arguments (test size, validation size, training size). Would they be ignored?

kspieks · 2024-01-08T16:10:29Z

My opinion is that converting to an RDKit molecule is likely unnecessary for this splitter. I think scanning the SMILES string should be sufficient.

Regarding the split size arguments, I think there's many possible ideas here. I'll list some just to start brainstorming. Additional thoughts are always welcome :)

One idea could be to specify one atom type, identify all molecules with that atom, and then place half in val and half in test. Or perhaps the val and test size arguments could still be used to split these up with different sizes?

But what should the desired behavior be if multiple atom types are specified? Would one/some be restricted to val while the others be restricted to test? Or should all atom types be stratified across val and test?

kevingreenman · 2024-01-08T17:01:31Z

When I've done this type of splitting manually before, I haven't used any split sizes as inputs. But I could see a variety of ways this could be implemented.

kevingreenman added the enhancement New feature or request label Jan 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE]: split by atom presence/absence #167

[FEATURE]: split by atom presence/absence #167

kevingreenman commented Jan 6, 2024

JacksonBurns commented Jan 8, 2024

kspieks commented Jan 8, 2024

kevingreenman commented Jan 8, 2024

[FEATURE]: split by atom presence/absence #167

[FEATURE]: split by atom presence/absence #167

Comments

kevingreenman commented Jan 6, 2024

JacksonBurns commented Jan 8, 2024

kspieks commented Jan 8, 2024

kevingreenman commented Jan 8, 2024