Related to #145, but I think different enough to be its own request. It could be useful in some cases to split molecules based on the presence/absence of an atom or atoms (e.g. put everything that has an F in the test set, put everything that has a halogen in the test set, etc.).
In this case, the user should probably be required to input the atom or atoms that they want held out in the test set (whereas with a scaffold split, and maybe with the functional group split suggested in #145, no user input would be necessary).
Alternatively, #145 could be done so that the user inputs a set of functional groups that they want in the test set. In that case, this request would just be a special case of #145 where the "functional group" is a single atom.
Hi @kevingreenman, thanks for the suggestion! I like this idea, and it seems pretty straightforward to implement (i.e. just scan a SMILES string for a character; you don't even need an RDKit molecule). How do you envision this sampler interacting with the split size arguments (test size, validation size, training size)? Would they be ignored?
My opinion is that converting to an RDKit molecule is likely unnecessary for this splitter. I think scanning the SMILES string should be sufficient.
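As a rough illustration of the SMILES-scanning idea, here is a minimal sketch that avoids RDKit. One caveat with a plain substring check is that "C" also matches "Cl" and "F" also matches "Fe", so this sketch tokenizes atoms with a regular expression first. The names `elements_in_smiles` and `has_any_atom` are hypothetical and not part of any existing API, and the sketch assumes valid SMILES input; hydrogens written inside brackets (e.g. `[CH4]`) are not counted.

```python
import re

# Bracket atoms first, then two-letter halogens, then one-letter organic-subset
# atoms (uppercase aliphatic, lowercase aromatic). Order matters: "Cl" must be
# tried before "C", and "Br" before "B".
_ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

# Inside a bracket atom: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def elements_in_smiles(smiles):
    """Return the set of element symbols that appear in a SMILES string."""
    elements = set()
    for token in _ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_SYMBOL.match(token)
            if match:  # wildcard atoms such as [*] are skipped
                elements.add(match.group(1).capitalize())
        else:
            # Aromatic atoms are lowercase in SMILES; normalize to the element symbol.
            elements.add(token.capitalize())
    return elements


def has_any_atom(smiles, target_atoms):
    """True if the molecule contains at least one of the requested elements."""
    return not elements_in_smiles(smiles).isdisjoint(target_atoms)
```

For example, `has_any_atom(smi, {"F", "Cl", "Br", "I"})` would flag halogen-containing molecules for the held-out set.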
Regarding the split size arguments, I think there are many possible ideas here. I'll list some just to start brainstorming. Additional thoughts are always welcome :)
One idea could be to specify one atom type, identify all molecules with that atom, and then place half in val and half in test. Or perhaps the val and test size arguments could still be used to split these up with different sizes?
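To make this first idea concrete, here is a minimal sketch of one possible behavior, not a settled design. It assumes a per-molecule boolean flag (does the molecule contain the chosen atom?) has already been computed. The function name `split_by_flag` and the `val_fraction` and `seed` parameters are hypothetical; `val_fraction=0.5` gives the half/half case, and other values stand in for reusing the val/test size arguments.

```python
import random


def split_by_flag(has_target, val_fraction=0.5, seed=0):
    """Split molecule indices into (train, val, test).

    Molecules without the target atom go to train. Molecules with it are
    shuffled and divided between val and test according to val_fraction.
    """
    train = [i for i, flag in enumerate(has_target) if not flag]
    held_out = [i for i, flag in enumerate(has_target) if flag]

    # Seeded shuffle so the val/test assignment is reproducible.
    random.Random(seed).shuffle(held_out)

    # Number of held-out molecules assigned to val, rounded to the nearest integer.
    n_val = round(len(held_out) * val_fraction)
    return train, held_out[:n_val], held_out[n_val:]
```

Note that in this sketch the training size is whatever remains after the flagged molecules are removed, so a training size argument would have no effect.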
But what should the desired behavior be if multiple atom types are specified? Would one or some be restricted to val while the others are restricted to test? Or should all atom types be stratified across val and test?
When I've done this type of splitting manually before, I haven't used any split sizes as inputs. But I could see a variety of ways this could be implemented.