
Learning multiple properties with partially missing input data #23

TheMatrixMaster opened this issue Jun 11, 2024 · 3 comments

@TheMatrixMaster

Hi @jannisborn, I'm curious whether you have explored the setting where input samples may have missing properties, but we still want to learn a conditional generative model over all of these properties. I have a dataset where this is the case, and I'm interested in exploring whether I can still leverage the regression transformer setup for training. Any thoughts on where to begin? Thanks!

@jannisborn
Member

This is a good question and an interesting aspect. In general, consistency is key to ease learning in the model: the properties should always appear in the same order, their numerical values should ideally be on the same scale (we scaled to [0, 1] in most cases), and there should be no missing values.
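For reference, that scaling is plain min-max normalization; a minimal NumPy sketch (not RT code; NaN marks a missing value):

```python
import numpy as np

def minmax_scale(column: np.ndarray) -> np.ndarray:
    """Scale one property column to [0, 1]; NaNs (missing values) pass through."""
    lo, hi = np.nanmin(column), np.nanmax(column)
    return (column - lo) / (hi - lo)

# Toy data: rows are samples, columns are properties; NaN = missing.
props = np.array([[0.85, np.nan],
                  [0.31, 120.4],
                  [0.57,  98.2]])
scaled = np.apply_along_axis(minmax_scale, 0, props)
```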

So if you have missing values, the natural way is to simply leave the affected property out of those sequences. I assume this slightly deteriorates performance; my experiments pointed in that direction. But at inference time, you can still provide the affected sequence and either predict the missing property or conditionally generate based on it. This flexibility is built in.
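Schematically, the sequence formats would look something like this (the exact property-token and mask spellings depend on the tokenizer, so treat this as illustrative rather than the literal file format):

```
<logp>0.44|<qed>0.87|CCOC(=O)c1ccccc1    # both properties known
<qed>0.91|CCN(CC)CC                      # logp missing: property simply dropped
<logp>[MASK]|<qed>0.91|CCN(CC)CC         # inference: mask logp to predict it
```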

Alternatively, consider imputing the missing values before launching the finetuning. This makes the data more "consistent" because every sample has the same number of properties. For the property prediction task, managed by the PropertyCollator class (see collators.py), you can then control how many tokens are masked per property. In theory you can set this value to 0, so that the property is excluded from property prediction. But note that this works only "globally", i.e., for all samples of the dataset. Happy to consider a PR on this in GT4SD!
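A rough sketch of what that could look like (the constructor arguments here are my assumption; verify the actual signature in collators.py):

```python
from terminator.collators import PropertyCollator
from terminator.tokenization import ExpressionBertTokenizer

tokenizer = ExpressionBertTokenizer.from_pretrained("path/to/tokenizer")

# One mask count per property. Setting an entry to 0 means no tokens of that
# property get masked, i.e., it drops out of the prediction objective.
# NOTE: the argument names below are assumptions; check collators.py.
collator = PropertyCollator(
    tokenizer=tokenizer,
    property_tokens=["<qed>", "<logp>"],
    num_tokens_to_mask=[2, 0],  # mask 2 value tokens of <qed>, none of <logp>
)
```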
For conditional generation, the training collator has a Boolean parameter "do_sample" which allows sampling property values for the generative task. If you impute lots of values for a property, consider toggling this on. Again, this can only be controlled at a global level.
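In effect, do_sample does something like the following (a toy illustration of the behavior, not the actual collator code):

```python
import random

def property_value_for_generation(true_value: float, do_sample: bool) -> float:
    """With do_sample=True, generation is conditioned on a sampled value
    rather than the stored (possibly imputed) one."""
    if do_sample:
        return random.uniform(0.0, 1.0)  # assuming properties scaled to [0, 1]
    return true_value
```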

Hope this helps to get you started.

@TheMatrixMaster
Author

Hey Jannis, thanks for the quick reply. Would this also be a nice way to enable the regression transformer to perform property imputation?

If we exclude missing properties from the loss during the regression task and sample properties for these missing tokens during conditional generation at a per-label level (as you suggested), then the regression head could be used to impute the missing labels while potentially benefiting from the multitask learning setup, right?

I'm curious how the imputation accuracy would then compare to a straightforward regression model trained solely on the labeled subset of the data.

@jannisborn
Member

Yes, that would indeed be an interesting strategy! Keep in mind that if you enable the self/cycle-consistency loss (the --cc_loss flag in the trainer), this happens naturally in the generative task: you sample a value for the missing property, which is then used to condition the generation. The generated molecule is passed to the model again, this time in the regression setting, and we measure the deviation between the predicted property value and the sampled one.
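In pseudocode, one such self-consistency step looks roughly like this (the two callables are placeholders for the model's generation and regression modes, not the real trainer API):

```python
import random

def cycle_consistency_step(generate, predict_property, prop_token, sequence):
    # 1. Sample a value for the (missing) property, e.g. uniformly in [0, 1].
    p_sampled = random.uniform(0.0, 1.0)

    # 2. Condition generation on the sampled value.
    molecule = generate(sequence, {prop_token: p_sampled})

    # 3. Pass the generated molecule back through the model in regression mode.
    p_predicted = predict_property(molecule, prop_token)

    # 4. The cycle-consistency loss penalizes the deviation between the two.
    return (p_predicted - p_sampled) ** 2
```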
