
Learning multiple properties with partially missing input data #23

TheMatrixMaster opened this issue Jun 11, 2024 · 3 comments

@TheMatrixMaster

Hi @jannisborn, I'm curious whether you have explored the setting where input samples may have missing properties, but we still want to learn a conditional generative model over all of these properties. I have a dataset where this is the case, and I'm interested in exploring whether I can still leverage the regression transformer setup for training. Any thoughts on where to begin? Thanks!

@jannisborn
Member

This is a good question and an interesting aspect. In general, consistency is key to ease learning in the model: the properties should always appear in the same order, their numerical values should ideally be on the same scale (we scaled to [0, 1] in most cases), and there should be no missing values.
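For reference, that scaling is plain min-max normalization; a minimal NumPy sketch (not RT code; NaN marks a missing value):

```python
import numpy as np

def minmax_scale(column: np.ndarray) -> np.ndarray:
    """Scale one property column to [0, 1]; NaNs (missing values) pass through."""
    lo, hi = np.nanmin(column), np.nanmax(column)
    return (column - lo) / (hi - lo)

# Toy data: rows are samples, columns are properties; NaN = missing.
props = np.array([[0.85, np.nan],
                  [0.31, 120.4],
                  [0.57,  98.2]])
scaled = np.apply_along_axis(minmax_scale, 0, props)
```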

So if you have missing values, the natural way is to simply leave the affected property out of those sequences. I assume this slightly deteriorates performance; my experiments pointed in that direction. But at inference time, you can still provide the affected sequence and either predict the missing property or conditionally generate based on it. This flexibility is built in.
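Schematically, the sequence formats would look something like this (the exact property-token and mask spellings depend on the tokenizer, so treat this as illustrative rather than the literal file format):

```
<logp>0.44|<qed>0.87|CCOC(=O)c1ccccc1    # both properties known
<qed>0.91|CCN(CC)CC                      # logp missing: property simply dropped
<logp>[MASK]|<qed>0.91|CCN(CC)CC         # inference: mask logp to predict it
```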

Alternatively, consider imputing the missing values before launching the finetuning. This makes the data more "consistent" because every sample has the same number of properties. For the property prediction task, managed by the PropertyCollator class (see collators.py), you can then control how many tokens are masked per property. In theory you can set this value to 0, so that the property is excluded from property prediction. But note that this works only "globally", i.e., for all samples of the dataset. Happy to consider a PR on this in GT4SD!
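A rough sketch of what that could look like (the constructor arguments here are my assumption; verify the actual signature in collators.py):

```python
from terminator.collators import PropertyCollator
from terminator.tokenization import ExpressionBertTokenizer

tokenizer = ExpressionBertTokenizer.from_pretrained("path/to/tokenizer")

# One mask count per property. Setting an entry to 0 means no tokens of that
# property get masked, i.e., it drops out of the prediction objective.
# NOTE: the argument names below are assumptions; check collators.py.
collator = PropertyCollator(
    tokenizer=tokenizer,
    property_tokens=["<qed>", "<logp>"],
    num_tokens_to_mask=[2, 0],  # mask 2 value tokens of <qed>, none of <logp>
)
```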
For conditional generation, the training collator has a Boolean parameter "do_sample" which allows sampling property values for the generative task. If you impute lots of values for a property, consider toggling this on. Again, this can only be controlled at a global level.
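In effect, do_sample does something like the following (a toy illustration of the behavior, not the actual collator code):

```python
import random

def property_value_for_generation(true_value: float, do_sample: bool) -> float:
    """With do_sample=True, generation is conditioned on a sampled value
    rather than the stored (possibly imputed) one."""
    if do_sample:
        return random.uniform(0.0, 1.0)  # assuming properties scaled to [0, 1]
    return true_value
```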

Hope this helps to get you started.

@TheMatrixMaster
Author

Hey Jannis, thanks for the quick reply. Would this also be a nice way to enable the regression transformer to perform property imputation?

If we exclude missing properties from the loss during the regression task and sample properties for these missing tokens during conditional generation at a per-label level (as you suggested), then the regression head could be used to impute the missing labels while potentially benefiting from the multitask learning setup, right?

I'm curious how the imputation accuracy would then compare to a straightforward regression model trained solely on the labeled subset of the data.

@jannisborn
Member

Yes, that would indeed be an interesting strategy! Keep in mind that if you enable the self/cycle-consistency loss (the --cc_loss flag in the trainer), this happens naturally in the generative task: you sample a value for the missing property, which is then used to condition the generation. The generated molecule is passed to the model again, this time in the regression setting, and we measure the deviation between the predicted property value and the sampled one.
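In pseudocode, one such self-consistency step looks roughly like this (the two callables are placeholders for the model's generation and regression modes, not the real trainer API):

```python
import random

def cycle_consistency_step(generate, predict_property, prop_token, sequence):
    # 1. Sample a value for the (missing) property, e.g. uniformly in [0, 1].
    p_sampled = random.uniform(0.0, 1.0)

    # 2. Condition generation on the sampled value.
    molecule = generate(sequence, {prop_token: p_sampled})

    # 3. Pass the generated molecule back through the model in regression mode.
    p_predicted = predict_property(molecule, prop_token)

    # 4. The cycle-consistency loss penalizes the deviation between the two.
    return (p_predicted - p_sampled) ** 2
```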
