Learning multiple properties with partially missing input data #23
This is a good question and an interesting aspect. In general, consistency is key to ease learning in the model: the properties should always appear in the same order, their numerical values should ideally be on the same scale (we scaled to [0, 1] in most cases), and there should be no missing values.

So if you have missing values, the natural way is to simply leave that property out of the affected sequences. I assume this slightly deteriorates performance; my experiments pointed in that direction. But at inference time you can still provide the affected sequence and either predict the missing property or generate conditionally based on it. This flexibility is built in.

Alternatively, consider imputing the missing values before launching the finetuning. This makes the data more "consistent" because every sample then has the same number of properties.

For the property prediction task, managed by the PropertyCollator class (see collators.py), you can then control how many tokens are masked. In theory you can set this value to 0, so that the property is excluded from property prediction. But note that this works only "globally", i.e., for all samples of the dataset. Happy to consider a PR on this in GT4SD! Hope this helps for the beginning
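To make the "leave out the missing property" suggestion concrete, here is a minimal sketch of building property-prefixed training sequences. The token format (`<name>value|SMILES`), the function names, and the explicit min-max scaling are assumptions for illustration, not the repo's actual API:

```python
# Hypothetical sketch: build a property-prefixed sequence for finetuning,
# scaling values to [0, 1] and simply omitting missing properties.
def scale_to_unit(value, vmin, vmax):
    """Min-max scale a raw property value into [0, 1]."""
    return (value - vmin) / (vmax - vmin)

def build_sequence(smiles, properties, ranges):
    """Prefix a SMILES string with its available property tokens.

    properties: dict mapping property name -> raw value, or None if missing.
    ranges: dict mapping property name -> (min, max) used for scaling.
    Properties are emitted in a fixed (sorted) order so that all samples
    stay consistent; missing ones are left out of the prefix entirely.
    """
    tokens = []
    for name in sorted(properties):
        value = properties[name]
        if value is None:
            continue  # missing value: drop this property for this sample
        lo, hi = ranges[name]
        tokens.append(f"<{name}>{scale_to_unit(value, lo, hi):.3f}")
    return "|".join(tokens + [smiles])

# A sample with a missing "esol" value keeps only its "qed" token:
seq = build_sequence(
    "CCO",
    {"qed": 0.41, "esol": None},
    {"qed": (0.0, 1.0), "esol": (-10.0, 2.0)},
)
# seq == "<qed>0.410|CCO"
```

At inference time, the same dropped property could then be queried from the model (prediction) or supplied to steer generation, as described above.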
Hey Jannis, thanks for the quick reply. Would this also be a nice way to enable the Regression Transformer to perform property imputation? If we exclude missing properties from the loss during the regression task and sample properties for these missing tokens during conditional generation at a per-label level (like you suggested), then the regression head could be used to impute the missing labels while potentially benefiting from this multitask learning setup, right? I'm curious how the imputation accuracy would then compare to a straightforward regression model trained solely on the labeled subset of the data.
Yes, that would indeed be an interesting strategy! Keep in mind that if you enable the self/cycle consistency loss (
Hi @jannisborn, I'm curious whether you explored the setting where input samples may have missing properties, but we still want to learn a conditional generative model over all these properties? I have a dataset where this is the case and I'm interested in exploring opportunities to still leverage the regression transformer setup for training. Any thoughts on where to begin? Thanks!