Research(?) : Alternative missing-value masks #278
Hello @athewsey, I think there is room for improvement for the
Hi @athewsey, I guess most of the questions I would ask are almost the same as @Optimox's. Indeed, I think the problem exists independently of the embeddings, and I don't see how we could ever replace the default numerical mask with something intrinsically more meaningful than 0 - so allowing custom values seems like an option. @Optimox, what do you mean by your last point? Is it switching the values of the entire column for the batch?
Thanks both for your insights! Very useful as I try to wrap my head around it all too. To @Optimox's first point, I think that's my bad: I used "embedding-aware attention" above to refer quite narrowly to a #217-like implementation (rather than the range of perhaps different ways you could think about doing that)... and also "embedding" to refer quite broadly to the general translation from training dataset […]

...So although adding "is missing" flag columns for scalars would still double […], I do hear & agree with the point about zero being intrinsically special as a "no contribution" value at many points in the network (especially e.g. when summing up the output contributions and at attention-weighted […]). I wonder if e.g. […]

The idea of swap-based noise rather than masking is also an interesting possibility - I wonder if there's a way it could be implemented that still works nicely & naturally on input datasets with missing values? I'm particularly interested in pre-training as a potential treatment for missing values, since somehow every dataset always seems to be at least a little bit garbage 😂

On the […]

...But encapsulating this in the PyTorch module itself would hopefully be more easily usable with […]. Of course I guess the above assumes the Obfuscator comes before the "embedding" layer and has no backprop-trainable parameters - I'd have to take a closer look at the new pretraining loss function stuff to understand that a bit more and follow your comments on that & the impact to the decoder!
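For concreteness, here is a minimal NumPy sketch of the flag-column treatment being discussed; the function name and the NaN convention are illustrative assumptions, not anything in the library:

```python
import numpy as np


def add_missing_flags(x: np.ndarray) -> np.ndarray:
    """Append one binary "is missing" column per feature, then fill the gaps.

    This is the treatment that doubles input dimensionality: the model sees
    both a neutral fill value and an explicit flag telling it the value was
    absent (or masked), so 0 no longer has to do double duty.
    """
    missing = np.isnan(x).astype(x.dtype)   # 1.0 where the value is missing
    x_filled = np.nan_to_num(x, nan=0.0)    # neutral fill for the original column
    return np.concatenate([x_filled, missing], axis=1)
```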
@eduardocarvp I was just thinking of randomly swapping some columns for each row with another random row (that would probably mean we need to lower the percentage of columns that we swap in order to be able to reconstruct the original input).

@athewsey I think I need to have a closer look at all the links you are referring to, I might have missed something. Re-reading the conversation, I think what you are looking for is a new sort of continuous/categorical embedding. Current embeddings take ints as inputs; those ints refer to the index of the embedding matrix row that will be used to pass through the graph. You could change the embedding module to allow non-finite values, which would go into a specific row of the matrix (this is for categorical features). For continuous features, I wonder if there is a way to do the same thing: if you have a non-finite value, pass it through a one-dimensional trainable embedding (allowing the model to learn how to represent non-finite values); if the value is finite, simply pass it through the network. Wouldn't something like this solve all the problems that you are pointing out? (I don't know if it's feasible but I think it should be - but I'm wondering why this does not exist already ^^)
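A rough sketch of that idea follows; the module names, the -1 encoding for missing categories, and the shapes are all assumptions made for illustration, not an existing API:

```python
import torch
import torch.nn as nn


class MissingAwareEmbedding(nn.Module):
    """Categorical embedding with a dedicated row for missing values.

    Assumes missing categories arrive encoded as -1 (e.g. after mapping
    NaN to -1 upstream); they are routed to the extra, last embedding row.
    """

    def __init__(self, num_categories: int, emb_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(num_categories + 1, emb_dim)  # +1 row for "missing"
        self.missing_index = num_categories

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        idx = torch.where(idx < 0, torch.full_like(idx, self.missing_index), idx)
        return self.embedding(idx)


class MissingAwareScalar(nn.Module):
    """Continuous feature with a single trainable value used wherever NaN appears."""

    def __init__(self):
        super().__init__()
        self.missing_value = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Finite values pass through untouched; NaNs become a learned scalar.
        return torch.where(torch.isnan(x), self.missing_value.expand_as(x), x)
```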
@eduardocarvp something like this (from https://www.kaggle.com/davidedwards1/tabularmarch21-dae-starter):
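A minimal NumPy sketch of swap noise in the spirit of that notebook (not the notebook's exact code; the function name and swap ratio are placeholders):

```python
import numpy as np


def swap_noise(x: np.ndarray, swap_ratio: float = 0.15) -> np.ndarray:
    """Corrupt a random subset of entries by replacing each one with the value
    from a random other row of the same column, instead of zero-masking."""
    x_noisy = x.copy()
    n_rows, n_cols = x.shape
    # Boolean mask of entries to corrupt
    swap_mask = np.random.rand(n_rows, n_cols) < swap_ratio
    # For every corrupted entry, pick a random donor row in the same column
    donor_rows = np.random.randint(0, n_rows, size=(n_rows, n_cols))
    x_noisy[swap_mask] = x[donor_rows[swap_mask], np.where(swap_mask)[1]]
    return x_noisy
```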
Feature request
The current `RandomObfuscator` implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0. But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token `THE` as your `[MASK]` for an English text model pre-training task.

I believe this pattern may be materially limiting accuracy/performance on datasets containing a large number of fields/instances where 0 (or proximity to 0) already has important significance - unless these datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc.).
What is the expected behavior?
I suggest two primary options:

1. Allow the mask value to be configured (e.g. a non-finite value like `nan`) instead of hard-coding 0.
2. Represent "masked" explicitly via extra flag columns, in combination with embedding-aware attention.
Embedding-aware attention should be a pre-requisite for (2), because otherwise the introduction of extra mask flag columns would add lots of extra parameters / double the input dimensionality... whereas if it's done in a model-aware way, results could be much better.
What is the motivation or use case for adding/changing the behavior?
I've lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields I haven't yet bothered to "fix" into proper TabNet categorical fields), and even after experimenting with a range of parameters I'm finding the model loves to converge to unsupervised losses of ~7.130 (it should really be <1.0 per the README, since 1.0 is equivalent to just always predicting the average value for each feature).
As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year, before pre-training was available, and found that in the supervised case I got better performance from adding a flag column than from simply selecting a different mask value (old draft code is here).
...So, from my background playing with this dataset, I'm super-suspicious that the poor pre-training losses I'm currently observing are being skewed by the model's inability to tell when binary fields are =0 vs masked... and I have seen some good performance from the flag-column treatment in past testing.
How should this be implemented in your opinion?
Change `RandomObfuscator` to use a non-finite value like `nan` as the mask value, and allow non-finite values in (both pre-training and fine-tuning) dataset inputs, so that consistent treatment can be applied to masked vs missing values and models can be successfully pre-trained or fine-tuned with arbitrary gaps in `X`.
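As an illustration, a minimal sketch of what a NaN-masking obfuscator could look like - the class name, the Bernoulli masking and the forward signature are assumptions modelled loosely on the current 0-masking behaviour, not a drop-in replacement for the library's `RandomObfuscator`:

```python
import torch


class NanObfuscator(torch.nn.Module):
    """Randomly obfuscate features for pre-training, filling with NaN instead of 0,
    so downstream layers can distinguish "masked/missing" from a genuine 0."""

    def __init__(self, pretraining_ratio: float):
        super().__init__()
        self.pretraining_ratio = pretraining_ratio

    def forward(self, x: torch.Tensor):
        # 1 where a feature is obfuscated, 0 where it is kept
        obfuscated_vars = torch.bernoulli(
            self.pretraining_ratio * torch.ones_like(x)
        )
        # Replace masked entries with NaN rather than multiplying by 0
        masked_x = torch.where(
            obfuscated_vars.bool(), torch.full_like(x, float("nan")), x
        )
        return masked_x, obfuscated_vars
```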
Are you willing to work on this yourself?
yes