Keras implementation? #14
I see two options, but I'm not knowledgeable enough with Keras to judge how easy they would be. First, you could define a GradientReversal layer in Keras - this PR attempted that in the Keras 1.0 API, but was never finished. I believe this is feasible and would be best in terms of usability by other Keras users. Second, I believe that Keras 2.0 and TF can interoperate relatively seamlessly, so you could use Keras to define most of your model and use TF for the gradient reversal portion. Here is a simple example of mixing the two. Again, I haven't used this myself, but it seems like a feasible option. |
Adding to pumpikano's answer, you can implement the gradient reversal layer as 'maximizing' the objective of interest rather than minimizing it. This can be implemented as minimizing the negative of the cost function in TensorFlow, and could be very easy with Keras as well. So your objective may look like this: min (object_cost - domain_cost), where you favor good object recognition (object_cost) and hurt your model's ability to differentiate images from two different domains (domain_cost). This leaves no need to explicitly implement a Gradient Reversal Layer, as it achieves the same objective after all. |
@pumpikano Thanks, that looks feasible - I'll take a look and keep you updated. @kilickaya That's true! Should I then basically just define that as a custom loss function for the domain adaptation part? Would that produce the same output as the paper? |
In the paper, the Gradient Reversal Layer is used to move in the reverse direction of the gradients. This can be achieved in two ways: you can either reverse the sign of the gradients and minimize the same objective, or you can reverse the loss function, as I mentioned above. The implementations differ slightly, but they serve the same purpose (they optimize the same objective). Either the authors did not realize while writing the paper that this idea is that simple to implement, or they wanted to make it look more complex than necessary. In their later manuscript, they also state that this can be implemented by maximization. I don't know about Keras, but you can implement it in TensorFlow like this (assuming you use the Adam optimizer): optimizer = tf.train.AdamOptimizer(learning_rate=learningrate).minimize(object_cost - domain_cost, var_list=vars) |
Just to clarify, you will need two alternating optimizations with this approach: min_D(domain_cost) where D is the domain prediction subnetwork, P is the class prediction subnetwork, and min_X means "minimize w.r.t. function X". This is a classic GAN setup. It is worth noting that it also gives you the freedom to let the domain cost differ between the two optimizations. For instance, many GAN implementations actually minimize the discriminator objective with flipped labels rather than maximizing it with correct labels. This amounts to a different objective and usually works better in practice in GANs. |
Could I just define a custom loss such as:
? @pumpikano what exactly do you mean by 'alternating'? I would just use one cost function in the label branch and the other one in the domain subbranch, correct? |
I just mean that usually you will take a step with one loss function and then a step with the other for each batch, usually updating the discriminator first. |
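To make the "one step with each loss per batch" idea concrete, here is a minimal pure-Python sketch on a toy scalar model. Everything in it is an assumption for illustration: the weights `f`, `p`, `d` (feature, label predictor, domain predictor), the squared-error costs, and the choice to update the feature weight in the predictor step. It is not the code from the repository.

```python
# Toy model: feature = w_f * x, label output = w_p * feature, domain output = w_d * feature.
# Squared error stands in for both costs so the gradients stay easy to read.

def domain_cost_grads(w, x, d):
    feature = w["f"] * x
    err = w["d"] * feature - d
    # Gradients of err**2 w.r.t. the domain weight and the feature weight.
    return {"d": 2 * err * feature, "f": 2 * err * w["d"] * x}

def object_cost_grads(w, x, y):
    feature = w["f"] * x
    err = w["p"] * feature - y
    # Gradients of err**2 w.r.t. the label weight and the feature weight.
    return {"p": 2 * err * feature, "f": 2 * err * w["p"] * x}

def discriminator_step(w, x, d, lr):
    """min_D(domain_cost): only the domain weight moves."""
    g = domain_cost_grads(w, x, d)
    new = dict(w)
    new["d"] = w["d"] - lr * g["d"]
    return new

def predictor_step(w, x, y, d, lr):
    """Minimize object_cost - domain_cost w.r.t. the label and feature weights."""
    g_obj = object_cost_grads(w, x, y)
    g_dom = domain_cost_grads(w, x, d)
    new = dict(w)
    new["p"] = w["p"] - lr * g_obj["p"]
    new["f"] = w["f"] - lr * (g_obj["f"] - g_dom["f"])
    return new

w = {"f": 1.0, "p": 0.5, "d": 0.5}
batches = [(1.0, 1.0, 1.0), (2.0, 0.0, 0.0)]  # (x, label, domain)
for x, y, d in batches:
    w = discriminator_step(w, x, d, lr=0.1)   # discriminator first
    w = predictor_step(w, x, y, d, lr=0.1)    # then the other loss
```

In a real setup each step would be an optimizer call restricted to the relevant variables, but the ordering per batch is the same.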
@pumpikano So I managed to implement the layer for the TF backend expanding on the link you provided, you can find it here: https://github.com/michetonu/gradient_inversion_keras_tf/blob/master/flipGradientTF.py Regarding the loss function steps, Keras works with multi-output models, and I think the loss functions are just additive. Correct me if I'm wrong, but I think I can just create one model (it seems to work). The way I'm doing it is I'm feeding both distributions as input, but setting the target samples' loss weights to zero in the classifier, so that they don't get considered in the backprop update, while removing the need to alternate inputs. I'll happily share my code once I'm sure it works properly :) |
Cool! Yeah that seems like a reasonable approach. |
@michetonu Did you get your code working properly? I would love a Keras DANN implementation example (especially how you set the target samples' loss weights to zero in the classifier). |
@Wojova yes! Here is the gradient inversion layer: https://github.com/michetonu/gradient_reversal_keras_tf For the sample weights you just need to create your input batches by alternating samples from the source and target domain, and pass a sample_weights array of 1s and 0s accordingly. |
Thanks! |
Responding to some of the previous comments on here, I don't think that simply reversing the loss function (on the domain part of the network) is a replacement for the gradient reversing layer. The reason is that simply maximising the domain loss won't necessarily lead to domain-invariant features in the shared feature layer, as the weights of the domain-classifier layer will be able to give a bad loss even if the shared feature layer isn't domain invariant. |
@erlendd if you read the original paper by Ganin et al. (and the few other papers implementing the same domain adversarial training approach), that's exactly what the gradient inversion layer does. It does nothing on the forward pass, and it multiplies the gradient by a negative constant during backprop. |
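As a framework-free illustration of that behaviour, here is a small sketch of a layer with explicit `forward` and `backward` methods. The class and method names are made up for this example, and `hp_lambda` is borrowed from the parameter name mentioned later in this thread; a real Keras/TensorFlow layer would hook into the framework's autodiff instead.

```python
class GradientReversal:
    """Identity on the forward pass, scaled sign flip on the backward pass."""

    def __init__(self, hp_lambda=1.0):
        # Assumed meaning: scale applied to the reversed gradient.
        self.hp_lambda = hp_lambda

    def forward(self, x):
        # The activations pass through unchanged.
        return list(x)

    def backward(self, upstream_grad):
        # The incoming gradient is multiplied by a negative constant.
        return [-self.hp_lambda * g for g in upstream_grad]


grl = GradientReversal(hp_lambda=2.0)
activations = grl.forward([1.0, -2.0])
reversed_grad = grl.backward([0.5, -1.0])
```

Everything upstream of the layer therefore receives the negated domain gradient, while the domain classifier downstream of it is trained as usual.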
@michetonu yeah I know, I was replying to @kilickaya's comment above. Am I correct in saying that the implementation of DANN here is using the target labels during training? I.e. it isn't unsupervised? |
@erlendd it is unsupervised if you set the target's loss_weights to 0. That way they will still contribute to the accuracy (or whatever measure you choose) shown by Keras, but they won't contribute to the loss of the first classifier, which is what you want. Basically the labels are there but don't matter. You can actually check that this works by randomising the target's labels during training and getting the same performance (using the real labels at test time). |
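The "labels are there but don't matter" claim can be checked on paper with a hand-written weighted loss. This is only a sketch: `weighted_bce` is a hypothetical helper, and its normalisation (mean over the batch) is an assumption that may differ from what Keras does internally.

```python
import math

def weighted_bce(y_true, y_pred, sample_weight):
    """Binary cross-entropy with per-sample weights, averaged over the batch."""
    total = 0.0
    for t, p, w in zip(y_true, y_pred, sample_weight):
        total += w * -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# First two samples are source (weight 1), last two are target (weight 0).
y_pred = [0.9, 0.2, 0.7, 0.4]
weights = [1.0, 1.0, 0.0, 0.0]
real_labels = [1, 0, 1, 0]
scrambled_labels = [1, 0, 0, 1]  # target labels flipped, source labels untouched

loss_real = weighted_bce(real_labels, y_pred, weights)
loss_scrambled = weighted_bce(scrambled_labels, y_pred, weights)
```

Both losses come out identical, so the gradient of the label classifier cannot depend on the target labels.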
@michetonu sorry I was referring to the TF code here: https://github.com/pumpikano/tf-dann/blob/master/Blobs-DANN.ipynb. I don't see a loss_weights variable. |
@erlendd ah sorry, the trick is how they create the batches and run the network. They alternate the branches rather than doing forward and back prop simultaneously and summing the losses. On the top branch only samples from the source domain are used for training, while the bottom branch receives both. The code is not super clear at first sight (it took me a while as well), but that's how their generator works. |
Thanks, I get how it works now, but isn't there an issue with the batches not being fully shuffled? For example the first half of each batch is always the "source" set and the second half is the "target" set. I think the only way to get around this would be to use a Boolean mask, but I'm unsure. |
@erlendd just shuffle the domains separately before you create the batches! It shouldn't matter if they are split half-half once they go through the net, right? But in any case, you could even shuffle the batches one by one. |
@michetonu well the domains are already shuffled inside the batch generator code, so that's not an issue. The issue I think is that the training batch is constructed such that the first half is from the source domain and the second half is from the target domain, so with respect to the domain classifier the data has not been shuffled. Probably it won't matter if you use a smaller batch size, but if you used a much larger batch size there could be an issue, I guess. Also I don't think it's possible to simply shuffle the training batches, as the tensorflow model here assumes that the first half of the batch is from the source domain and the second half is from the target domain. That's why I was suggesting that it could possibly be fixed using a mask in place of tf.cond in the code. |
@erlendd you are right that they are already shuffled! I wasn't looking at the code. It does not matter whether one half is always first - the model does not "remember" the labels of previous samples, so the order of training samples in each batch doesn't really matter IMHO. |
@michetonu I could be wrong, but I believe it does matter if the first half of samples always come from one distribution and the second half from another distribution - otherwise why would there be any necessity to shuffle in a simpler neural network model? I'm currently re-writing using a boolean mask so will check if this makes any sort of a difference. |
@erlendd my understanding was that shuffling prevents having the same train-test split every time, but I might be wrong! |
@michetonu you're right - order of training samples within a batch doesn't matter. |
The reason that the order of examples within a batch does not matter is that the loss of the batch is the sum (really, a normalized sum, i.e. mean) of the losses of the examples. The gradient of a sum is the sum of the gradients of the terms, and addition is commutative, so order doesn't matter. Of course, the batches need to be fair samples in order for the gradient of each minibatch loss to approximate the gradient of the full training set loss. An easy algorithm for getting unbiased minibatches and ensuring that all examples are used is to shuffle the training examples and take minibatches sequentially until all examples are used, then repeat. Order matters in the sense that the shuffle is the reason this algorithm creates correctly sampled minibatches. Hope this helps! |
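Both points can be checked with a few lines of plain Python. The per-example gradient values are made up, and `batch_gradient` and `epoch_minibatches` are hypothetical helper names; exact fractions are used so the comparison is not blurred by floating-point rounding.

```python
import random
from fractions import Fraction

def batch_gradient(per_example_grads):
    """Mean of per-example gradients, i.e. the gradient of the mean batch loss."""
    return sum(per_example_grads) / len(per_example_grads)

def epoch_minibatches(examples, batch_size, rng):
    """Shuffle once, then take minibatches sequentially until all examples are used."""
    order = list(examples)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Reordering the examples inside a batch leaves the batch gradient unchanged.
grads = [Fraction(3, 2), Fraction(-1, 4), Fraction(2), Fraction(-5, 8)]
reordered = list(reversed(grads))
same_gradient = batch_gradient(grads) == batch_gradient(reordered)

# One epoch: every example appears in exactly one minibatch.
batches = epoch_minibatches(range(10), batch_size=4, rng=random.Random(0))
```

The last minibatch is smaller when the dataset size is not a multiple of the batch size; whether to keep or drop it is a separate choice.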
Great explanation! |
@pumpikano I'm a bit confused after erlendd's comment... Just to be clear, you don't necessarily need the gradient reversal layer as long as you reverse the domain_cost in the final objective (total_cost), correct? |
@qianyizhang reversing domain_cost at the final objective does do the same thing as a GRL. If you just maximise the domain cost you won't know if it's because the shared feature layer has invariance or if the final layer of the domain classifier is bad at separating the domains. |
@erlendd I think you meant to say "does not do the same thing as a GRL"? In any case, looking at the diagram from the paper might be helpful. Descent on the parameters of G_d (the domain classifier) is minimizing the domain classification loss, whereas descent on the parameters of G_f (the feature extractor) is maximizing the domain classification loss because the gradients were reversed. If you removed the GRL and inverted the domain cost, descent on the parameters of G_d and the parameters of G_f would both be maximizing the domain cost. |
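The difference can be seen on a toy scalar model where G_f is a single weight `w_f`, G_d is a single weight `w_d`, and the domain loss is a squared error. All of this is an illustrative assumption (names, loss, and update rule included), not the paper's or the repository's code.

```python
def domain_loss_grads(w_f, w_d, x, d):
    """Gradients of (w_d * w_f * x - d)**2 w.r.t. w_d and w_f."""
    err = w_d * w_f * x - d
    return 2 * err * w_f * x, 2 * err * w_d * x

def step_with_grl(w_f, w_d, x, d, lr, hp_lambda=1.0):
    g_d, g_f = domain_loss_grads(w_f, w_d, x, d)
    new_w_d = w_d - lr * g_d                  # G_d: plain descent on the domain loss
    new_w_f = w_f - lr * (-hp_lambda * g_f)   # G_f: gradient was reversed, so ascent
    return new_w_f, new_w_d

def step_with_negated_loss(w_f, w_d, x, d, lr):
    g_d, g_f = domain_loss_grads(w_f, w_d, x, d)
    new_w_d = w_d - lr * (-g_d)               # G_d: ascent on the domain loss
    new_w_f = w_f - lr * (-g_f)               # G_f: ascent on the domain loss
    return new_w_f, new_w_d

grl_w_f, grl_w_d = step_with_grl(1.0, 0.5, 1.0, 1.0, lr=0.1)
neg_w_f, neg_w_d = step_with_negated_loss(1.0, 0.5, 1.0, 1.0, lr=0.1)
```

With `hp_lambda=1` the feature weight receives the same update in both variants; only the domain classifier weight moves in opposite directions, which is exactly the distinction described above.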
@erlendd |
I see how setting the sample_weights to zero for the target will prevent the classifier from updating, but doesn't it also keep the domain classifier from updating? Can you not use different sample_weights for different outputs? |
@tmullen93 if your model has two outputs you can define sample_weights as a 2D array such as [[1,1,1...0,0,0], None], one for each output. |
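A small helper for building that structure might look like the sketch below. The function name and the `is_source` flag list are hypothetical; the result mirrors the `[[1,1,1...0,0,0], None]` shape from the comment above, and how it is passed to Keras (and whether `None` is accepted for an output) depends on the Keras version, so treat that part as an assumption to verify.

```python
def build_sample_weights(is_source):
    """Per-output sample weights for a two-output model (label output, domain output)."""
    # Label classifier: source samples count, target samples are masked out.
    label_weights = [1.0 if s else 0.0 for s in is_source]
    # Domain classifier: None is meant as "no custom weighting", so every sample counts.
    domain_weights = None
    return [label_weights, domain_weights]


is_source = [True, True, True, False, False, False]
sample_weights = build_sample_weights(is_source)
```

This way the domain classifier still trains on all samples while the label classifier only sees the source labels.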
@michetonu Ahh that makes sense thank you! What do you mean by make your batch alternate through source and target? Does this mean making a generator and forcing it to build a batch that is equal? Could this be why when I take your suggestion for the sample weight but don't make a generator to make the even batch my loss becomes NaN? I should also note that I'm trying to do multivariate regression instead of classification. |
@tmullen93 just create mini-batches manually (with a generator or otherwise) in which target and source observations are alternated 50/50 (make sure to do the same for the labels, and be careful if you re-shuffle), and then set the training batch_size to a multiple of your mini-batch size. So for example, if you alternate target and source observations 1 by 1, you can choose any power of 2 for your batch size, as it will have the same proportion of target/source samples. |
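Here is a rough sketch of the 1-by-1 interleaving, with a check that sequential batches stay balanced. The `interleave` helper and the toy sample names are made up for illustration, and the data must not be re-shuffled afterwards for the check to hold.

```python
def interleave(source, target):
    """Alternate source and target samples 1 by 1, returning samples and source flags."""
    n = min(len(source), len(target))
    samples, is_source = [], []
    for s, t in zip(source[:n], target[:n]):
        samples.extend([s, t])
        is_source.extend([True, False])
    return samples, is_source

def source_fraction_per_batch(is_source, batch_size):
    """Fraction of source samples in each sequential batch of the given size."""
    fractions = []
    for i in range(0, len(is_source), batch_size):
        chunk = is_source[i:i + batch_size]
        fractions.append(sum(chunk) / len(chunk))
    return fractions


source = ["s%d" % i for i in range(8)]
target = ["t%d" % i for i in range(8)]
samples, is_source = interleave(source, target)
fractions_for_4 = source_fraction_per_batch(is_source, batch_size=4)
```

The same `is_source` flags can then be turned into the 1/0 sample weights for the label classifier.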
What can I do if I have multiple domains I'd like to account for? How would I set sample_weights? Arbitrarily choose a source domain and set everything else to one? |
@GlastonburyC I tried this method on data with about 100+ domains, and a single target domain. Trying to use many source domains at once with this method doesn't really work (the network can't learn very well). You might be able to use an unsupervised method (e.g. dAE, TCA) to bring the source domains closer together, but I guess it's still an open question. |
Thanks for the response! Could you show me an example of how you set the sample weights for multiple domains?
|
@GlastonburyC you mean if you wanted to append a 'weight' column to your data, where that weight might be different for each domain? I don't have the code in front of me but I did it using Pandas. If the training covariates are stored in a dataframe called df you can add a new column doing:
If you want to make it conditional on the domain, you can use this:
|
@michetonu I'm performing image segmentation, so I've bolted on a domain classifier with a gradient reversal layer. If my batch is 50/50 'source' and 'target' images and I don't set any sample_weights, is the gradient reversal layer sitting between the domain classifier and my segmentation network equivalent to forcing my segmentation network to learn domain-invariant features? I don't necessarily have a 'source' and a 'target'; I just want the segmentation network to work on a test set that may be from a different distribution to my input. |
@GlastonburyC if you don't 'hide' the target samples from the classifier, you are not performing unsupervised domain adaptation anymore. It will probably still help, but I think there are better ways to do this if you have the labels for your 'target' dataset. |
My thought process is that you sometimes don't know your test data's target domain. Therefore, if you negate the gradient of the domain classifier for both source and target samples (where the source and target are merely two distributions from a possible population of distributions), the classifier network is forced to learn domain-invariant features. Is that logic correct? I would appreciate it if you pointed me to a paper on a better method! :) Cheers for the help. |
Thanks for this contribution. Previously we have implemented this in a very difficult way by having two different models with tied encoder weights but separate loss functions, this is way more elegant and less error-prone. However I'm not sure what the hp_lambda stands for. What are we supposed to pass there? |
It's a scaling factor for how much of the reversed domain gradient is added to the classifier's weight updates (1 is the full gradient, 0.5 is half a gradient update, etc.).
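In other words, on the shared layers the update combines the label gradient with the reversed, scaled domain gradient. A minimal sketch, assuming the two gradients have already been computed for the same weights (the function name and the numbers are made up):

```python
def shared_layer_gradient(label_grad, domain_grad, hp_lambda):
    """Label gradient plus the domain gradient reversed and scaled by hp_lambda."""
    return [gl - hp_lambda * gd for gl, gd in zip(label_grad, domain_grad)]


label_grad = [0.2, -0.4]
domain_grad = [1.0, 0.5]

no_adaptation = shared_layer_gradient(label_grad, domain_grad, hp_lambda=0.0)
half_strength = shared_layer_gradient(label_grad, domain_grad, hp_lambda=0.5)
full_strength = shared_layer_gradient(label_grad, domain_grad, hp_lambda=1.0)
```

With `hp_lambda=0` the domain branch has no effect on the shared layers; larger values push the features harder towards domain confusion.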
|
Right, thanks for the explanation (and quick response). I suspected something like this since it was multiplied with the gradient, good to know for sure. Going to try it out now :) |
By the way, as an alternative to passing custom sample weights to do the iterative training of the two targets, you could probably just compile two models.
A simple minimax training would then look something like this:
Or am I overlooking something? EDIT: seems to be working. The convenience is that when you're done you can just take the classifier_trainer model and use that as the "final model" for evaluation. |
I just wondered how one would modify the code to do the multiple-source variant of this. I.e. following this paper: https://arxiv.org/abs/1705.09684. It isn't clearly explained how the batches are constructed with multiple source domains. |
@michetonu I have followed the same strategy as you regarding the network architecture: one model with two outputs.
Since I am working with the fit_generator function, I have implemented the following custom loss instead of using sample weights:
That makes sense to me because I create mini-batches manually alternating 50-50 source and target samples. However, I am not getting domain adaptation, even if I change hyperparameters. That's why I wonder if this custom loss implementation is correct. I'd really appreciate any help. Thanks! |
@amn-gti-upm Sorry for the late reply! |
@michetonu As I pointed out in my previous comment, I used one model with two outputs: the label predictor (LP) and the domain classifier (DC). I used the
These two losses are additive. In order to maximize the domain classifier loss, i.e. the binary cross-entropy, I used your Gradient Reversal Layer (GRL) implementation. I finally got domain adaptation with this implementation, but fixing the lambda parameter at 1. |
Hey,
First of all, thanks a lot for this. I was wondering whether there is an easy way to make the gradient flipping work in Keras. Someone has done it for the Theano backend, but not for the TensorFlow one. Would it be feasible to combine the two?
Thanks!