Question about "{d}" and "{p,d}" #5
@hujunyi96 Thanks for checking the paper! Please take a look at the baseline subsections in the Experiments section.
"1. MPCFORMER w/o {d} also constructs the approximated model S' but trains on D with the task-specific objective, i.e., without distillation. We note that S' is initialized with weights in T, i.e., with different functions, whose effect has not been studied. We thus propose a second baseline MPCFORMER w/o {p,d}, which trains S' on D without distillation and with random weight initialization." (from the baseline subsections in the Experiments section) Hello, @DachengLi1, I think these two statements are contradictory, aren't they?
@hujunyi96 Definitely! Assume BERT-Base on CoLA as the example. Note that all three of these models are S', which uses the approximations; only T uses GeLU + Softmax, if that is what is confusing.
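To make the three setups concrete, here is a minimal PyTorch-style sketch (not the repo's code: ToyModel, task_loss, and distill_loss are illustrative stand-ins, only the activation is swapped to a quadratic placeholder, and distillation is reduced to simple logit matching rather than the staged distillation used in MPCFormer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyModel(nn.Module):
    """Stand-in for a Transformer; approx=True mimics S' (MPC-friendly approximations)."""

    def __init__(self, approx: bool = False):
        super().__init__()
        self.linear = nn.Linear(16, 2)
        # S' swaps GeLU (and Softmax) for cheaper approximations; here only the
        # activation is swapped, as a placeholder.
        self.act = (lambda x: 0.125 * x * x + 0.25 * x + 0.5) if approx else F.gelu

    def forward(self, x):
        return self.act(self.linear(x))


def task_loss(logits, labels):
    return F.cross_entropy(logits, labels)


def distill_loss(student_logits, teacher_logits):
    return F.mse_loss(student_logits, teacher_logits)


teacher = ToyModel(approx=False)  # T: the fine-tuned model with GeLU + Softmax

# MPCFormer (full): S' initialized from T ({p}) and trained with distillation ({d}).
s_full = ToyModel(approx=True)
s_full.load_state_dict(teacher.state_dict())

# MPCFormer w/o {d}: S' initialized from T, but trained on D with the task loss only.
s_wo_d = ToyModel(approx=True)
s_wo_d.load_state_dict(teacher.state_dict())

# MPCFormer w/o {p,d}: S' randomly initialized, trained on D with the task loss only.
s_wo_pd = ToyModel(approx=True)

x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
with torch.no_grad():
    t_logits = teacher(x)
print("MPCFormer (distillation) loss:", distill_loss(s_full(x), t_logits).item())
print("w/o {d} (task loss only)     :", task_loss(s_wo_d(x), y).item())
print("w/o {p,d} (task loss only)   :", task_loss(s_wo_pd(x), y).item())
```

All three students are S' in the sense above; they differ only in how they are initialized and which loss they are trained with.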
Hello, @DachengLi1, when I was trying to use the parameter "--hidden_act quad" to train the baselines with the approximations, which are the first major innovation in your paper (the second being distillation), an error occurred: KeyError: 'quad'. That means the stock Transformers library, e.g. 'hidden_act' in the BertConfig class, doesn't support the new activation functions in your paper (the exact library file that raises this error is xxx/site-packages/transformers/activations.py, line 208, in __getitem__).
@hujunyi96 We have a modified version of Transformers that supports this: https://github.com/DachengLi1/MPCFormer/tree/main/transformers. In particular, see MPCFormer/src/main/transformer/modeling.py, line 139 (commit 38cb42c).
Or, even simpler, you can just copy-paste these few new functions wherever you want them to be.
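For reference, a hedged sketch of what such a copy-paste could look like: the quadratic GeLU replacement (the Quad form reported in the paper, 0.125x^2 + 0.25x + 0.5) plus a best-effort registration so that "--hidden_act quad" resolves in a stock Transformers install. QuadGELU and register_quad_activation are names made up for this sketch, and the registration branch is an assumption, since the layout of transformers.activations.ACT2FN differs across versions; the modified Transformers bundled in the repo is the supported route.

```python
import torch
import torch.nn as nn


class QuadGELU(nn.Module):
    """Quadratic approximation of GeLU (MPC-friendly), per the paper's Quad form."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.125 * x * x + 0.25 * x + 0.5


def register_quad_activation() -> None:
    """Best-effort: make ACT2FN['quad'] resolve; adjust to your transformers version."""
    from transformers import activations

    if type(activations.ACT2FN) is dict:
        # Older releases map an activation name directly to a callable.
        activations.ACT2FN["quad"] = QuadGELU()
    else:
        # Newer releases use a dict-like wrapper that instantiates a class on
        # lookup, so storing the class itself is enough.
        activations.ACT2FN["quad"] = QuadGELU
```

If you go this route, the registration would have to run before the model is built (e.g., near the top of run_glue.py), and it only covers the GeLU side, not the Softmax approximation.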
@DachengLi1 I see. I was following the main procedure in the README.md in the /baselines folder; as you can see, the listed commands actually execute run_glue.py, which doesn't seem to import MPCFormer/src/main/transformer/modeling.py as a module, so I didn't notice the module was already in the project. Thanks for your help!
How does the command "pip install -e .", executed in the path /MPCFormer/transformers, achieve installing modules in a different path, /src/main/transformer/?
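Only a general note, since this is left open in the thread: `pip install -e .` installs whatever package the setup.py in that directory declares, in editable ("development") mode, meaning the checkout itself is linked onto the import path rather than files being copied elsewhere. A quick way to see which copy of Transformers an environment actually resolves (the expected path in the comment is an assumption about the checkout layout, not taken from the repo):

```python
import transformers

print(transformers.__version__)
# If the editable install took effect, this is expected to point into the
# MPCFormer/transformers checkout rather than a regular site-packages copy.
print(transformers.__file__)
```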
As written in your article: “p” stands for using weights in T as initialization, and “d” stands for applying knowledge distillation with T as the teacher.
My question is: does “using weights in T as initialization” mean a fine-tuned model? E.g., “p” stands for “fine-tuning”; namely, 1. MPCBert-B stands for the most basic pre-trained transformer, 2. MPCBert-B w/o {d} stands for applying KD on the most basic pre-trained transformer, and 3. MPCBert-B w/o {p,d} stands for applying KD on a fine-tuned transformer?