MT5 fine-tuning #5
Hi, yes, this is a private dataset, as we cannot distribute the OntoNotes data. You can recreate the dataset by setting the model to `oracle` and running the data-generation command (details below). For convenience I've made it public but gated, so you can request access if you already have access to the OntoNotes data: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes
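Once access is granted on the Hub, the dataset can be loaded like any other gated dataset. A minimal sketch, assuming the `datasets` library is installed and you have authenticated with `huggingface-cli login` (the function name here is just an illustrative wrapper, not part of the repo):

```python
DATASET_ID = "llm-coref/mt5-coref-ontonotes"  # gated: request access on the Hub first

def load_coref_dataset(split: str = "train"):
    """Load the gated dataset; requires granted access and a saved Hub credential."""
    from datasets import load_dataset  # pip install datasets
    # token=True reuses the credential stored by `huggingface-cli login`.
    return load_dataset(DATASET_ID, split=split, token=True)

# usage (after access is granted):
# train = load_coref_dataset("train")
```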
That being said, fine-tuning mT5 at the base and large sizes does not give great performance (as shown in the paper). At these sizes the model often "hallucinates" mentions, which has a compounding negative effect with each sentence.
I want to try Chinese coreference resolution. Did you use only the English data in OntoNotes when fine-tuning? Could you please share your method for processing the OntoNotes dataset? I would like to refer to it to implement the Chinese version.
I only used English data. The process for generating the data is to run the command in the README with the model set to `oracle`, e.g.:

```shell
python main.py \
    --model_path oracle \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split test \
    --batch_size 4 \
    --dataset_name preco \
    --no_pound_symbol \
    --subset 10 \
    --subset_start 0
```

which will generate a file of input/output pairs. Let me know if that works for you. I can try to generate Chinese training data when I get the chance. The Chinese OntoNotes data uniquely has "zero anaphora" annotations, but I believe the training data generation process should still work as above.
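The generated file of input/output pairs can then be fed into any standard seq2seq fine-tuning loop. As a rough sketch of consuming such a file: the JSONL layout, the `input`/`output` field names, and the pair contents below are all assumptions for illustration, not the actual format `main.py` emits, so adjust to whatever the script produces:

```python
import json
import os
import tempfile

# Hypothetical JSONL layout: one {"input": ..., "output": ...} object per line.
pairs = [
    {"input": "Alice met Bob . She greeted him .", "output": "She -> Alice ; him -> Bob"},
    {"input": "The cat sat . It purred .", "output": "It -> The cat"},
]

path = os.path.join(tempfile.mkdtemp(), "pairs.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for p in pairs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")

def read_pairs(path):
    """Yield (source, target) text tuples for seq2seq fine-tuning."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            yield rec["input"], rec["output"]

examples = list(read_pairs(path))
print(len(examples))  # -> 2
```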
I've generated training data for Chinese OntoNotes by running the following:

```shell
cd models/decoder_based/LinkAppend
MODEL_CHECKPOINT=oracle
OUTPUT=~/linkappend_output
mkdir $OUTPUT

python main.py \
    --model_path $MODEL_CHECKPOINT \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split train \
    --batch_size 1 \
    --dataset_name ontonotes_chinese \
    --no_pound_symbol
```

The data is available here: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes-chinese. Please let me know if it looks correct to you.
When I try to fine-tune the mT5 model, I cannot obtain the 'llm-coref/mt5-coref-ontonotes' dataset. Is this a private dataset? How can I obtain it? Thank you!