
MT5 fine-tuning #5

Open
Gypsophila1006 opened this issue Sep 24, 2024 · 5 comments
@Gypsophila1006

When I try to fine-tune the mt5 model, I cannot obtain the 'llm-coref/mt5-coref-ontonotes' dataset. Is this a private dataset? How can I obtain it? Thank you!

@ianporada
Owner

Hi, yes, this is a private dataset, as we cannot distribute the OntoNotes data. You can recreate the dataset by setting --model_path to oracle, in which case the inference code will generate the training data.

For convenience, I've made it public but gated, so you can request access if you already have access to the OntoNotes data: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes
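
Once access is granted, the dataset can be loaded in the usual way, e.g. (a minimal sketch; assumes you have authenticated locally with huggingface-cli login):

from datasets import load_dataset

# Gated dataset: request access on the Hub first, then authenticate locally
# (e.g. `huggingface-cli login`) so the download is authorized.
dataset = load_dataset("llm-coref/mt5-coref-ontonotes")
print(dataset)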

@ianporada
Owner

That being said, fine-tuning mT5 at the base and large sizes does not yield great performance (as shown in the paper). At these sizes the model often "hallucinates" mentions, which has a compounding negative effect with each sentence.

@Gypsophila1006
Author

I want to try Chinese coreference resolution. Did you only use the English data in the OntoNotes dataset when fine-tuning? Could you please share your method for processing the OntoNotes dataset? I would like to refer to it when implementing the Chinese version.

@ianporada
Owner

I only used English data. The process for generating the data is to run the command in the README with --model_path set to oracle, e.g.:

python main.py \
    --model_path oracle \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split test \
    --batch_size 4 \
    --dataset_name preco \
    --no_pound_symbol \
    --subset 10 \
    --subset_start 0

which will generate a file of input/output pairs.

Let me know if that works for you. I can try to generate Chinese training data when I get the chance. The Chinese OntoNotes data uniquely has "zero anaphora" annotations, but I believe the training data generation process should still work as above.
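
For reference, fine-tuning mT5 on the generated pairs could look roughly like the sketch below. This is not the exact setup from the paper; the JSONL file name and the "input"/"output" field names are assumptions, so adjust them to match the generated output.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical file produced by the oracle run; "input"/"output" field names
# are assumptions and should be matched to the actual generated file.
raw = load_dataset("json", data_files={"train": "train.jsonl"})

def preprocess(batch):
    # Tokenize source text (3000 matches --max_input_size) and target text.
    enc = tokenizer(batch["input"], max_length=3000, truncation=True)
    labels = tokenizer(text_target=batch["output"], max_length=512, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5_coref_finetune",
        per_device_train_batch_size=4,
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()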

@ianporada
Owner

I've generated training data for Chinese OntoNotes by running the following:

cd models/decoder_based/LinkAppend 

MODEL_CHECKPOINT=oracle
OUTPUT=~/linkappend_output
mkdir $OUTPUT

python main.py \
    --model_path $MODEL_CHECKPOINT \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split train \
    --batch_size 1 \
    --dataset_name ontonotes_chinese \
    --no_pound_symbol

The data is available here: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes-chinese
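
A quick way to inspect it (a sketch; assumes access to the gated dataset has been granted and that a train split exists):

from datasets import load_dataset

ds = load_dataset("llm-coref/mt5-coref-ontonotes-chinese", split="train")
print(len(ds), "examples")
print(ds[0])  # one input/output pair for a quick sanity check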

Please let me know if it looks correct to you.
