
MT5 fine-tuning #5

Open
Gypsophila1006 opened this issue Sep 24, 2024 · 5 comments
@Gypsophila1006

When I try to fine-tune the mt5 model, I cannot obtain the 'llm-coref/mt5-coref-ontonotes' dataset. Is this a private dataset? How can I obtain it? Thank you!

@ianporada
Owner

Hi, yes, this is a private dataset, as we cannot distribute the OntoNotes data. You can recreate the dataset by setting --model_path to oracle, in which case the inference code will generate the training data.

For convenience, I've made it public but gated, so you can request access if you already have access to the OntoNotes data: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes
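
Once access is granted, the dataset can be loaded in the usual way, e.g. (a minimal sketch; assumes you have authenticated locally with huggingface-cli login):

from datasets import load_dataset

# Gated dataset: request access on the Hub first, then authenticate locally
# (e.g. `huggingface-cli login`) so the download is authorized.
dataset = load_dataset("llm-coref/mt5-coref-ontonotes")
print(dataset)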

@ianporada
Owner

That being said, fine-tuning mT5 at the base and large sizes does not yield great performance (as shown in the paper). At these sizes the model often "hallucinates" mentions, which has a compounding negative effect with each sentence.

@Gypsophila1006
Author

I want to try Chinese coreference resolution. Did you only use the English data in the OntoNotes dataset when fine-tuning? Could you please share your method for processing the OntoNotes dataset? I would like to refer to it when implementing the Chinese version.

@ianporada
Owner

I only used English data. The process for generating the data is to run the command in the README with --model_path set to oracle, e.g.:

python main.py \
    --model_path oracle \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split test \
    --batch_size 4 \
    --dataset_name preco \
    --no_pound_symbol \
    --subset 10 \
    --subset_start 0

which will generate a file of input/output pairs.

Let me know if that works for you. I can try to generate Chinese training data when I get the chance. The Chinese OntoNotes data uniquely has "zero anaphora" annotations, but I believe the training data generation process should still work as above.
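
For reference, fine-tuning mT5 on the generated pairs could look roughly like the sketch below. This is not the exact setup from the paper; the JSONL file name and the "input"/"output" field names are assumptions, so adjust them to match the generated output.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical file produced by the oracle run; "input"/"output" field names
# are assumptions and should be matched to the actual generated file.
raw = load_dataset("json", data_files={"train": "train.jsonl"})

def preprocess(batch):
    # Tokenize source text (3000 matches --max_input_size) and target text.
    enc = tokenizer(batch["input"], max_length=3000, truncation=True)
    labels = tokenizer(text_target=batch["output"], max_length=512, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5_coref_finetune",
        per_device_train_batch_size=4,
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()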

@ianporada
Owner

I've generated training data for Chinese OntoNotes by running the following:

cd models/decoder_based/LinkAppend 

MODEL_CHECKPOINT=oracle
OUTPUT=~/linkappend_output
mkdir $OUTPUT

python main.py \
    --model_path $MODEL_CHECKPOINT \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split train \
    --batch_size 1 \
    --dataset_name ontonotes_chinese \
    --no_pound_symbol

The data is available here: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes-chinese
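
A quick way to inspect it (a sketch; assumes access to the gated dataset has been granted and that a train split exists):

from datasets import load_dataset

ds = load_dataset("llm-coref/mt5-coref-ontonotes-chinese", split="train")
print(len(ds), "examples")
print(ds[0])  # one input/output pair for a quick sanity check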

Please let me know if it looks correct to you.
