
Reproduce the Abt-Buy and Amazon-Google results #2

Open
ptyshevs opened this issue Dec 27, 2022 · 10 comments

@ptyshevs

Hi, I couldn't find the preprocessing you've done for the mentioned datasets. I'm curious how to obtain metrics similar to the ones you've reported in your paper.

@rpeeters85
Contributor

Hi, when you run the download_datasets.py script referred to in the README.MD, the abt-buy and amazon-google datasets are included. These files are the same ones you can download from the deepmatcher repository: https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md

Starting from these source files, the datasets are further processed in the preprocess_deepmatcher_datasets.py and prepare_data_deepmatcher.py scripts (see the README.MD for their location). These scripts are mainly concerned with creating the correct structure, e.g. for contrastive learning.

Some further processing, for example the serialization into a string and the application of a transformer tokenizer, is then done in the dataset classes defined in the datasets.py script under src/contrastive/data.
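To give a rough idea of that step, serialization plus tokenization boils down to something like the following (the attribute names and the serialize_row helper are illustrative placeholders, not the repository's actual code):

```python
# Illustrative sketch of serializing a product record into one string and
# tokenizing it with RoBERTa's tokenizer; attribute names and the helper are
# placeholders, not the repository's actual implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")


def serialize_row(row: dict) -> str:
    # Concatenate the textual attributes into one string, skipping missing values.
    attributes = ["title", "description", "manufacturer", "price"]
    return " ".join(str(row[attr]) for attr in attributes if row.get(attr))


example = {"title": "Sony Cyber-shot DSC-W80", "manufacturer": "Sony", "price": "199.99"}
encoded = tokenizer(serialize_row(example), truncation=True, max_length=128)
print(encoded["input_ids"][:10])
```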

If you follow the instructions in the README.MD you should be able to follow along with the code and reproduce the exact results from the paper.
Please let me know if this has helped or if you have any remaining questions!

@Syno8

Syno8 commented Feb 17, 2023

@rpeeters85 Although I read your code carefully, I didn't find the implementation of the batch building process mentioned in the paper. Perhaps I have overlooked some details; please advise me.

prepare_data_deepmatcher.py generates a cluster id for each example for contrastive training.
preprocess_deepmatcher_datasets.py also generates cluster ids and does nothing for batch building.
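For context, my understanding of that cluster id step is roughly connected components over the labelled matching pairs; here is a minimal sketch of that idea (not the repository's code, column names assumed):

```python
# Minimal sketch (not the repository's code): derive a cluster id per record by
# treating labelled matches as graph edges and taking connected components.
# Column names are assumptions for illustration.
import networkx as nx
import pandas as pd

pairs = pd.DataFrame({
    "left_id":  ["a1", "a2", "a3"],
    "right_id": ["b1", "b1", "b4"],
    "label":    [1, 1, 0],  # 1 = match, 0 = non-match
})

graph = nx.Graph()
graph.add_edges_from(
    pairs.loc[pairs["label"] == 1, ["left_id", "right_id"]].itertuples(index=False)
)

cluster_id = {}
for cid, component in enumerate(nx.connected_components(graph)):
    for record in component:
        cluster_id[record] = cid
# Records that never appear in a positive pair would each get their own singleton cluster.

print(cluster_id)  # e.g. {'a1': 0, 'b1': 0, 'a2': 0}
```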

@rpeeters85
Contributor

The batch building process for contrastive training is found in the respective dataset class in src/contrastive/data/datasets.py. For the ContrastivePretrainDataset class, for example, have a look at the get_item method, which handles how a single item is selected from the dataset during batch building.

You will see that we do not only select a single item but always also a corresponding positive for that item. This is a little different from the original SupCon implementation, which uses two augmented views of the same example at this point. Augmentation in the R-SupCon implementation is thus entirely optional.
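Schematically, that selection looks roughly like this (a simplified sketch, not the actual ContrastivePretrainDataset code; the column names are assumptions):

```python
# Simplified sketch of the anchor-plus-positive selection; not the actual
# ContrastivePretrainDataset implementation. Assumes a DataFrame with a
# serialized 'features' column and a 'cluster_id' column.
import pandas as pd
from torch.utils.data import Dataset


class PretrainPairDataset(Dataset):
    def __init__(self, data: pd.DataFrame):
        self.data = data.reset_index(drop=True)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        anchor = self.data.iloc[idx]
        # Every other record in the same cluster is a valid positive; fall back
        # to the anchor itself if the cluster has no other member.
        same_cluster = self.data[
            (self.data["cluster_id"] == anchor["cluster_id"]) & (self.data.index != idx)
        ]
        positive = same_cluster.sample(1).iloc[0] if len(same_cluster) > 0 else anchor
        return {
            "features": anchor["features"],
            "positive": positive["features"],
            "label": anchor["cluster_id"],
        }
```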

Assembling the batches of such "pairs" of matching examples is handled by the transformers library in the background; this is not explicitly handled in this code. After a batch is built, it is passed to the corresponding data collator in src/contrastive/data/data_collators.py, where the actual tokenization using RoBERTa's tokenizer takes place and the batch is transformed into tensors ready for input to the model itself.

For example, the output of the DataCollatorContrastivePretrain is a batch of inputs, each comprising an (optionally augmented) single example and a randomly selected positive, together with their corresponding label.
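Conceptually, the collator then does something along these lines (again simplified; the real DataCollatorContrastivePretrain differs in its details):

```python
# Simplified sketch of a contrastive pretraining collator: tokenize anchors and
# their positives together and stack everything into one tensor batch. The real
# DataCollatorContrastivePretrain differs in its details.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")


def collate_contrastive(batch):
    texts = [item["features"] for item in batch] + [item["positive"] for item in batch]
    labels = [item["label"] for item in batch] * 2
    encoded = tokenizer(
        texts, padding=True, truncation=True, max_length=128, return_tensors="pt"
    )
    encoded["labels"] = torch.tensor(labels)
    return encoded
```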

I hope this clears up the batch building procedure and where to find it, let me know if something remains unclear.

@Syno8

Syno8 commented Feb 19, 2023

How about the contrastive learning in the fine-tuning step (ContrastiveClassificationDataset)?
I also noticed a phenomenon during evaluation: the score swings higher and lower across different evaluations. That is because my model outputs 1 or 0 for all examples. @rpeeters85 do you have any tips for me?

@Syno8

Syno8 commented Feb 21, 2023

@ptyshevs hi, did you reproduce the results described in the paper? Or the LSPC results?

@Syno8

Syno8 commented Feb 21, 2023

@rpeeters85 Hi,
The first time I tried your script on the LSPC dataset, I couldn't run it directly; it was probably an environment problem. Based on my understanding, I changed the collators and ran it, and the metrics swung high and low as I described earlier. I changed

features_left = [x['features_left'][0] for x in input]
features_right = [x['features_right'][0] for x in input]

to

features_left = [x['features_left'] for x in input]

The second time I followed your script, the environment was consistent, and I could run it directly, but the F1 scores on the abtbuy dataset were:

f1 = 0.7846 (split + aug)
f1 = 0.851153039 (nosplit)
f1 = 0.848232848 (nosplit + aug)

The settings used were:

BATCH_SIZE=1024
LEARNING_RATE=5e-05
TEMPERATURE=0.07
AUG=all

GPU=8
BATCH_SIZE=64
LEARNING_RATE=5e-05
TEMPERATURE=0.07
AUG=all
PREAUG=all

@rpeeters85
Contributor

In the relevant script files (pretraining and fine-tuning) a random seed is set to make the results reproducible, so there should be no change in F1 across multiple runs. For which datasets are these F1 scores? If it is LSPC, the nosplit/split option should not be used at all, since it is only relevant for the source-aware sampling in the deepmatcher datasets.
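For reference, fixing the seed in a transformers-based script usually amounts to something like:

```python
# Fix the python, numpy and torch random seeds so repeated runs give identical results.
from transformers import set_seed

set_seed(42)
```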

I don't understand what you mean by changing collators. There is one collator to handle the pretraining (either DataCollatorContrastivePretrain for the LSPC datasets or DataCollatorContrastivePretrainDeepmatcher for the Deepmatcher datasets) and one to handle the fine-tuning step, DataCollatorContrastiveClassification. They are imported and set in the respective script files (e.g. run_pretraining.py/run_pretraining_deepmatcher.py for pretraining and run_finetune_siamese.py for fine-tuning).

Regarding your second comment: There is no contrastive learning in the fine-tuning stage. Here the batch building process is based on the pairs found in the predefined train/validation/test splits, which are subsequently embedded using the frozen pretrained transformer, combined using the combination presented in the paper (this happens in the forward method of the classification model), and passed to a linear layer for the matching/non-matching decision.
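As a schematic sketch of that fine-tuning step (the pooling and the concatenation [u; v; |u - v|] used here are illustrative assumptions, not necessarily the exact combination described in the paper):

```python
# Schematic sketch of the fine-tuning step: embed both sides of a pair with the
# frozen pretrained encoder, combine the embeddings, and classify with a linear
# layer. The pooling and the concatenation [u; v; |u - v|] are illustrative
# assumptions, not necessarily the exact combination from the paper.
import torch
import torch.nn as nn
from transformers import AutoModel


class SiameseMatcher(nn.Module):
    def __init__(self, model_name: str = "roberta-base", hidden: int = 768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        for p in self.encoder.parameters():  # keep the pretrained transformer frozen
            p.requires_grad = False
        self.classifier = nn.Linear(3 * hidden, 1)

    def forward(self, left_inputs: dict, right_inputs: dict) -> torch.Tensor:
        u = self.encoder(**left_inputs).last_hidden_state[:, 0]  # <s> token embedding
        v = self.encoder(**right_inputs).last_hidden_state[:, 0]
        combined = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(combined)  # logit for the match / non-match decision
```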

@Syno8

Syno8 commented Feb 22, 2023

@rpeeters85 thank you for your time,

Regarding "I don't understand what you mean by changing collators": that was probably caused by my environment. I have solved it by creating a new environment matching yours.

The scores f1 = 0.7846 (split + aug), f1 = 0.851153039 (nosplit), and f1 = 0.848232848 (nosplit + aug) are the results on the abtbuy dataset, following your scripts and environment.

I can reproduce your results on LSPC after using the same environment as yours, but I am confused by the scores on the abtbuy dataset. Do you run on a single GPU or multiple GPUs? If you are using multiple GPUs, how many are used in your experiments?

@rpeeters85
Contributor

Yes, those results are low for abtbuy, and nosplit should actually perform much worse than split. I am only running on single GPUs and do not know what happens if you try multi-GPU training. That will be handled somewhere in the huggingface code, and I would assume it does not work with my custom batch building in the collators, since they are somewhat different from a default huggingface batch for RoBERTa.

Did you get your results with multi-GPU training? If yes, definitely use only a single GPU and report back. If not, I can try rerunning this repo from scratch for abt-buy in the coming days to try to reproduce the problem.
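If multi-GPU training was active, the quickest check is to make only one device visible before anything initializes CUDA, e.g.:

```python
# Make only one GPU visible to the run; this must be set before torch/CUDA is initialized.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```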

@Syno8

Syno8 commented Feb 22, 2023

@rpeeters85

Regarding single-GPU vs. multi-GPU performance, I tried to verify it and found that I obtain a normal score on the abtbuy dataset with a single GPU. This is strange.

I have a question about the code:

If the condition row['label'] != 1 holds, should the left and right records be in the same bucket or not?

During training, this code prints some numbers as follows:

[4]
[2]
[1]
[2]
[5]
[1, 0]
[3]
[0]
[3]
[2]
[1]
[1]
[1]

I have tried to find where this comes from in the code. I think it is caused by the transformers data collators. If you can tell me without much effort, I would be very thankful; if not, I can look into the transformers details myself.
