Commit: retrieval code

ArvinZhuang committed Oct 17, 2024
1 parent b1774c5 commit c54c4e9
Showing 5 changed files with 159 additions and 25 deletions.
136 changes: 135 additions & 1 deletion retrieval/README.md
@@ -1 +1,135 @@
-# Starbucks
+# Starbucks Representation Learning (SRL) fine-tuning for retrieval

## Installation
Our training code for passage retrieval is based on the [Tevatron](https://github.com/texttron/tevatron) library.

To install Tevatron:
```bash
pip install transformers datasets peft
pip install deepspeed accelerate
pip install faiss-cpu # or 'conda install pytorch::faiss-gpu' for faiss gpu search
pip install wandb # for logging
git clone https://github.com/texttron/tevatron.git
cd tevatron
pip install -e .
cd ..
```
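
As a quick sanity check (a minimal sketch, assuming only the packages installed by the commands above), you can confirm that the environment imports cleanly:
```python
# Minimal sanity check that the main dependencies installed above import cleanly.
import datasets
import faiss
import tevatron
import transformers

print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("faiss flat IP index dim:", faiss.IndexFlatIP(768).d)  # 768 = BERT-base hidden size
```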

We also use [Pyserini](https://github.com/castorini/pyserini/tree/master) to evaluate the results.
To install it, run the following command:
```bash
conda install -c conda-forge openjdk=21 maven -y
pip install pyserini
```
If you run into any issues installing Pyserini, please follow the official [installation guide](https://github.com/castorini/pyserini/blob/master/docs/installation.md).

## Training
To train the model, run the following command:
```bash
python3 train.py \
--output_dir checkpoints/retriever/bert-srl-msmarco \
--model_name_or_path bert-base-uncased \
--tokenizer_name bert-base-uncased \
--srl_training \
--save_steps 2000 \
--dataset_name Tevatron/msmarco-passage \
--bf16 \
--pooling cls \
--gradient_checkpointing \
--per_device_train_batch_size 128 \
--train_group_size 8 \
--learning_rate 1e-4 \
--query_max_len 32 \
--passage_max_len 196 \
--num_train_epochs 3 \
--layer_list 2,4,6,8,10,12 \
--embedding_dim_list 32,64,128,256,512,768 \
--kl_divergence_weight 1 \
--logging_steps 10 \
--overwrite_output_dir \
--report_to wandb \
--run_name bert-srl-msmarco
```

If you want to fine-tune from our SMAE pre-trained model, replace `bert-base-uncased` with our checkpoint [bert-base-uncased-fineweb100bt-smae](https://huggingface.co/ielabgroup/bert-base-uncased-fineweb100bt-smae).

We have also released our fine-tuned model on the Hugging Face Model Hub: [Starbucks-msmarco](https://huggingface.co/ielabgroup/Starbucks-msmarco).
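
As a rough illustration of what a single (layer, dimension) sub-model amounts to, the sketch below loads the released checkpoint as a plain BERT encoder, takes the CLS vector at the target layer, and keeps only the first `dim` dimensions. This is an assumption-laden sketch (it assumes the checkpoint loads with `AutoModel`); the `encode.py` workflow in the Evaluation section below is the supported path.
```python
# Illustrative sketch only: assumes the released checkpoint loads as a standard BERT
# encoder via AutoModel. A (layer, dim) embedding is taken as the CLS vector of the
# target layer, truncated to its first `dim` dimensions (cls pooling, as in training).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "ielabgroup/Starbucks-msmarco"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

layer, dim = 6, 128  # one of the (layer, dimension) pairs used during training

with torch.no_grad():
    inputs = tokenizer("what is representation learning?", return_tensors="pt",
                       truncation=True, max_length=32)
    hidden_states = model(**inputs).hidden_states  # embedding layer + one entry per transformer layer
    cls = hidden_states[layer][:, 0]               # CLS token output at the target layer
    embedding = cls[:, :dim]                       # keep only the first `dim` dimensions

print(embedding.shape)  # torch.Size([1, 128])
```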


## Evaluation
In this example, we use our released checkpoint [Starbucks-msmarco](https://huggingface.co/ielabgroup/Starbucks-msmarco) with the DL19 dataset.
You can change `--model_name_or_path` to your own fine-tuned model.
### Step 1: Encode query and passage embeddings
#### Encode query:
```bash
python3 encode.py \
--output_dir=temp \
--model_name_or_path Starbucks-msmarco \
--bf16 \
--pooling cls \
--per_device_eval_batch_size 64 \
--query_max_len 32 \
--passage_max_len 196 \
--dataset_name Tevatron/msmarco-passage \
--dataset_split dl19 \
--encode_output_path embeddings/msmarco/query.dl19.pkl \
--encode_is_query \
--layers_to_save 2,4,6,8,10,12
```
Note that we save the full-size embeddings from each target layer separately.
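
If you want to inspect what gets written, the sketch below loads one of the saved files. It assumes the Tevatron convention of pickling an `(embeddings, ids)` pair, and the path is a placeholder for one of the per-layer outputs. Because the sub-dimensions are nested during training, any smaller configuration is just a prefix slice of the stored full-size vectors.
```python
# Sketch for inspecting a saved embedding file. Assumes a Tevatron-style pickle
# holding an (embeddings, ids) pair; the path is a placeholder for one of the
# per-layer files produced by encode.py above.
import pickle
import numpy as np

path = "embeddings/msmarco/query.dl19.pkl"  # placeholder: point this at an actual per-layer file
with open(path, "rb") as f:
    reps, lookup_ids = pickle.load(f)

reps = np.asarray(reps)
print(reps.shape)  # e.g. (num_queries, 768): full-size vectors for this layer

# Smaller Starbucks configurations reuse the same file: a 128-dimensional
# embedding is simply the first 128 entries of each stored vector.
reps_128 = reps[:, :128]
```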

#### Encode passages
We shard the collection and encode each shard in parallel on multiple GPUs.
For example, with 4 available GPUs you can run the following commands:
```bash
mkdir -p embeddings/msmarco
NUM_AVAILABLE_GPUS=4
for i in $(seq 0 $((NUM_AVAILABLE_GPUS-1))); do
CUDA_VISIBLE_DEVICES=${i} python encode.py \
--output_dir=temp \
--model_name_or_path Starbucks-msmarco \
--bf16 \
--pooling cls \
--per_device_eval_batch_size 64 \
--query_max_len 32 \
--passage_max_len 196 \
--dataset_name Tevatron/msmarco-passage-corpus \
--encode_output_path embeddings/msmarco/corpus.${i}.pkl \
--layers_to_save 2,4,6,8,10,12 \
--layer_list 2,4,6,8,10,12 \
--embedding_dim_list 32,64,128,256,512,768 \
--dataset_number_of_shards ${NUM_AVAILABLE_GPUS} \
--dataset_shard_index ${i} &
done
wait
```

### Step 2: Perform retrieval and evaluate
We perform retrieval with a chosen target layer and embedding dimensionality.

For example, to perform retrieval with layer 6 and embedding dimension 128, run the following commands; a rough Python sketch of the underlying scoring is given after these commands.

```bash
n=6
d=128

python search.py \
--query_reps embeddings/msmarco/query.dl19.pkl \
--passage_reps embeddings/msmarco/"corpus*.pkl" \
--depth 1000 \
--batch_size 64 \
--save_text \
--save_ranking_to runs/run.dl19.n$n.d$d.txt \
--embedding_dim $d

# convert the results to trec format
python -m tevatron.utils.format.convert_result_to_trec \
--input runs/run.dl19.n$n.d$d.txt \
--output runs/run.dl19.n$n.d$d.trec

# Evaluation
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage runs/run.dl19.n$n.d$d.trec

# Results:
# ndcg_cut_10             all     0.6346
```
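
To make the `--embedding_dim` option concrete, here is the rough scoring sketch referenced above: truncate the query and passage vectors of the chosen layer to the first `d` entries and rank passages by inner product. It assumes the same `(embeddings, ids)` pickle layout and uses placeholder file names; `search.py` remains the supported way to produce run files.
```python
# Rough sketch of retrieval scoring at a reduced dimensionality. File names are
# placeholders and the (embeddings, ids) pickle layout is assumed, as above.
import pickle
import numpy as np

d = 128  # target embedding dimensionality

with open("embeddings/msmarco/query.dl19.pkl", "rb") as f:   # placeholder query file for the chosen layer
    q_reps, q_ids = pickle.load(f)
with open("embeddings/msmarco/corpus.0.pkl", "rb") as f:     # placeholder corpus shard for the chosen layer
    p_reps, p_ids = pickle.load(f)

q = np.asarray(q_reps)[:, :d]  # truncate query vectors to the first d dimensions
p = np.asarray(p_reps)[:, :d]  # truncate passage vectors to the first d dimensions

scores = q @ p.T                             # inner-product similarity
top_k = np.argsort(-scores, axis=1)[:, :10]  # top-10 passage indices per query

print([p_ids[i] for i in top_k[0]])  # candidate passage ids for the first query
```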
44 changes: 22 additions & 22 deletions smae/README.md
@@ -6,28 +6,28 @@ and pre-trained on [HuggingFaceFW/fineweb](https://huggingface.co/datasets/Huggi
To run the SMAE pre-training, use the following command:
```bash
python pretrain_bert.py \
--output_dir checkpoints/bert-base-uncased-fineweb100bt-smae \
--save_steps 10000 \
--bf16 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 2 \
--learning_rate 1e-4 \
--lr_scheduler_type cosine \
--weight_decay 0.001 \
--warmup_ratio 0.05 \
--num_train_epochs 1 \
--logging_steps 100 \
--mlm_probability 0.2 \
--decoder_mlm_probability 0.4 \
--report_to wandb \
--matryoshka_pretraining True \
--mae_pretraining True \
--run_name bert-base-uncased-fineweb100bt-smae \
--dataloader_num_workers 16 \
--num_processes 32 \
--save_safetensors False \
--log_level info \
--logging_nan_inf_filter False
```
In our experiments, we use 8 NVIDIA H100 GPUs, which with `gradient_accumulation_steps` of 2 gives a total batch size of 512 (32 per device × 8 GPUs × 2 accumulation steps).
If you use Slurm, we also provide an example script for multi-node, multi-GPU pre-training: [pretrain-smae.sh](pretrain_smae.sh).
File renamed without changes.
2 changes: 1 addition & 1 deletion smae/pretrain_bert.py
@@ -11,7 +11,7 @@
from transformers import Trainer, AutoTokenizer, BertForMaskedLM
import torch
from dataclasses import dataclass, field
-from modelling import BertFor2DMatryoshkaMaskedLM, BertFor2DMaekMatryoshkaMaskedLM
+from modeling import BertFor2DMatryoshkaMaskedLM, BertFor2DMaekMatryoshkaMaskedLM
from data import MLMDataset, DataCollatorForWholeWordMaskWithAttentionMask, MaeDataCollatorForWholeWordMask

logger = logging.getLogger(__name__)
2 changes: 1 addition & 1 deletion sts/README.md
@@ -1,4 +1,4 @@
-# Starbucks SRL fine-tuning for STS
+# Starbucks Representation Learning (SRL) fine-tuning for STS

This repository contains the code for fine-tuning from any pre-trained model on the STS benchmark dataset.
This repo supports three types of fine-tuning:
