Sparse query vector #63

Open: wants to merge 9 commits into base: main

@maximedb
Contributor

maximedb commented Feb 9, 2022

This PR builds upon #62.

It refactors the sparse search to represent queries and documents as CSR matrices. The SPARTA model is updated to fit this setup.
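
To illustrate the idea, here is a minimal sketch of dot-product scoring when queries and documents are both stored as CSR matrices. It is not the code from this PR: the helper names `sparse_dot_scores` and `top_k` and the toy 5-term vocabulary are assumptions made for the example, and only standard `scipy.sparse` and NumPy calls are used.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_dot_scores(queries: csr_matrix, docs: csr_matrix) -> np.ndarray:
    """Return a dense (n_queries, n_docs) array of dot-product scores."""
    # Sparse-sparse product; only overlapping non-zero terms contribute.
    return (queries @ docs.T).toarray()


def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring documents per query, best first."""
    return np.argsort(-scores, axis=1)[:, :k]


# Toy vocabulary of 5 terms: rows are documents or queries, columns are term weights.
docs = csr_matrix(np.array([
    [0.0, 1.5, 0.0, 0.0, 2.0],
    [1.0, 0.0, 0.0, 3.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.0],
]))
queries = csr_matrix(np.array([
    [0.0, 1.0, 0.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 1.0, 0.0],
]))

scores = sparse_dot_scores(queries, docs)
ranking = top_k(scores, k=2)
```

For a real corpus the full score matrix would normally be computed in batches of queries or documents, since the dense `(n_queries, n_docs)` result can be much larger than the sparse inputs.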

It also adds a clean SPLADE model along with evaluation code. The SPLADE authors used a DenseRetrievalExactSearch in their demo script, but since SPLADE is labeled as a sparse model, it should use a SparseSearch in my opinion. The results are not directly comparable with the original ones, as this version uses co-condenser instead of distilbert as the base model. I could not find a download link for the original model.
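
For readers unfamiliar with why SPLADE output fits a sparse search, the sketch below shows the max-pooling step as I understand it from the SPLADE v2 paper: each vocabulary weight is the maximum over tokens of `log(1 + ReLU(logit))`. This is an assumption-laden illustration, not the model code in this PR; the function name `splade_max_pool` and the toy logits are made up, and a real model would produce the logits with a masked-language-model head.

```python
import numpy as np
from scipy.sparse import csr_matrix


def splade_max_pool(logits: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool token-level logits (seq_len, vocab) into one vocabulary-sized vector.

    attention_mask has shape (seq_len,) with 1 for real tokens and 0 for padding.
    """
    activations = np.log1p(np.maximum(logits, 0.0))      # log(1 + ReLU(x))
    activations = activations * attention_mask[:, None]  # zero out padding tokens
    return activations.max(axis=0)                       # max over the sequence


# Toy input: 3 tokens (the last one is padding), vocabulary of 3 terms.
logits = np.array([
    [-2.0, 1.0, 0.0],
    [3.0, -1.0, 0.0],
    [5.0, 5.0, 5.0],
])
attention_mask = np.array([1.0, 1.0, 0.0])

weights = splade_max_pool(logits, attention_mask)
# Most vocabulary entries end up at exactly zero, so a CSR row stores the vector compactly.
sparse_row = csr_matrix(weights.reshape(1, -1))
```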

Maxime.

@cadurosar

Hi Maxime,

Thanks for integrating SPLADE using CSR matrices! I will run it on my side and let you know whether it matches the numbers we have, dataset by dataset (for the CoCondenser version).

For the "original" SPLADE model, it is available here: https://github.com/naver/splade/tree/main/weights/distilsplade_max, but as individual files. We are working on making it available as a tar.gz as well.

@maximedb
Contributor Author

Hi Carlos,

> Thanks for integrating SPLADE using CSR matrices! I will run it on my side and let you know whether it matches the numbers we have, dataset by dataset (for the CoCondenser version).

Nice, thanks!

> For the "original" SPLADE model, it is available here: https://github.com/naver/splade/tree/main/weights/distilsplade_max, but as individual files. We are working on making it available as a tar.gz as well.

Uploading the model on the HuggingFace hub would also be possible (and easier to download).

@thakur-nandan
Member

Hi @maximedb,

Thanks again for making use of CSR matrices for SPLADE. I will have a look at the PR and merge it into BEIR soon.

A mention of a side project of mine: Sparse Retrieval (https://github.com/NThakur20/sparse-retrieval). We are currently developing a ready-to-use toolkit for efficient training and inference of neural sparse retrieval models such as SPLADE, SPARTA, uniCOIL, TILDE, and DeepImpact. The implementation of SPLADE with CSR matrices works. However, we find it better and more efficient to use an inverted index, such as the one provided by Pyserini. The project is planned to come out by the end of February! We will keep you updated.
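
As a rough illustration of the inverted-index alternative mentioned above, here is a toy in-memory version in plain Python. It is not Pyserini's API and not code from the Sparse Retrieval toolkit: `build_inverted_index`, `search`, and the example term weights are hypothetical names and values chosen for the sketch.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight != 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, k):
    """Score documents by dot product, visiting only postings of query terms."""
    scores = defaultdict(float)
    for term, query_weight in query_vector.items():
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


docs = {
    "d1": {"sparse": 1.5, "retrieval": 2.0},
    "d2": {"dense": 1.0, "retrieval": 0.5},
    "d3": {"index": 3.0},
}
index = build_inverted_index(docs)
hits = search(index, {"sparse": 1.0, "retrieval": 1.0}, k=2)
```

The point of the design is that query time depends on the posting lists of the query's terms rather than on the size of the whole collection; documents sharing no term with the query (`d3` here) are never touched.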

We will reproduce the various sparse baselines and additionally upload the models on HF.

Kind Regards,
Nandan Thakur

@maximedb
Contributor Author

maximedb commented Feb 10, 2022

Really cool stuff! The multi-gpu encoding is a super cool feature :-)

@cadurosar

cadurosar commented Feb 10, 2022

> Uploading the model on the HuggingFace hub would also be possible (and easier to download).

We would love to, but we are still working out internally how to do it. Here is a link to the "original" model, packaged in the same way as the new ones: https://download-de.europe.naverlabs.com/Splade_Release_Jan22/distilsplade_max.tar.gz

> A mention of a side project of mine: Sparse Retrieval (https://github.com/NThakur20/sparse-retrieval). We are currently developing a ready-to-use toolkit for efficient training and inference of neural sparse retrieval models such as SPLADE, SPARTA, uniCOIL, TILDE, and DeepImpact. The implementation of SPLADE with CSR matrices works. However, we find it better and more efficient to use an inverted index, such as the one provided by Pyserini. The project is planned to come out by the end of February! We will keep you updated.

It looks really cool Nandan, I've starred and will keep an eye on it :)

@arthur-75

Hello, thanks for this great work. I am using the uniCOIL model, but it does not give the same score as the one reported in the Google Sheet. I am using "castorini/unicoil-msmarco-passage".
castorini/unicoil-d2q-msmarco-passage does not exist anymore. Is it similar to "castorini/unicoil-msmarco-passage"?
Have a nice day.

@thakur-nandan
Member

Hi @arthur-75, if you are not using document expansion, use the castorini/unicoil-noexp-msmarco-passage model, which is the uniCOIL model fine-tuned without expansion. The one you are using is the one with expansion.

You should achieve an nDCG@10 of around 0.65683 with castorini/unicoil-noexp-msmarco-passage on the scifact dataset.
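
For reference, nDCG@10 can be sketched as below. This is a simplified stand-alone version for a single query, assuming linear gains and a log2 rank discount; it is not the evaluation code used by BEIR or SPRINT, and the function names and toy judgments are made up for the example. The reported benchmark number is the average of this quantity over all queries.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: rank 1 is discounted by log2(2), rank 2 by log2(3), ..."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids ordered by decreasing retrieval score.
    relevance: dict mapping doc id to graded relevance (missing ids count as 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant document, retrieved at rank 2 versus rank 1.
relevance = {"a": 1}
score_rank2 = ndcg_at_k(["b", "a", "c"], relevance)
score_rank1 = ndcg_at_k(["a", "b", "c"], relevance)
```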

Reference: https://github.com/thakur-nandan/sprint
Paper: https://arxiv.org/abs/2307.10488

The SPRINT toolkit implementation might be old, but you can use it to reproduce the score. Let me know if it doesn't work.

Regards,
Nandan

@arthur-75

> Hi @arthur-75, if you are not using document expansion, use the castorini/unicoil-noexp-msmarco-passage model, which is the uniCOIL model fine-tuned without expansion. The one you are using is the one with expansion.
>
> You should achieve an nDCG@10 of around 0.65683 with castorini/unicoil-noexp-msmarco-passage on the scifact dataset.
>
> Reference: https://github.com/thakur-nandan/sprint
> Paper: https://arxiv.org/abs/2307.10488
>
> The SPRINT toolkit implementation might be old, but you can use it to reproduce the score. Let me know if it doesn't work.
>
> Regards,
> Nandan

Hello @thakur-nandan, thank you very much for your response. I have tried castorini/unicoil-noexp-msmarco-passage with scifact, but unfortunately it is not working well. Here is my code:

from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.sparse import SparseSearch

# corpus, queries and qrels are assumed to be loaded already (scifact).
model_path = "castorini/unicoil-noexp-msmarco-passage"
model = SparseSearch(models.UniCOIL(model_path=model_path, max_length=512), batch_size=32)
retriever = EvaluateRetrieval(model, score_function="dot")

# Retrieve sparse results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries, query_weights=True)

ndcg, _map, recall, precision = retriever.evaluate(qrels, results, [1, 10])
ndcg
{'NDCG@1': 0.27,
 'NDCG@10': 0.43848}

Does SPRINT provide sparse vectors for documents and queries?

Thanks
