Sparse query vector #63
Hi Maxime, Thanks for integrating SPLADE using CSR matrices! I will be running it on my side and will let you know if it matches the numbers we have dataset by dataset (for the CoCodenser version). The "original" SPLADE model is available here: https://github.com/naver/splade/tree/main/weights/distilsplade_max, but only as individual files. We are working on making it available as a tar.gz as well.
Hi Carlos,
Nice, thanks!
Uploading the model on the HuggingFace hub would also be possible (and easier to download).
Hi @maximedb, Thanks again for making use of the CSR matrices for SPLADE. I will have a look at the PR and merge it into beir soon. A mention of a side project of mine: Sparse Retrieval (https://github.com/NThakur20/sparse-retrieval). We are currently developing a ready-to-use toolkit for efficient training and inference of all neural sparse retrieval models such as SPLADE, SPARTA, uniCOIL, TILDE, and DeepImpact. The implementation of SPLADE with CSR matrices works. However, we find it better and more efficient to use an inverted index such as Pyserini. The project is planned to come out by the end of February! We will keep you updated. We will reproduce the various sparse baselines and additionally upload the models on HF. Kind Regards,
Really cool stuff! The multi-gpu encoding is a super cool feature :-)
We would love to, but we are still seeing internally how we can do it. Here's a link for the "original model" in the same way as the new ones: https://download-de.europe.naverlabs.com/Splade_Release_Jan22/distilsplade_max.tar.gz
It looks really cool Nandan, I've starred and will keep an eye on it :)
Hello, thanks for this great work. I am using the uniCOIL model, but it does not give the same score that is reported in the Google sheet. I am using "castorini/unicoil-msmarco-passage".
Hi @arthur-75, if you are not using document expansion - use the You should achieve around an nDCG@10 of 0.65683 with Reference: https://github.com/thakur-nandan/sprint The SPRINT toolkit implementation might be old so you can use that to reproduce, let me know if it doesn't. Regards,
Hello @thakur-nandan, thank you very much for your response. I have tried:
model_path = "castorini/unicoil-noexp-msmarco-passage"
model = SparseSearch(models.UniCOIL(model_path=model_path, max_length=512), batch_size=32)
retriever = EvaluateRetrieval(model, score_function="dot")
#### Retrieve sparse results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries, query_weights=True)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, [1, 10])
ndcg
{'NDCG@1': 0.27,
 'NDCG@10': 0.43848}
Does SPRINT provide (sparse) vectors for docs and queries? Thanks
This PR builds upon #62.
It refactors the sparse search to represent queries and documents as CSR matrices. The SPARTA model is updated to fit this setup.
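To illustrate the idea of scoring with CSR-style storage, here is a minimal, dependency-free sketch. It is not the code from this PR: the helper names `to_csr` and `dot_scores`, the token ids, and the weights are all hypothetical, and a real implementation would more likely rely on a sparse matrix library than on Python lists.

```python
from typing import Dict, List, Tuple

CSR = Tuple[List[int], List[int], List[float]]


def to_csr(rows: List[Dict[int, float]]) -> CSR:
    """Pack sparse rows {token_id: weight} into CSR arrays (indptr, indices, data).

    Hypothetical helper for illustration only.
    """
    indptr, indices, data = [0], [], []
    for row in rows:
        for token_id in sorted(row):
            indices.append(token_id)
            data.append(row[token_id])
        # indptr[i + 1] marks where row i ends in indices/data
        indptr.append(len(indices))
    return indptr, indices, data


def dot_scores(query: Dict[int, float], docs_csr: CSR) -> List[float]:
    """Dot product between one sparse query and every document row."""
    indptr, indices, data = docs_csr
    scores = []
    for d in range(len(indptr) - 1):
        score = 0.0
        # Only the non-zero entries of document d are visited
        for k in range(indptr[d], indptr[d + 1]):
            score += query.get(indices[k], 0.0) * data[k]
        scores.append(score)
    return scores


# Toy data (made up): three documents and one query over a 4-token vocabulary
docs = [{0: 1.0, 3: 2.0}, {1: 0.5}, {0: 0.5, 1: 1.0, 3: 1.0}]
query = {0: 2.0, 3: 1.0}

docs_csr = to_csr(docs)
scores = dot_scores(query, docs_csr)
ranking = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
```

Because only non-zero weights are stored and visited, the cost of scoring depends on the number of active tokens per document rather than on the vocabulary size.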
It also adds a clean SPLADE model along with evaluation code. The SPLADE authors used a DenseRetrievalExactSearch in their demo script, but as SPLADE is labeled as a sparse model, it should use a SparseSearch in my opinion. The results are not directly comparable, as it uses the co-condenser instead of distilbert as base model. I could not find a URL to download the original model.
Maxime.