Add model jxm/cde-small-v1 #1521

Open · wants to merge 12 commits into main
2 changes: 1 addition & 1 deletion mteb/leaderboard/table.py
@@ -101,7 +101,7 @@ def get_means_per_types(df: pd.DataFrame) -> pd.DataFrame:
def failsafe_get_model_meta(model_name):
    try:
        return get_model_meta(model_name)
-   except Exception as e:
+   except Exception:
Collaborator:

Since your PR is not concerned with the leaderboard, you probably shouldn't include changes to it in this PR.

Contributor Author:

Yes, I believe that was a result of running make lint; however, I can leave that out.

        return None


34 changes: 34 additions & 0 deletions mteb/models/cde-small-v1_model.py
@@ -0,0 +1,34 @@
from __future__ import annotations

import mteb

model = mteb.get_model(
Collaborator:

I'm not sure if I understand this correctly, but it seems like you did not add a model implementation or model metadata for CDEs. I'm also unsure whether this would work or not. Their official guide on how to use CDE is a bit more complicated than this: all of their guides have a first and a second stage, where they first produce a corpus embedding and then pass it along to the model when embedding new documents.
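For context, the model card's recommended usage looks roughly like this two-stage flow. This is a sketch adapted from the jxm/cde-small-v1 readme; the prompt_name and dataset_embeddings keywords come from their remote code, so verify them against the current model card:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)

# Stage 1: embed a small, representative sample of the target corpus to
# obtain the contextual "dataset embeddings" the model conditions on.
# The sample size should match the model's expected transductive corpus
# size (see their docs).
minicorpus_docs = ["..."]  # placeholder: documents sampled from the corpus
dataset_embeddings = model.encode(
    minicorpus_docs,
    prompt_name="document",
    convert_to_tensor=True,
)

# Stage 2: embed documents and queries, conditioned on the corpus embedding.
doc_embeddings = model.encode(
    ["some document"],
    prompt_name="document",
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)
query_embeddings = model.encode(
    ["some query"],
    prompt_name="query",
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)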

Collaborator:

I see, but I still think it's better not to implement the model incorrectly here, and maybe just add metadata for it, then ask the CDE team to upload their results to the results repository.
I don't see much value in adding a script here that does not use CDEs as they are supposed to be used.

Collaborator:

I agree with you. I added the evaluation script just for information, to show the authors' implementation.

Contributor Author:

@x-tabdeveloping

I didn't explicitly define the model metadata because when I ran mteb.get_model_meta, the output seemed correct. However, I may have misunderstood and overlooked the need to define the model metadata explicitly.

I also have the results from when I ran the script. Should I disregard those?

I'm a bit unsure about the next steps I should take and would appreciate your guidance. Thank you!

Collaborator:

You need to check whether your results match the results reported by the authors. You can check their results on https://huggingface.co/spaces/mteb/leaderboard
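For example, one way to eyeball the local scores is the sketch below. It assumes mteb's result JSON layout, where each task file carries a "scores" mapping from split to entries with a "main_score"; adjust to the actual files in your output folder:

import json
from pathlib import Path

# Print the main_score for every task result written to the `results`
# output folder, so they can be compared against the public leaderboard.
for path in Path("results").rglob("*.json"):
    data = json.loads(path.read_text())
    for split, entries in data.get("scores", {}).items():
        for entry in entries:
            print(path.stem, split, entry.get("main_score"))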

Collaborator:

@YashDThapliyal When you run get_model_meta, it can only do so much. It fetches some information about the model, but likely won't have any information on the supported languages, release date, license, etc.
I'd encourage you to add a metadata object for the model and obtain this information from their paper or their HF repo readme.

I'm pretty sure the results are not entirely correct, as their use of the model includes producing a contextual corpus embedding, and only then embedding documents. I think you should take a look at the link @Samoed sent for their eval script.

I might be wrong. If your results match theirs, then sorry for the trouble. I'd look into this, though.
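For illustration, the metadata object could look something like the sketch below. The exact ModelMeta fields vary between mteb versions, the variable name cde_small_v1 is hypothetical, and every value marked as a placeholder needs to be checked against the paper or the HF repo readme:

from mteb.model_meta import ModelMeta

cde_small_v1 = ModelMeta(
    name="jxm/cde-small-v1",
    revision="...",  # pin the last commit SHA on the HF repo
    release_date="2024-10-01",  # placeholder; confirm from the HF repo
    languages=["eng-Latn"],  # placeholder; confirm supported languages
    license="mit",  # placeholder; confirm from the model card
    framework=["Sentence Transformers", "PyTorch"],
    embed_dim=768,  # placeholder; confirm from the model config
    max_tokens=512,  # placeholder; confirm from the model config
    n_parameters=281_000_000,  # placeholder; confirm from the HF repo
)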

Contributor Author:

@Samoed @x-tabdeveloping Thank you for highlighting the corpus embedding note. I will take a look at that and ensure that the model has been implemented correctly, and I'll update the model metadata with additional information as needed.

Once the implementation is complete, I’ll compare the results with the existing ones to ensure alignment.

Quick question: What would be the best approach to define the model revision in this context?

Thank you for your guidance!

Best regards,
Yash

Collaborator:

Thanks! I'd take the last commit's ID on their HF repo as the revision.
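One way to grab that commit ID programmatically is the small sketch below, using huggingface_hub (not something this PR currently does):

from huggingface_hub import HfApi

# ModelInfo.sha is the latest commit ID on the Hub repo; pin it as `revision`.
print(HfApi().model_info("jxm/cde-small-v1").sha)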

"jxm/cde-small-v1",
trust_remote_code=True,
model_prompts={"query": "search_query: ", "passage": "search_document: "},
)
tasks = mteb.get_tasks(
    tasks=[
        # classification
        "AmazonCounterfactualClassification",
        # clustering
        "RedditClustering",
        # pair classification
        "TwitterSemEval2015",
        # reranking
        "AskUbuntuDupQuestions",
        # retrieval
        "SCIDOCS",
        # sts
        "STS22",
        # summarization
        "SummEval",
    ]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    output_folder="results",
    extra_kwargs={"batch_size": 8},
    overwrite_results=True,
)