Annotate models trained on commercial-use-friendly licensed data #1634

Open
nramrakhiyani opened this issue Dec 27, 2024 · 5 comments
Labels
enhancement (New feature or request), leaderboard (issues related to the leaderboard)

Comments

@nramrakhiyani

Suppose one is faced with a strict guideline requiring models whose license permits commercial use (there are many such models on the MTEB leaderboard) and, more importantly, whose training data is also licensed for commercial use. This rules out models such as all-MiniLM-L6-v2 or all-mpnet-base-v2, since their training data includes datasets that are restricted to research/academic use (e.g. MS MARCO). Similarly, the E5 paper reports using Common Crawl, which is a legal grey area. Has anyone encountered this scenario and found a model that satisfies both the model-license and training-data-license constraints? Any guidance would be valuable. Thank you.

@KennethEnevoldsen KennethEnevoldsen added the leaderboard issues related to the leaderboard label Dec 27, 2024
@KennethEnevoldsen KennethEnevoldsen changed the title from "Models trained on commercial-use-friendly licensed data" to "Annotate models trained on commercial-use-friendly licensed data" Dec 27, 2024
@KennethEnevoldsen KennethEnevoldsen added the enhancement New feature or request label Dec 27, 2024
@KennethEnevoldsen
Contributor

@nramrakhiyani I know that ScandEval, of which I am a collaborator, annotates such cases (see its FAQ). However, I know that they include, for example, Llama 3.2 models as commercially viable, even though their training dataset is not public.

Generally, though, I think the pre-training data often makes this area a bit grey, especially when it is not public.

As far as I know, it is not settled that training on data not intended for commercial use makes the model unviable for commercial use (e.g. we wouldn't consider word frequencies estimated from newspaper data problematic).

The best thing we can probably do is add the stated license of each model to the leaderboard. However, I don't think we have the legal knowledge to determine whether commercial use is viable (if someone has that knowledge and would like to share it, we would be more than happy to incorporate it into the leaderboard).

@nramrakhiyani
Author

Thanks for your prompt help and answer, @KennethEnevoldsen. I checked the ScandEval FAQ, and the distinction is clearly highlighted, which is good.

I also agree with the point that commercially deploying a model trained on data not intended for commercial use should not be a concern, since the model is a derivative that is not intended to reproduce or generate the training data. However, in the specific case we are facing, we are bound by a very strict legal requirement that both the model and its training data be commercially friendly. Hence, I wanted to know whether any text embedding models satisfy this requirement. Before starting to do this myself, I was looking for someone who has already made the effort to remove the non-commercial datasets from the list of all-MiniLM-L6-v2's training datasets and train a local model (trading off some accuracy).
Thanks once again.
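
For anyone attempting the same thing, the dataset-filtering step might look roughly like the sketch below. It assumes a hand-maintained manifest of training datasets with their license identifiers; the `TrainingDataset` class, the `split_by_license` helper, the allowlist, and the manifest entries are all hypothetical placeholders. The license strings only mirror what is said in this thread and would need to be verified against each dataset's actual terms.

```python
from dataclasses import dataclass

# Licenses treated as commercial-friendly in this sketch.
# This allowlist is an assumption for illustration, not legal advice.
COMMERCIAL_FRIENDLY = {"apache-2.0", "mit", "cc-by-4.0", "cc0-1.0"}


@dataclass(frozen=True)
class TrainingDataset:
    name: str
    license: str  # license identifier; "unknown" if not stated


def split_by_license(datasets, allowed=COMMERCIAL_FRIENDLY):
    """Return (kept, dropped): datasets whose license is on the allowlist, and the rest."""
    kept = [d for d in datasets if d.license.lower() in allowed]
    dropped = [d for d in datasets if d.license.lower() not in allowed]
    return kept, dropped


# Hypothetical manifest; license values are placeholders based on this thread.
candidates = [
    TrainingDataset("natural-questions", "apache-2.0"),
    TrainingDataset("ms-marco", "non-commercial"),
    TrainingDataset("example-web-crawl", "unknown"),
]

kept, dropped = split_by_license(candidates)
print("keep:", [d.name for d in kept])
print("drop:", [d.name for d in dropped])
```

Treating an unknown license as "drop" is the conservative choice here; the kept list would then be the input to whatever training pipeline is used.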

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Dec 27, 2024

Hmm, I am not aware of such a case. @isaac-chung or @orionw might know of one [a model where both the training data and the model are under commercial-friendly licensing].

@orionw
Contributor

orionw commented Dec 27, 2024

I think NQ is Apache licensed (https://github.com/google-research-datasets/natural-questions), but @KennethEnevoldsen is right that most models are trained on MS MARCO and/or a combination of many datasets. The only model I know of that is trained solely on NQ is DPR, which is quite old.

However, NQ is a pretty large resource, and you could likely use it to make a better model these days.

@nramrakhiyani
Author

nramrakhiyani commented Dec 29, 2024

Thanks, @KennethEnevoldsen, for tagging additional interested researchers.
Thanks also for the pointers, @orionw. I checked NQ, which is under the commercially friendly Apache license, but sadly DPR is shared under a non-commercial license. I also agree that I may have to collect such commercial-friendly datasets and train an embedding model from scratch.
I will keep searching and will edit the issue if I find or train such a model. Hoping for other responses as well. Thanks for the support.
