Annotate models trained on commercial-use-friendly licensed data #1634

nramrakhiyani · 2024-12-27T09:16:35Z

Suppose one is bound by a strict guideline to use only models whose license allows commercial use (there are many such models on the MTEB leaderboard) and, more importantly, whose training data is also licensed for commercial use. This rules out models such as all-MiniLM-L6-v2 or all-mpnet-base-v2, as their training data includes some datasets that are only allowed for research/academic use (e.g. MS MARCO). Similarly, the e5 model paper reports use of Common Crawl, which is a legal grey area. Has anyone encountered this kind of scenario and found a model that satisfies both the model and the training data license constraints? Any guidance in this regard would be valuable. Thank you.

KennethEnevoldsen · 2024-12-27T09:50:00Z

@nramrakhiyani I know that ScandEval, of which I am a collaborator, annotates such cases (see its FAQ). However, it lists e.g. the Llama 3.2 models as commercially viable, even though their training data is not public.

Generally, though, I do think that the pre-training data often makes this area a bit grey, especially when it is not public.

As far as I know it is not entirely certain that training on data not intended for commercial use makes the model unviable for commercial use (e.g. we wouldn't consider word frequencies estimated from newspaper data problematic).

The best thing we can probably do is add the stated license of the model to the leaderboard. However, I don't think we have the legal knowledge to discern viable commercial use (if someone has that knowledge and would like to share it, we would be more than happy to incorporate it into the leaderboard).

nramrakhiyani · 2024-12-27T10:33:23Z

Thanks for your prompt help and answer, @KennethEnevoldsen. I checked the ScandEval FAQ and the distinction is clearly highlighted, which is good.

I also agree with the point that it should not be a concern to commercially deploy a model trained on data not intended for commercial use, as the model is a derivative that is not intended to reproduce or generate the training data. However, in the specific case we are facing, we are bound by a strict legal requirement that both the model and its training data be commercial-friendly. Hence, I was interested in knowing whether any text embedding models satisfy this requirement. Before starting to do this myself, I was hoping to find someone who has already put in the effort to remove the non-commercial datasets from all-MiniLM-L6-v2's list of training datasets and train a local model (at the cost of some accuracy).
Thanks once again.

KennethEnevoldsen · 2024-12-27T16:28:36Z

Hmm, I am not aware of such a case. @isaac-chung or @orionw might know of one [a model where both training data and model are under commercial-friendly licensing].

orionw · 2024-12-27T16:34:46Z

I think NQ is Apache licensed (https://github.com/google-research-datasets/natural-questions), but @KennethEnevoldsen is right that most models are trained on MS MARCO and/or a combination of many datasets. The only model I know of that was trained solely on NQ is DPR, which is quite old.

However, NQ is a pretty large resource and you could likely use it to make a better model these days.

nramrakhiyani · 2024-12-29T14:50:55Z

Thanks, @KennethEnevoldsen, for tagging additional interested researchers.
Also, thanks for the pointers, @orionw. I checked NQ, which is under the commercial-friendly Apache license, but sadly DPR is shared under a non-commercial license. I also agree that I may have to collect such commercial-friendly datasets and train an embedding model from scratch.
I will keep searching and will edit the issue if I find or train such a model. Hoping for other responses as well. Thanks for the support.
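
For anyone attempting the same, a minimal sketch of the dataset-filtering step could look like the following. It assumes a hand-maintained manifest that records a license tag per training dataset. The `DatasetEntry` type, the `split_by_license` helper, and the `COMMERCIAL_OK` allowlist are hypothetical names introduced for illustration, and the two license tags simply mirror how the datasets are described in this thread; they have not been verified against the actual license texts.

```python
from dataclasses import dataclass

# Assumption: licenses treated as commercial-friendly for this illustration.
# The right allowlist is a legal decision, not something this sketch settles.
COMMERCIAL_OK = {"apache-2.0", "mit", "cc-by-4.0", "cc0-1.0"}


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    license: str  # license tag recorded by hand in the manifest


def split_by_license(entries, allowed=COMMERCIAL_OK):
    """Return (kept, dropped) lists based on a case-insensitive allowlist check."""
    kept, dropped = [], []
    for entry in entries:
        if entry.license.lower() in allowed:
            kept.append(entry)
        else:
            dropped.append(entry)
    return kept, dropped


# Hypothetical manifest: tags follow the descriptions given in this thread
# and are NOT verified against the datasets' actual license texts.
manifest = [
    DatasetEntry("MS MARCO", "research-only"),
    DatasetEntry("Natural Questions", "Apache-2.0"),
]

kept, dropped = split_by_license(manifest)
print("keep:", [e.name for e in kept])
print("drop:", [e.name for e in dropped])
```

Keeping the dropped list alongside the kept one makes it easy to document which datasets were excluded and why, which is likely useful when the requirement comes from a legal review.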

isaac-chung · 2025-02-14T04:21:55Z

Closing this for now. Feel free to reopen if anything relevant pops up.

KennethEnevoldsen added the "leaderboard" label (issues related to the leaderboard) Dec 27, 2024

KennethEnevoldsen changed the title ~~Models trained on commercial-use-friendly licensed data~~ Annotate models trained on commercial-use-friendly licensed data Dec 27, 2024

KennethEnevoldsen added the "enhancement" label (New feature or request) Dec 27, 2024

isaac-chung closed this as completed Feb 14, 2025
