Annotate models trained on commercial-use-friendly licensed data #1634
@nramrakhiyani I know that ScandEval, of which I am a collaborator, annotates such cases (see its FAQ). However, they do, for example, include the Llama 3.2 models as commercially viable even though their training data is not public. Generally, though, I do think the pre-training data often makes this area a bit grey, especially when it is not public. As far as I know, it is not entirely certain that training on data not intended for commercial use makes the model unviable for commercial use (e.g., we wouldn't consider word frequencies estimated from newspaper data problematic). The best thing we can probably do is add the stated license of the model to the leaderboard; however, I don't think we have the legal knowledge to discern viable commercial use. (If someone has that knowledge and would like to share it, we would be more than happy to incorporate it into the leaderboard.)
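As a rough illustration of what "adding the stated license of the model to the leaderboard" could involve, the sketch below reads the `license:` field from a model card's YAML front matter. The card text in the example is made up for illustration, and, as noted above, a stated license alone does not establish that the training data is also commercially usable.

```python
# Hedged sketch: extract the stated `license:` value from a model card's
# YAML front matter. The example card below is invented for illustration.

def stated_license(card_text: str):
    """Return the `license:` value from a card's YAML header, or None."""
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no front matter block
    for line in lines[1:]:
        if line.strip() == "---":  # end of front matter
            break
        if line.startswith("license:"):
            return line.split(":", 1)[1].strip()
    return None

example_card = """---
license: apache-2.0
language: en
---
# Example model card
"""
print(stated_license(example_card))  # apache-2.0
```

Note that this only surfaces the self-reported model license; the license of each training dataset would still need to be checked separately.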
Thanks for your prompt help and answer, @KennethEnevoldsen. I checked the ScandEval FAQ, and the distinction is clearly highlighted, which is good. I also agree that commercially deploying a model trained on data not intended for commercial use should not itself be a concern, since the model is a derivative not intended to reproduce or generate the training data. However, in a specific case we are facing, we are bound by a very strictly arranged legal requirement stating commercial-friendliness of both the model and its training data. Hence, I was interested in knowing whether any text embedding models satisfy this requirement. Before starting to do this myself, I was looking for someone who has already spent the effort to remove the subset of non-commercial datasets from all-MiniLM-L6-v2's list of training datasets and train a model on the remainder (trading off some accuracy).
Hmm, I am not aware of such a case. @isaac-chung or @orionw might know of one [a model where both the training data and the model are under commercial-friendly licensing].
I think NQ is Apache-licensed (https://github.com/google-research-datasets/natural-questions), but @KennethEnevoldsen is right that most models are trained on MS MARCO and/or a combination of many datasets. The only model I know of trained solely on NQ is DPR, which is quite old. However, NQ is a pretty large resource, and you could likely use it to make a better model these days.
Thanks, @KennethEnevoldsen, for tagging additional interested researchers.
Suppose one is faced with a strict guideline to use models licensed to allow commercial use (there are many such models on the MTEB leaderboard) but, more importantly, models trained on data that is also licensed to allow commercial use. This rules out models such as all-MiniLM-L6-v2 or all-mpnet-base-v2, as their training data includes some datasets allowed only for research/academic use (e.g., MS MARCO). Similarly, the E5 model paper reports use of Common Crawl, which is legally grey. Has anyone encountered this kind of scenario and found a model that satisfies both the model and training-data license constraints? Any guidance in this regard would be valuable. Thank you.