Improve the dependency to the models.csv file from ecologits #4
Comments
@samuelrince Maybe we can discuss this together with @inimaz
Yes, good point, and it's the same question in the Python package itself: how can we sync updates of models.csv without needing to release a new version, while also keeping it local-first? If you have thought of a process or an architecture, we can work on that for both libs.
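It heavily depends on how often models.csv is going to be updated. Another alternative to the submodule: we could generate the models.csv file per release, i.e. fetch it in the CI, save it in a file and include it in the package at build time.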
That's good if we can trigger a new release for each release of the Python lib.
The drawback of the release approach is that users will have to update their dependencies to get the newest models, vs. just a code update if it's dynamic.
On Tue, 25 Jun 2024 at 17:49, inimaz wrote:
It heavily depends on how often models.csv is going to be updated.
Another alternative to the submodule: we could generate the models.csv file per release, i.e. fetch it in the CI, save it in a file and include it in the package at build time.
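The per-release approach quoted above (fetch models.csv in CI, save it, bundle it at build time) could be sketched roughly as follows. This is only an illustration under assumptions: the URL is the raw file URL from the curl example in this thread, and `download` and `snapshot_models` are hypothetical helper names, not part of ecologits or ecologits.js.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# Raw file URL taken from the curl example in this thread.
MODELS_URL = (
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)


def download(url: str) -> bytes:
    """Fetch the raw file over HTTPS (meant to run in CI, needs network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def snapshot_models(dest: Path, fetch: Callable[[str], bytes] = download) -> Path:
    """Save a copy of models.csv at `dest` so the build step can bundle it."""
    data = fetch(MODELS_URL)
    if not data.strip():
        # Hypothetical safety check: never ship an empty data file.
        raise ValueError("refusing to bundle an empty models.csv")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return dest
```

The `fetch` argument is injectable so the step can be exercised without network access. In CI it would be called once before packaging, for example `snapshot_models(Path("src/data/models.csv"))`, where the destination path is also an assumption.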
I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.
For each release of the model_rep we generate a "database file" that contains all the models and hypotheses we used. This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).
Each time a new release of a client (ecologits or ecologits.js) happens, we inject the latest version of model_rep in CI. Plus, we can add a mechanism to check (max once a day?) if a new version of model_rep is available directly in the clients (with an opt-out flag available). Thus, we can always have the latest version of model_rep without an update of the clients.
This can also help us increase transparency on our hypotheses. E.g. if we change the parameters for gpt-4 over time, users can clearly see when we made the update and why.
If we go for that solution, I would consider updating the format as well to make it more flexible, probably by generating a JSON file. Example format (to challenge):
[
{
"type": "model",
"provider": "openai",
"name": "gpt-3.5-turbo-01-0125",
"architecture": {
"type": "dense",
"parameters": {
"min": 20,
"max": 70
}
},
"warnings": [
"model_achitecture_not_released"
],
"sources": [
"https://platform.openai.com/docs/models/gpt-3-5-turbo",
"https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
]
},
{
"type": "alias",
"provider": "openai",
"name": "gpt-3.5-turbo",
"alias": "gpt-3.5-turbo-01-0125"
},
{
"type": "model",
"provider": "openai",
"name": "gpt-4-turbo-2024-04-09",
"architecture": {
"type": "moe",
"parameters": {
"total": 880,
"active": {
"min": 110,
"max": 440
}
}
},
"warnings": [
"model_achitecture_not_released",
"model_achitecture_multimodal"
],
"sources": [
"https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4",
"https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
]
},
{
"type": "model",
"provider": "mistralai",
"name": "open-mistral-7b",
"architecture": {
"type": "dense",
"parameters": 7.3
},
"warnings": null,
"sources": [
"https://docs.mistral.ai/models/#sizes"
]
}
]
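To see how a client might consume the proposed format, here is a minimal sketch that follows "alias" entries and normalizes the three parameter shapes shown in the example (a plain number, a min/max range, and the "active" range of a "moe" model). The function names `find_model`, `as_range` and `active_parameters` are hypothetical, and the field layout is assumed to match the example exactly.

```python
from typing import Any, Dict, List, Optional, Tuple

Entry = Dict[str, Any]


def find_model(entries: List[Entry], provider: str, name: str,
               max_hops: int = 5) -> Optional[Entry]:
    """Look up (provider, name), following "alias" entries to a "model" entry."""
    by_key = {(e["provider"], e["name"]): e for e in entries}
    for _ in range(max_hops):
        entry = by_key.get((provider, name))
        if entry is None or entry["type"] == "model":
            return entry
        name = entry["alias"]  # hop to the aliased model name
    raise ValueError("alias chain too long (possible cycle)")


def as_range(value: Any) -> Tuple[float, float]:
    """Normalize a parameter count (number or {"min", "max"}) to (min, max)."""
    if isinstance(value, dict):
        return float(value["min"]), float(value["max"])
    return float(value), float(value)


def active_parameters(entry: Entry) -> Tuple[float, float]:
    """Range of parameters used per inference: "active" for moe, all for dense."""
    arch = entry["architecture"]
    if arch["type"] == "moe":
        return as_range(arch["parameters"]["active"])
    return as_range(arch["parameters"])
```

With the example data, looking up "gpt-3.5-turbo" for "openai" would resolve through the alias to the "gpt-3.5-turbo-01-0125" entry. The `max_hops` guard is a design choice of this sketch to avoid looping forever on a cyclic alias.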
Sorry to just barge into this conversation, I have a few pointers that might hopefully be useful.
If we're expecting the database file to update often (e.g. more than once a week), then yes, moving it into a separate repository will help decouple the release cycles of ecologits and ecologits.js from the database file.
We probably won't need the GitHub API to pull the file - you can pull the raw version of the file from the repo directly like this: models.csv (no cost or login credentials required for this 😌).
That'll be awesome because it may make supporting dynamic fields based on the model type easier (4th point in description here).
The idea is to have some kind of an API where we can check if the file has changed or not before downloading a new version. Like compare the hash of local vs remote and decide if we need to update or not.
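The local-vs-remote hash comparison could look something like the sketch below. It assumes the remote side publishes a SHA-256 hex digest of the file somewhere (for instance next to the release); that publication mechanism and the names `sha256_hex` and `needs_update` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_update(local_file: Path, remote_digest: str) -> bool:
    """True when the local copy is missing or its hash differs from the remote one."""
    if not local_file.exists():
        return True
    local_digest = sha256_hex(local_file.read_bytes())
    # Tolerate whitespace and upper-case hex in the published digest.
    return local_digest != remote_digest.strip().lower()
```

Only the short digest has to be fetched on each check; the full file is downloaded only when `needs_update` returns True.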
This is a perfect use case for ETags. File version check logic might look something like this:
Sample curl: curl -X HEAD -I https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv
In the sample curl output you'll see the etag currently set for models.csv.
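As a rough sketch of the etag-based check, a client can send the last etag it saw in an `If-None-Match` header and skip the download when the server answers 304 Not Modified. This assumes the server honors conditional requests; `conditional_request` and `fetch_if_changed` are hypothetical names, and where the previous etag is stored between runs is left out.

```python
from typing import Optional, Tuple
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Same raw file URL as in the sample curl above.
MODELS_URL = (
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)


def conditional_request(url: str, etag: Optional[str]) -> Request:
    """Build a GET request that asks for the body only if the etag changed."""
    headers = {"If-None-Match": etag} if etag else {}
    return Request(url, headers=headers)


def fetch_if_changed(url: str, etag: Optional[str]) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new_etag), or (None, etag) when the file is unchanged."""
    try:
        with urlopen(conditional_request(url, etag), timeout=30) as response:
            return response.read(), response.headers.get("ETag")
    except HTTPError as err:
        if err.code == 304:  # Not Modified: keep the local copy
            return None, etag
        raise
```

On first use the client would call `fetch_if_changed(MODELS_URL, None)`, store the returned etag next to the local file, and pass it back on later checks.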
There is a tradeoff between fetching a remote file and having a desynchronized copy locally.
Solutions: