Improve the dependency on the models.csv file from ecologits #4

Open · ycouble opened this issue Jun 25, 2024 · 8 comments

Comments

ycouble (Collaborator) commented Jun 25, 2024

There is a tradeoff between fetching a remote file and having a desynchronized local copy.

Solutions:

  • caching (if remote); see the sketch after this list
  • submodule
  • ...
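
A minimal sketch of the caching option in Python, using the raw GitHub URL for models.csv cited later in this thread; the cache location is an assumption:

# Sketch: prefer the remote models.csv, fall back to a previously
# cached copy when the network is unavailable.
import urllib.request
from pathlib import Path

MODELS_URL = (  # raw file location on GitHub
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)
CACHE_PATH = Path.home() / ".cache" / "ecologits" / "models.csv"  # assumed

def load_models_csv() -> str:
    try:
        with urllib.request.urlopen(MODELS_URL, timeout=5) as resp:
            data = resp.read().decode("utf-8")
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(data)  # refresh the local cache
        return data
    except OSError:
        return CACHE_PATH.read_text()  # offline: serve the cached copy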
ycouble (Collaborator, Author) commented Jun 25, 2024

@samuelrince Maybe we can discuss this together with @inimaz.

samuelrince (Member) commented

Yes, good point, and it's the same question in the Python package itself. How can we sync updates of models.csv without needing to release a new version, while also keeping it local-first?

If you have thought of a process or an architecture, we can work on that for both libs.

inimaz (Collaborator) commented Jun 25, 2024

It heavily depends on how often models.csv is going to be updated.

Another alternative to the submodule: we could generate the models.csv file per release, i.e. fetch it in CI, save it to a file, and include it in the package at build time.
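
A rough sketch of that CI step in Python; the destination path inside the package is an assumption:

# Sketch: run in CI before packaging so the latest models.csv ships
# inside the release artifact.
import urllib.request

MODELS_URL = (
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)

urllib.request.urlretrieve(MODELS_URL, "ecologits/data/models.csv")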

ycouble (Collaborator, Author) commented Jun 25, 2024 via email

samuelrince (Member) commented

I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.

For each release of model_rep we generate a "database file" that contains all the models and hypotheses we used. This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).

Each time a new release of a client (ecologits or ecologits.js) happens, we inject the latest version of model_rep in CI. Plus, we can add a mechanism in the clients themselves to check (at most once a day?) whether a new version of model_rep is available (with an opt-out flag).

Thus, we can always have the latest version of model_rep without updating the clients.
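
A sketch of what the once-a-day check with an opt-out could look like in a client; the environment variable name and stamp-file location are assumptions:

# Sketch: rate-limit update checks to once a day and honor an opt-out.
import os
import time
from pathlib import Path

CHECK_INTERVAL = 24 * 60 * 60  # at most one check per day, in seconds
STAMP = Path.home() / ".cache" / "ecologits" / "last_check"  # assumed

def should_check_for_update() -> bool:
    if os.environ.get("ECOLOGITS_NO_UPDATE"):  # hypothetical opt-out flag
        return False
    if STAMP.exists() and time.time() - STAMP.stat().st_mtime < CHECK_INTERVAL:
        return False  # already checked within the last 24 h
    STAMP.parent.mkdir(parents=True, exist_ok=True)
    STAMP.touch()  # record this check
    return True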

This can also help us increase transparency on our hypotheses. E.g. if we change the parameters for gpt-4 over time, users can clearly see when we made the update and why.

If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.

Example format (to challenge):

[
    {
        "type": "model",
        "provider": "openai",
        "name": "gpt-3.5-turbo-01-0125",
        "architecture": {
            "type": "dense",
            "parameters": {
                "min": 20,
                "max": 70
            }
        },
        "warnings": [
            "model_achitecture_not_released"
        ],
        "sources": [
            "https://platform.openai.com/docs/models/gpt-3-5-turbo",
            "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
        ]
    },
    {
        "type": "alias",
        "provider": "openai",
        "name": "gpt-3.5-turbo",
        "alias": "gpt-3.5-turbo-01-0125"
    },
    {
        "type": "model",
        "provider": "openai",
        "name": "gpt-4-turbo-2024-04-09",
        "architecture": {
            "type": "moe",
            "parameters": {
                "total": 880,
                "active": {
                    "min": 110,
                    "max": 440
                }
            }
        },
        "warnings": [
            "model_achitecture_not_released",
            "model_achitecture_multimodal"
        ],
        "sources": [
            "https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4",
            "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
        ]
    },
    {
        "type": "model",
        "provider": "mistralai",
        "name": "open-mistral-7b",
        "architecture": {
            "type": "dense",
            "parameters": 7.3
        },
        "warnings": null,
        "sources": [
            "https://docs.mistral.ai/models/#sizes "
        ]
    }
]
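
One nice property of the alias entries is that clients can resolve a friendly name to a concrete model with a small lookup. A sketch against the example above ("models.json" is a hypothetical file name):

# Sketch: load the proposed format and resolve alias entries down to
# concrete model entries.
import json

with open("models.json") as f:  # hypothetical file in the format above
    entries = json.load(f)

by_name = {(e["provider"], e["name"]): e for e in entries}

def resolve(provider: str, name: str) -> dict:
    entry = by_name[(provider, name)]
    while entry["type"] == "alias":  # follow alias chains
        entry = by_name[(provider, entry["alias"])]
    return entry

print(resolve("openai", "gpt-3.5-turbo")["architecture"])  # the dense 20-70 B entry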

omkar-foss commented Aug 8, 2024

Sorry to barge into this conversation; I have a few pointers that might hopefully be useful.

> I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.

If we're expecting the database file to update often (e.g. more than once a week), then yes, moving it into a separate repository will help decouple the release cycles of ecologits and ecologits.js from the database file.

> This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).

Probably won't need the GitHub API to pull the file: you can pull the raw version of the file from the repo directly at https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv (no cost or login credentials required for this 😌).

> If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.

That'll be awesome because it may make supporting dynamic fields based on the model type easier (4th point in the description here).

samuelrince (Member) commented

> Probably won't need the GitHub API to pull the file: you can pull the raw version of the file from the repo directly (no cost or login credentials required for this 😌).

The idea is to have some kind of API where we can check if the file has changed before downloading a new version. Like compare the hash of local vs remote and decide if we need to update or not.

omkar-foss commented

> Like compare the hash of local vs remote and decide if we need to update or not.

This is a perfect use case for etag. GitHub sends the etag header in the raw file response, which contains a hash denoting the current file version.

File version check logic might look something like this:

  1. On first load, we download the file and save its etag hash value.
  2. We make a HEAD request to GitHub to get only the file response headers, without the actual file (sample curl below).
  3. Then we check whether the etag header hash values from steps 1 & 2 match; if they match, we do nothing, since we already have the latest file.
  4. If they don't match, it indicates a new file version; we download the new file by making a GET request to the same file URL, and we update our local etag hash value as in step 1.

Sample curl:

curl -I https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv

In the sample curl output you'll see that models.csv currently has its etag set to 41a68510227fa2c99cf9d7f6635abd16f4a672e2719ba95eca1b70de5496caf9.

More info on etag header in MDN docs here.
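
A sketch of steps 1-4 using only the Python standard library; the local file names are assumptions:

# Sketch: download models.csv only when the remote etag differs from
# the one we saved last time.
import urllib.request
from pathlib import Path

URL = (
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)
CSV_PATH = Path("models.csv")    # assumed local copy
ETAG_PATH = Path("models.etag")  # assumed etag store

def refresh_models() -> bool:
    head = urllib.request.Request(URL, method="HEAD")  # step 2
    with urllib.request.urlopen(head) as resp:
        remote_etag = resp.headers.get("ETag", "")
    if ETAG_PATH.exists() and ETAG_PATH.read_text() == remote_etag:
        return False  # step 3: etags match, local file is current
    with urllib.request.urlopen(URL) as resp:  # step 4: GET the new file
        CSV_PATH.write_bytes(resp.read())
    ETAG_PATH.write_text(remote_etag)  # steps 1 & 4: remember the etag
    return True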
