Improve the dependency on the models.csv file from ecologits #4

Open · ycouble opened this issue Jun 25, 2024 · 8 comments

Comments

ycouble (Collaborator) commented Jun 25, 2024

There is a tradeoff between fetching a remote file and having a desynchronized local copy.

Solutions:

  • caching (if remote); see the sketch after this list
  • submodule
  • ...
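
A minimal sketch of the caching option in Python, using the raw GitHub URL for models.csv cited later in this thread; the cache location is an assumption:

# Sketch: prefer the remote models.csv, fall back to a previously
# cached copy when the network is unavailable.
import urllib.request
from pathlib import Path

MODELS_URL = (  # raw file location on GitHub
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)
CACHE_PATH = Path.home() / ".cache" / "ecologits" / "models.csv"  # assumed

def load_models_csv() -> str:
    try:
        with urllib.request.urlopen(MODELS_URL, timeout=5) as resp:
            data = resp.read().decode("utf-8")
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(data)  # refresh the local cache
        return data
    except OSError:
        return CACHE_PATH.read_text()  # offline: serve the cached copy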
ycouble (Collaborator, Author) commented Jun 25, 2024

@samuelrince Maybe we can discuss this together with @inimaz.

samuelrince (Member) commented

Yes, good point, and it's the same question in the Python package itself. How can we sync updates of models.csv without needing to release a new version, while also keeping it local-first?

If you have thought of a process or an architecture, we can work on that for both libs.

inimaz (Collaborator) commented Jun 25, 2024

It heavily depends on how often models.csv is going to be updated.

Another alternative to the submodule: we could generate the models.csv file per release, i.e. fetch it in CI, save it to a file, and include it in the package at build time.
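
A rough sketch of that CI step in Python; the destination path inside the package is an assumption:

# Sketch: run in CI before packaging so the latest models.csv ships
# inside the release artifact.
import urllib.request

MODELS_URL = (
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)

urllib.request.urlretrieve(MODELS_URL, "ecologits/data/models.csv")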

ycouble (Collaborator, Author) commented Jun 25, 2024 via email

samuelrince (Member) commented

I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.

For each release of model_rep we generate a "database file" that contains all the models and hypotheses we used. This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).

Each time a new release of a client (ecologits or ecologits.js) happens, we inject the latest version of model_rep in CI. Plus, we can add a mechanism in the clients themselves to check (at most once a day?) whether a new version of model_rep is available (with an opt-out flag).

Thus, we can always have the latest version of model_rep without updating the clients.
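
A sketch of what the once-a-day check with an opt-out could look like in a client; the environment variable name and stamp-file location are assumptions:

# Sketch: rate-limit update checks to once a day and honor an opt-out.
import os
import time
from pathlib import Path

CHECK_INTERVAL = 24 * 60 * 60  # at most one check per day, in seconds
STAMP = Path.home() / ".cache" / "ecologits" / "last_check"  # assumed

def should_check_for_update() -> bool:
    if os.environ.get("ECOLOGITS_NO_UPDATE"):  # hypothetical opt-out flag
        return False
    if STAMP.exists() and time.time() - STAMP.stat().st_mtime < CHECK_INTERVAL:
        return False  # already checked within the last 24 h
    STAMP.parent.mkdir(parents=True, exist_ok=True)
    STAMP.touch()  # record this check
    return True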

This can also help us increase transparency on our hypotheses. E.g. if we change the parameters for gpt-4 over time, users can clearly see when we made the update and why.

If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.

Example format (to challenge):

[
    {
        "type": "model",
        "provider": "openai",
        "name": "gpt-3.5-turbo-01-0125",
        "architecture": {
            "type": "dense",
            "parameters": {
                "min": 20,
                "max": 70
            }
        },
        "warnings": [
            "model_achitecture_not_released"
        ],
        "sources": [
            "https://platform.openai.com/docs/models/gpt-3-5-turbo",
            "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
        ]
    },
    {
        "type": "alias",
        "provider": "openai",
        "name": "gpt-3.5-turbo",
        "alias": "gpt-3.5-turbo-01-0125"
    },
    {
        "type": "model",
        "provider": "openai",
        "name": "gpt-4-turbo-2024-04-09",
        "architecture": {
            "type": "moe",
            "parameters": {
                "total": 880,
                "active": {
                    "min": 110,
                    "max": 440
                }
            }
        },
        "warnings": [
            "model_achitecture_not_released",
            "model_achitecture_multimodal"
        ],
        "sources": [
            "https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4",
            "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
        ]
    },
    {
        "type": "model",
        "provider": "mistralai",
        "name": "open-mistral-7b",
        "architecture": {
            "type": "dense",
            "parameters": 7.3
        },
        "warnings": null,
        "sources": [
            "https://docs.mistral.ai/models/#sizes "
        ]
    }
]
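
One nice property of the alias entries is that clients can resolve a friendly name to a concrete model with a small lookup. A sketch against the example above ("models.json" is a hypothetical file name):

# Sketch: load the proposed format and resolve alias entries down to
# concrete model entries.
import json

with open("models.json") as f:  # hypothetical file in the format above
    entries = json.load(f)

by_name = {(e["provider"], e["name"]): e for e in entries}

def resolve(provider: str, name: str) -> dict:
    entry = by_name[(provider, name)]
    while entry["type"] == "alias":  # follow alias chains
        entry = by_name[(provider, entry["alias"])]
    return entry

print(resolve("openai", "gpt-3.5-turbo")["architecture"])  # the dense 20-70 B entry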

omkar-foss commented Aug 8, 2024

Sorry to barge into this conversation; I have a few pointers that might hopefully be useful.

> I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.

If we're expecting the database file to update often (e.g. more than once a week), then yes, moving it into a separate repository will help decouple the release cycles of ecologits and ecologits.js from the database file.

> This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).

Probably won't need the GitHub API to pull the file: you can pull the raw version of the file from the repo directly at https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv (no cost or login credentials required for this 😌).

> If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.

That'll be awesome because it may make supporting dynamic fields based on the model type easier (4th point in the description here).

samuelrince (Member) commented

> Probably won't need the GitHub API to pull the file: you can pull the raw version of the file from the repo directly (no cost or login credentials required for this 😌).

The idea is to have some kind of API where we can check if the file has changed before downloading a new version. Like compare the hash of local vs remote and decide if we need to update or not.

omkar-foss commented

> Like compare the hash of local vs remote and decide if we need to update or not.

This is a perfect use case for etag. GitHub sends the etag header in the raw file response, which contains a hash denoting the current file version.

File version check logic might look something like this:

  1. On first load, we download the file and save its etag hash value.
  2. We make a HEAD request to GitHub to get only the file response headers, without the actual file (sample curl below).
  3. Then we check whether the etag header hash values from steps 1 & 2 match; if they match, we do nothing, since we already have the latest file.
  4. If they don't match, it indicates a new file version; we download the new file by making a GET request to the same file URL, and we update our local etag hash value as in step 1.

Sample curl:

curl -I https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv

In the sample curl output you'll see that models.csv currently has its etag set to 41a68510227fa2c99cf9d7f6635abd16f4a672e2719ba95eca1b70de5496caf9.

More info on etag header in MDN docs here.
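
A sketch of steps 1-4 using only the Python standard library; the local file names are assumptions:

# Sketch: download models.csv only when the remote etag differs from
# the one we saved last time.
import urllib.request
from pathlib import Path

URL = (
    "https://raw.githubusercontent.com/genai-impact/ecologits/"
    "main/ecologits/data/models.csv"
)
CSV_PATH = Path("models.csv")    # assumed local copy
ETAG_PATH = Path("models.etag")  # assumed etag store

def refresh_models() -> bool:
    head = urllib.request.Request(URL, method="HEAD")  # step 2
    with urllib.request.urlopen(head) as resp:
        remote_etag = resp.headers.get("ETag", "")
    if ETAG_PATH.exists() and ETAG_PATH.read_text() == remote_etag:
        return False  # step 3: etags match, local file is current
    with urllib.request.urlopen(URL) as resp:  # step 4: GET the new file
        CSV_PATH.write_bytes(resp.read())
    ETAG_PATH.write_text(remote_etag)  # steps 1 & 4: remember the etag
    return True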
