
Model characteristics dataset #3

Closed
samuelrince opened this issue Feb 27, 2024 · 4 comments · Fixed by #17

@samuelrince (Member) commented Feb 27, 2024

Description

To compute the impacts of a query, we need some characteristics of the model that was used; for LLMs in particular, we need the total parameter count.

Solution

A CSV or JSON file that stores all known models along with metadata such as the total parameter count.
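For instance, a minimal JSON entry could look like the sketch below (field names are illustrative, not a settled schema; the figures are those discussed later in this thread):

```json
{
  "name": "mistralai/Mixtral-8x7B-v0.1",
  "total_parameters": 46.7,
  "active_parameters": 12.9,
  "source": "https://docs.mistral.ai/models/"
}
```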

Considerations

Proprietary models

In many cases, we don't know the underlying architecture of proprietary models, so we will need to guesstimate it (see issue #1 for OpenAI). The estimation can be based on the performance these models achieve on various leaderboards compared to open-weight models. It is crucial to keep the source of each assessment, because it strongly influences the computed impacts.

Total parameters vs active parameters

For mixture of experts (MoE) models, we can define the active parameter count as the sum of all parameters actually used to run the computation (e.g. Mixtral), as illustrated below.
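As a rough illustration of this distinction (a sketch only; the shared vs. per-expert split below is an approximation, not an official breakdown), the counts for a top-2-of-8 sMoE in the Mixtral range can be computed like this:

```python
# Sketch of the total vs. active parameter distinction for a sparse MoE model.
# The shared/per-expert figures are rough approximations chosen to land near
# Mixtral 8x7B's published ~46.7B total / ~12.9B active, not official numbers.

shared_params_b = 1.6       # embeddings + attention, shared by all experts (billions, approx.)
expert_params_b = 5.6       # feed-forward parameters per expert (billions, approx.)
num_experts = 8             # experts available per MoE layer
experts_per_token = 2       # top-k routing: experts actually used for each token

total_params_b = shared_params_b + num_experts * expert_params_b          # ~46.4B
active_params_b = shared_params_b + experts_per_token * expert_params_b   # ~12.8B

print(f"total:  {total_params_b:.1f}B parameters")
print(f"active: {active_params_b:.1f}B parameters")
```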

@AndreaLeylavergne (Collaborator) commented

I can work on this issue. Is there anyone else who would like to work with me? Is there any information already available so that I can better understand what is expected here?

@samuelrince (Member, Author) commented Mar 4, 2024

A first step would be to define a file format to store information about models. We can start simple with something like this (example only):

| Model name | Total parameters (B) | Active parameters (B) | Source |
|---|---|---|---|
| mistralai/Mistral-7B-v0.1 | 7.3 | 7.3 | https://docs.mistral.ai/models/ |
| mistralai/Mixtral-8x7B-v0.1 | 46.7 | 12.9 | https://docs.mistral.ai/models/ |
| openai-community/gpt2-xl | 1.5 | 1.5 | https://huggingface.co/openai-community/gpt2-xl |
| gpt-3.5-turbo | 20 | 20 | #1 |

We also need to take into account that we will probably include other data in the future, such as min/max parameter values for proprietary models.

We should also scope out how we can collect this data automatically; someone else on the internet has probably already worked on that.
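For open-weight models, one possible starting point (a sketch, assuming a recent huggingface_hub release and that each repository publishes safetensors weights) is to read parameter counts directly from the Hub metadata:

```python
# Sketch: read total parameter counts for open-weight models from the
# Hugging Face Hub safetensors metadata. Requires a recent `huggingface_hub`
# and only works for repositories that publish safetensors weights.
from huggingface_hub import get_safetensors_metadata

MODELS = [
    "mistralai/Mistral-7B-v0.1",
    "mistralai/Mixtral-8x7B-v0.1",
    "openai-community/gpt2-xl",
]

for repo_id in MODELS:
    metadata = get_safetensors_metadata(repo_id)
    # parameter_count maps dtype -> number of parameters; sum over all dtypes
    total = sum(metadata.parameter_count.values())
    print(f"{repo_id}: {total / 1e9:.1f}B total parameters")
```

Proprietary models would still need manual estimates, as discussed above.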

@samuelrince (Member, Author) commented

Some clarifications after a quick meeting with @AndreaLeylavergne.

We will start with the main LLM providers (aligned with what we have already implemented in the package), so we will focus on OpenAI, Mistral AI, and Anthropic first.

We need to report LLMs in the following spreadsheet: model repository.

Column description (an illustrative example entry is sketched after the list):

  • provider: name of the provider in lower case (e.g. openai, mistralai, anthropic);
  • name: name of the model according to the official API documentation of each provider;
  • total_parameters: the total parameter count of the model (to be assessed when unknown);
  • active_parameters: for sparse mixture of experts (sMoE) models, the number of parameters active at inference time (to be assessed when unknown);
  • warnings: a list of formatted warnings about the assumptions made on the model parameters;
  • sources: a list of formatted sources used to add the model (mainly URLs).
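A hypothetical entry following these columns could look like the row below; the warning text only illustrates the idea of a formatted warning, and the gpt-3.5-turbo figures are the estimate from #1:

```csv
provider,name,total_parameters,active_parameters,warnings,sources
openai,gpt-3.5-turbo,20,20,"model architecture not disclosed; parameters estimated","#1"
```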

The model name should be the same as the one defined in each provider's official API documentation.

To find popular models and some assessments of their architecture, we can use this database.

@samuelrince linked a pull request on Mar 19, 2024 that will close this issue.