
Better concatenation and individual metrics when using multiple text datasets #22

Open
bhavul opened this issue Oct 9, 2023 · 5 comments
Labels: enhancement, good first issue

Comments

@bhavul (Contributor) commented Oct 9, 2023

For the text task, when we have multiple datasets, the concatenation strategy could be moved to more sophisticated logic by using HuggingFace concatenation.

Further, we may wish to change the evaluation loop to also report metrics for each individual dataset, in addition to the average.

The text task looks good so far; I am curious about the choice / what you think is the best way to handle having multiple datasets. Are there speed benefits to concatenating the datasets as done here? If we had separate tasks, we would also need to calculate the total tokens for each task and proportionally determine how much of each batch comes from each task based on token counts, but we don't have to worry about this if we follow your procedure. It seems like there is an edge case where the concatenation will not work if the columns are not named the same: https://huggingface.co/docs/datasets/process#concatenate
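
For reference, a minimal sketch of that concatenation path with HuggingFace `datasets`, assuming two illustrative corpora whose text columns are named differently (the second dataset id and its "content" column are placeholders, not what the repo actually loads):

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative inputs: WikiText plus a placeholder second corpus.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
other = load_dataset("some/other-text-corpus", split="train")  # placeholder id

# concatenate_datasets requires identical features, so align the column
# names first (assume the second corpus stores its text under "content").
other = other.rename_column("content", "text")

# Drop any extra columns so the schemas match exactly.
wikitext = wikitext.remove_columns([c for c in wikitext.column_names if c != "text"])
other = other.remove_columns([c for c in other.column_names if c != "text"])

combined = concatenate_datasets([wikitext, other])
```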

One thing that may be useful, if we have multiple datasets which are concatenated, is to compute specific metrics for each separate dataset during evaluation. E.g., we want a separate perplexity score for WikiText vs. the Pile, not just the average of both. Potentially, after concatenating, we could maintain start and end indices for each dataset, e.g. the Pile is 0 to 200mil and the other dataset is (200mil + 1) to 400mil, so we can attribute which samples correspond to each task and separately aggregate their metrics.
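
A rough sketch of that boundary-index idea, assuming the concatenation step is given named datasets up front (the helper names here are made up for illustration):

```python
from datasets import concatenate_datasets

def concat_with_boundaries(named_datasets):
    """Concatenate datasets, recording the [start, end) index range of each source."""
    boundaries, start = {}, 0
    for name, ds in named_datasets.items():
        boundaries[name] = (start, start + len(ds))
        start += len(ds)
    return concatenate_datasets(list(named_datasets.values())), boundaries

def source_of(idx, boundaries):
    """Map an index in the concatenated dataset back to its source dataset."""
    for name, (lo, hi) in boundaries.items():
        if lo <= idx < hi:
            return name
    raise IndexError(idx)

# Usage: the evaluation loop could bucket per-sample losses by
# source_of(sample_index, boundaries) and report each dataset's
# perplexity alongside the overall average.
```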

Another strategy is to track only the average during training, and after training finishes, load the model (e.g. via eval.py) and run over each of your tasks separately, with text_datasets={the specific dataset you want your eval metrics over}; this may be inconvenient, though.
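
A sketch of that post-training route, not using the repo's eval.py but a generic per-dataset perplexity loop with a stand-in GPT-2 model (and reusing the `wikitext` / `other` datasets from the earlier sketch), just to illustrate evaluating each dataset separately:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model/tokenizer; the real workflow would load the trained checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def perplexity(model, tokenizer, dataset, text_column="text", max_samples=100):
    """Average per-example perplexity over a small slice of one dataset."""
    losses = []
    for example in dataset.select(range(min(max_samples, len(dataset)))):
        enc = tokenizer(example[text_column], return_tensors="pt", truncation=True)
        if enc.input_ids.numel() < 2:  # skip empty or one-token lines
            continue
        with torch.no_grad():
            out = model(**enc, labels=enc.input_ids)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Evaluate each dataset separately instead of only the concatenated average.
results = {name: perplexity(model, tokenizer, ds)
           for name, ds in {"wikitext": wikitext, "other": other}.items()}
print(results)
```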

Originally posted by @daniellawson9999 in #1 (comment)

@bhavul added the enhancement and good first issue labels Dec 20, 2023
@Shravya-Kasturi commented

Hi @bhavul, I would like to work on this issue. Anything I should know before I start working?

@pritam5756 commented

Hi @bhavul, is this issue fixed?

@harshsikka (Member) commented

@Pritam-hakingmaster this is a good issue to get started with - the basic concatenation is implemented using HuggingFace datasets (see line 29 in gato/tasks/text_task.py).

Individual metrics per dataset is an open challenge that no one has taken on yet; it's worth doing.

@pritam5756 commented

I will try my best.

@pritam5756 commented

I would like to work on this issue.
