
Better concatenation and individual metrics when using multiple text datasets #22

Open
bhavul opened this issue Oct 9, 2023 · 5 comments
Labels: enhancement, good first issue

Comments

@bhavul (Contributor) commented Oct 9, 2023

For the text task, when we have multiple datasets, the concatenation strategy could be moved to more sophisticated logic by using HuggingFace concatenation.

Further, we may wish to change the evaluation loop to also report metrics for each individual dataset, in addition to the average.

The text task looks good so far; I am curious about the choice / what you think is the best way to handle having multiple datasets. Are there speed benefits to concatenating the datasets as done here? If we had separate tasks, we would also need to calculate the total tokens for each task and proportionally determine how much of each batch comes from each task based on token counts, but we don't have to worry about this if we follow your procedure. It seems like there is an edge case where the concatenation will not work if the columns are not named the same: https://huggingface.co/docs/datasets/process#concatenate
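
For reference, a minimal sketch of that concatenation path with HuggingFace `datasets`, assuming two illustrative corpora whose text columns are named differently (the second dataset id and its "content" column are placeholders, not what the repo actually loads):

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative inputs: WikiText plus a placeholder second corpus.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
other = load_dataset("some/other-text-corpus", split="train")  # placeholder id

# concatenate_datasets requires identical features, so align the column
# names first (assume the second corpus stores its text under "content").
other = other.rename_column("content", "text")

# Drop any extra columns so the schemas match exactly.
wikitext = wikitext.remove_columns([c for c in wikitext.column_names if c != "text"])
other = other.remove_columns([c for c in other.column_names if c != "text"])

combined = concatenate_datasets([wikitext, other])
```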

One thing that may be useful, if we have multiple datasets which are concatenated, is to compute specific metrics for each separate dataset during evaluation. E.g., we want a separate perplexity score for WikiText vs. the Pile, not just the average of both. Potentially, after concatenating, we could maintain start and end indices for each dataset, e.g. the Pile is 0 to 200mil and the other dataset is (200mil + 1) to 400mil, so we can attribute which samples correspond to each task and separately aggregate their metrics.
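
A rough sketch of that boundary-index idea, assuming the concatenation step is given named datasets up front (the helper names here are made up for illustration):

```python
from datasets import concatenate_datasets

def concat_with_boundaries(named_datasets):
    """Concatenate datasets, recording the [start, end) index range of each source."""
    boundaries, start = {}, 0
    for name, ds in named_datasets.items():
        boundaries[name] = (start, start + len(ds))
        start += len(ds)
    return concatenate_datasets(list(named_datasets.values())), boundaries

def source_of(idx, boundaries):
    """Map an index in the concatenated dataset back to its source dataset."""
    for name, (lo, hi) in boundaries.items():
        if lo <= idx < hi:
            return name
    raise IndexError(idx)

# Usage: the evaluation loop could bucket per-sample losses by
# source_of(sample_index, boundaries) and report each dataset's
# perplexity alongside the overall average.
```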

Another strategy is to track only the average during training, and after training finishes, load the model (e.g. via eval.py) and run over each of your tasks separately, with text_datasets={the specific dataset you want your eval metrics over}; this may be inconvenient, though.
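
A sketch of that post-training route, not using the repo's eval.py but a generic per-dataset perplexity loop with a stand-in GPT-2 model (and reusing the `wikitext` / `other` datasets from the earlier sketch), just to illustrate evaluating each dataset separately:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model/tokenizer; the real workflow would load the trained checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def perplexity(model, tokenizer, dataset, text_column="text", max_samples=100):
    """Average per-example perplexity over a small slice of one dataset."""
    losses = []
    for example in dataset.select(range(min(max_samples, len(dataset)))):
        enc = tokenizer(example[text_column], return_tensors="pt", truncation=True)
        if enc.input_ids.numel() < 2:  # skip empty or one-token lines
            continue
        with torch.no_grad():
            out = model(**enc, labels=enc.input_ids)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Evaluate each dataset separately instead of only the concatenated average.
results = {name: perplexity(model, tokenizer, ds)
           for name, ds in {"wikitext": wikitext, "other": other}.items()}
print(results)
```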

Originally posted by @daniellawson9999 in #1 (comment)

@bhavul added the enhancement and good first issue labels Dec 20, 2023
@Shravya-Kasturi commented

Hi @bhavul, I would like to work on this issue. Anything I should know before I start working?

@pritam5756 commented

Hi @bhavul, is this issue fixed?

@harshsikka (Member) commented

@Pritam-hakingmaster this is a good issue to get started with - the basic concatenation is implemented using HuggingFace datasets (see line 29 in gato/tasks/text_task.py).

Individual metrics per dataset is an open challenge that no one has taken on yet; it's worth doing.

@pritam5756 commented

I will try my best.

@pritam5756 commented

I would like to work on this issue.
