
v0 Dataset and Benchmark Specification #58

Open · harshsikka opened this issue Dec 11, 2023 · 4 comments

@harshsikka (Member)

Context:

@snat-s has done great work w/ the analysis of various data that may be relevant to Neko.

We should now, w/ the input of the team, finalize our proposed v0 dataset and justify its composition.

Output: a document w/ the v0 dataset proposal, justifications, and an experimental plan to benchmark (and ultimately train) Neko v0.

harshsikka converted this from a draft issue Dec 11, 2023

@harshsikka (Member, Author)

Let's shoot to have a preliminary proposal ready before Sunday, so we can share w/ the team for feedback.

@snat-s commented Jan 7, 2024

Here is v0 for the language and image datasets.

Dataset proposal: Vision and Language

We are trying to replicate the original Gato paper as accurately as possible. For the vision and language tasks, there are a lot of datasets we can already use. The following gives context and the alternatives. I have left my personal choices at the end so as not to cloud judgment on what we should pick for the final run.

MassiveText is a proprietary dataset made by Google. The alternatives are the following:

  • The Pile: A dataset made by EleutherAI. It is 800GB of high-quality data that compiles several other datasets, including OpenWebText2, Wikipedia, GitHub, and StackExchange. There are two problems that I could find:

    1. It is a bit old (approx. 2020).
    2. People have said that more Common Crawl would be better (see this conversation on Reddit).
  • RedPajama V2: A dataset made by TogetherAI. It is built from 84 CommonCrawl dumps and deduplicated. It is a lot bigger, with 20.5 trillion tokens (for English). TinyLlama recently trained on v1. The downside of this dataset is that we would have to wrangle with deduplication ourselves if we want a cleaner dataset (even though it does not seem to matter). I am also not sure whether more diversity in the data would be nicer, since it is made only of CommonCrawl. A streaming sketch follows this list.
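If we pick RedPajama V2, a quick way to inspect it is to stream a few documents. A minimal sketch, assuming the `togethercomputer/RedPajama-Data-V2` dataset on the Hugging Face Hub; the "sample" config name and the `raw_content` field come from its dataset card and may change upstream, so treat them as assumptions:

```python
# Stream a few RedPajama V2 sample documents for inspection.
from datasets import load_dataset

rpv2 = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",            # small sample config; the full set is ~20T tokens
    split="train",
    streaming=True,           # avoid downloading the whole dataset
    trust_remote_code=True,   # the loader is a dataset script
)

for i, doc in enumerate(rpv2):
    print(doc["raw_content"][:200])
    if i >= 2:
        break
```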

M3W: Another proprietary dataset from Google, made of documents paired with images. There are two alternatives we could use:

  • OBELICS: Massive dataset from Hugging Face (666 GB). It is made out of Common Crawl dumps. The pros of this dataset are that it has more unique images and a higher total number of tokens. It also compares favorably on perplexity and is heavily deduplicated. A loading sketch follows this list.
  • Multimodal C4 and Multimodal C4 Core: From AllenAI. It was the first attempt at generating something similar to document + image data. If we are trying to do test runs with a smaller dataset, MMC4-Core can be a good option.
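OBELICS documents interleave images and text, which matches the M3W format. A minimal sketch, assuming the `HuggingFaceM4/OBELICS` Hub dataset and the field layout described on its dataset card:

```python
# Stream one OBELICS document and walk its interleaved image/text slots.
# Per the dataset card, `images` holds URLs and `texts` holds text blocks;
# None marks the slot occupied by the other modality.
from datasets import load_dataset

obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

doc = next(iter(obelics))
for url, text in zip(doc["images"], doc["texts"]):
    if url is not None:
        print("IMAGE:", url)
    else:
        print("TEXT:", text[:80])
```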

ALIGN: A proprietary image-caption dataset internal to Google. Three different options exist here:

  • COYO-700M: This dataset has been used to reach the performance of the original ALIGN paper (they used more things besides this dataset, like COCO and CC3M).
  • COYO-300M: A smaller, machine-labeled dataset aimed at replicating JFT-300M.
  • LAION-2B: Originally, LAION released a dataset with 5B images + captions. It was not that high quality, so they narrowed it down to the best images. It is the one that best approximates ALIGN in size, since ALIGN supposedly has 1.8B images. A download sketch follows this list.
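LAION-2B ships as metadata parquet files; the images themselves have to be fetched from their URLs. A minimal sketch with the `img2dataset` tool that LAION recommends for this; the local paths here are illustrative assumptions, and the URL/TEXT column names follow the laion2B-en release:

```python
# Fetch images for LAION-2B-en from previously downloaded metadata
# parquet shards. Paths are placeholders, not real locations.
from img2dataset import download

download(
    url_list="laion2B-en-metadata/",  # folder of metadata parquet shards
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_folder="laion2B-images/",
    output_format="webdataset",       # tar shards, convenient for training
    image_size=256,
    processes_count=16,
)
```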

MS-COCO: Open data.

Conceptual Captions: Open data.

LTIP: Closed source, from Google. An image-text pair dataset of approximately 300M images. The alternatives are the ones from ALIGN.

OKVQA: It’s open data, but I would recommend using its successor, A-OKVQA, instead.

VQA-V2: Open data.

My choices

MassiveText -> RedPajama V2
M3W -> OBELICS (it's more diverse than MMC4)
ALIGN -> LAION-2B (closest in size)
MS-COCO (open source)
Conceptual Captions (open source)
LTIP -> COYO-700M
OKVQA -> A-OKVQA (open source, an augmented version of OKVQA)
VQA-V2 (open source)
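For pretraining we would sample across these sources with some weighting. A minimal sketch of such a mixture over two text streams using `interleave_datasets` from the Hugging Face `datasets` library (API per recent versions); the 80/20 weights and the use of C4 as the second stream are illustrative assumptions, not settled choices:

```python
# Weighted mixture of two streaming text sources on a shared {"text": ...}
# schema. Real code would tokenize and add the image-text sources too.
from datasets import load_dataset, interleave_datasets

rpv2 = load_dataset("togethercomputer/RedPajama-Data-V2", name="sample",
                    split="train", streaming=True, trust_remote_code=True)
rpv2 = rpv2.select_columns(["raw_content"]).rename_column("raw_content", "text")

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
c4 = c4.select_columns(["text"])

mixture = interleave_datasets([rpv2, c4], probabilities=[0.8, 0.2], seed=0)
for example, _ in zip(mixture, range(3)):
    print(example["text"][:100])
```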

One last comment: currently this dataset would be for pretraining the large multimodal model. It would be interesting to generate a dataset to align these models. We could also use the datasets from something like LLaVA (multimodal instruction-following data).

@snat-s commented Jan 16, 2024

For the control datasets, we can use this preprocessing: Torch-X-Embodiment, to convert the TensorFlow datasets into PyTorch-compatible ones.
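A minimal sketch of the tf.data -> PyTorch bridge this kind of preprocessing performs: read an Open X-Embodiment builder with TFDS and expose its steps as a torch `IterableDataset`. The GCS path and the "steps"/"observation"/"image" field names follow the RT-X release but vary per embodiment, so treat them as assumptions rather than the Torch-X-Embodiment implementation itself:

```python
# Wrap a TFDS episode stream in a PyTorch IterableDataset.
import tensorflow_datasets as tfds
import torch
from torch.utils.data import DataLoader, IterableDataset

class RTXSteps(IterableDataset):
    def __init__(self, builder_dir: str):
        self.builder_dir = builder_dir

    def __iter__(self):
        builder = tfds.builder_from_directory(self.builder_dir)
        episodes = tfds.as_numpy(builder.as_dataset(split="train"))
        for episode in episodes:
            for step in episode["steps"]:  # nested per-episode step records
                yield {
                    "image": torch.from_numpy(step["observation"]["image"]),
                    # action layout differs per dataset; flatten as needed
                }

# Illustrative path into the Open X-Embodiment GCS bucket.
loader = DataLoader(RTXSteps("gs://gresearch/robotics/bridge/0.1.0"),
                    batch_size=32)
```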

@snat-s commented Jan 16, 2024

For a subset / mini version that we could use for pretraining test runs, we could use the following datasets:

MassiveText -> there are three high-quality options: MiniPile (6GB), C4 (16GB), and OpenWebText (54GB); a MiniPile loading sketch follows this list
M3W -> Fewer Faces Multimodal C4 Core (that's a mouthful; ~33GB, text only)
ALIGN + LTIP -> COYO-700M

MS-COCO (open source)
Conceptual Captions (open source)
OKVQA -> A-OKVQA (open source, an augmented version of OKVQA)
VQA-V2 (open source)
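MiniPile is small enough that it loads whole, which makes it handy for quick test runs. A minimal sketch, assuming the `JeanKaddour/minipile` mirror on the Hugging Face Hub and its `text` field:

```python
# Load the full MiniPile (~6GB) and peek at the first document.
from datasets import load_dataset

minipile = load_dataset("JeanKaddour/minipile", split="train")
print(len(minipile), "documents")
print(minipile[0]["text"][:200])
```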
