
v0 Dataset and Benchmark Specification #58

Open · harshsikka opened this issue Dec 11, 2023 · 4 comments

@harshsikka (Member)

Context:

@snat-s has done great work w/ the analysis of various data that may be relevant to Neko.

We should now, w/ the input of the team, finalize our proposed v0 dataset and justify its composition.

Output: a document w/ the v0 dataset proposal, justifications, and an experimental plan to benchmark (and ultimately train) Neko v0.

harshsikka converted this from a draft issue Dec 11, 2023

@harshsikka (Member, Author)

Let's shoot to have a preliminary proposal ready before Sunday, so we can share w/ the team for feedback.

@snat-s commented Jan 7, 2024

Here is v0 for the language and image datasets.

Dataset proposal: Vision and Language

We are trying to replicate the original Gato paper as accurately as possible. For the vision and language tasks, there are a lot of datasets we can already use. The following gives context and the alternatives. I have left my personal choices at the end so as not to cloud judgment on what we should pick for the final run.

MassiveText is a proprietary dataset made by Google. The alternatives are the following:

  • The Pile: A dataset made by EleutherAI. It is 800GB of high-quality data that compiles several other datasets, including OpenWebText2, Wikipedia, GitHub, and StackExchange. There are two problems that I could find:

    1. It is a bit old (approx. 2020).
    2. People have said that more Common Crawl would be better (see this conversation on Reddit).
  • RedPajama V2: A dataset made by TogetherAI. It is built from 84 CommonCrawl dumps and deduplicated. It is a lot bigger, with 20.5 trillion tokens (for English). TinyLlama recently trained on v1. The downside of this dataset is that we would have to wrangle with deduplication ourselves if we want a cleaner dataset (even though it does not seem to matter). I am also not sure whether more diversity in the data would be nicer, since it is made only of CommonCrawl. A streaming sketch follows this list.
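If we pick RedPajama V2, a quick way to inspect it is to stream a few documents. A minimal sketch, assuming the `togethercomputer/RedPajama-Data-V2` dataset on the Hugging Face Hub; the "sample" config name and the `raw_content` field come from its dataset card and may change upstream, so treat them as assumptions:

```python
# Stream a few RedPajama V2 sample documents for inspection.
from datasets import load_dataset

rpv2 = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",            # small sample config; the full set is ~20T tokens
    split="train",
    streaming=True,           # avoid downloading the whole dataset
    trust_remote_code=True,   # the loader is a dataset script
)

for i, doc in enumerate(rpv2):
    print(doc["raw_content"][:200])
    if i >= 2:
        break
```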

M3W: Another proprietary dataset from Google, made of documents paired with images. There are two alternatives we could use:

  • OBELICS: Massive dataset from Hugging Face (666 GB). It is made out of Common Crawl dumps. The pros of this dataset are that it has more unique images and a higher total number of tokens. It also compares favorably on perplexity and is heavily deduplicated. A loading sketch follows this list.
  • Multimodal C4 and Multimodal C4 Core: From AllenAI. It was the first attempt at generating something similar to document + image data. If we are trying to do test runs with a smaller dataset, MMC4-Core can be a good option.
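OBELICS documents interleave images and text, which matches the M3W format. A minimal sketch, assuming the `HuggingFaceM4/OBELICS` Hub dataset and the field layout described on its dataset card:

```python
# Stream one OBELICS document and walk its interleaved image/text slots.
# Per the dataset card, `images` holds URLs and `texts` holds text blocks;
# None marks the slot occupied by the other modality.
from datasets import load_dataset

obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

doc = next(iter(obelics))
for url, text in zip(doc["images"], doc["texts"]):
    if url is not None:
        print("IMAGE:", url)
    else:
        print("TEXT:", text[:80])
```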

ALIGN: A proprietary image-caption dataset internal to Google. Three different options exist here:

  • COYO-700M: This dataset has been used to reach the performance of the original ALIGN paper (they used more things besides this dataset, like COCO and CC3M).
  • COYO-300M: A smaller, machine-labeled dataset aimed at replicating JFT-300M.
  • LAION-2B: Originally, LAION released a dataset with 5B images + captions. It was not that high quality, so they narrowed it down to the best images. It is the one that best approximates ALIGN in size, since ALIGN supposedly has 1.8B images. A download sketch follows this list.
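LAION-2B ships as metadata parquet files; the images themselves have to be fetched from their URLs. A minimal sketch with the `img2dataset` tool that LAION recommends for this; the local paths here are illustrative assumptions, and the URL/TEXT column names follow the laion2B-en release:

```python
# Fetch images for LAION-2B-en from previously downloaded metadata
# parquet shards. Paths are placeholders, not real locations.
from img2dataset import download

download(
    url_list="laion2B-en-metadata/",  # folder of metadata parquet shards
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_folder="laion2B-images/",
    output_format="webdataset",       # tar shards, convenient for training
    image_size=256,
    processes_count=16,
)
```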

MS-COCO: Open data.

Conceptual Captions: Open data.

LTIP: Closed source, from Google. An image-text pair dataset of approximately 300M images. The alternatives are the ones from ALIGN.

OKVQA: It’s open data, but I would recommend using its successor, A-OKVQA, instead.

VQA-V2: Open data.

My choices

MassiveText -> RedPajama V2
M3W -> OBELICS (it's more diverse than MMC4)
ALIGN -> LAION-2B (closest in size)
MS-COCO (open source)
Conceptual Captions (open source)
LTIP -> COYO-700M
OKVQA -> A-OKVQA (open source, an augmented version of OKVQA)
VQA-V2 (open source)
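For pretraining we would sample across these sources with some weighting. A minimal sketch of such a mixture over two text streams using `interleave_datasets` from the Hugging Face `datasets` library (API per recent versions); the 80/20 weights and the use of C4 as the second stream are illustrative assumptions, not settled choices:

```python
# Weighted mixture of two streaming text sources on a shared {"text": ...}
# schema. Real code would tokenize and add the image-text sources too.
from datasets import load_dataset, interleave_datasets

rpv2 = load_dataset("togethercomputer/RedPajama-Data-V2", name="sample",
                    split="train", streaming=True, trust_remote_code=True)
rpv2 = rpv2.select_columns(["raw_content"]).rename_column("raw_content", "text")

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
c4 = c4.select_columns(["text"])

mixture = interleave_datasets([rpv2, c4], probabilities=[0.8, 0.2], seed=0)
for example, _ in zip(mixture, range(3)):
    print(example["text"][:100])
```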

One last comment: currently this dataset would be for pretraining the large multimodal model. It would be interesting to generate a dataset to align these models. We could also use the datasets from something like LLaVA (multimodal instruction-following data).

@snat-s commented Jan 16, 2024

For the control datasets, we can use this preprocessing: Torch-X-Embodiment, to convert the TensorFlow datasets into PyTorch-compatible ones.
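A minimal sketch of the tf.data -> PyTorch bridge this kind of preprocessing performs: read an Open X-Embodiment builder with TFDS and expose its steps as a torch `IterableDataset`. The GCS path and the "steps"/"observation"/"image" field names follow the RT-X release but vary per embodiment, so treat them as assumptions rather than the Torch-X-Embodiment implementation itself:

```python
# Wrap a TFDS episode stream in a PyTorch IterableDataset.
import tensorflow_datasets as tfds
import torch
from torch.utils.data import DataLoader, IterableDataset

class RTXSteps(IterableDataset):
    def __init__(self, builder_dir: str):
        self.builder_dir = builder_dir

    def __iter__(self):
        builder = tfds.builder_from_directory(self.builder_dir)
        episodes = tfds.as_numpy(builder.as_dataset(split="train"))
        for episode in episodes:
            for step in episode["steps"]:  # nested per-episode step records
                yield {
                    "image": torch.from_numpy(step["observation"]["image"]),
                    # action layout differs per dataset; flatten as needed
                }

# Illustrative path into the Open X-Embodiment GCS bucket.
loader = DataLoader(RTXSteps("gs://gresearch/robotics/bridge/0.1.0"),
                    batch_size=32)
```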

@snat-s commented Jan 16, 2024

For a subset / mini version that we could use for pretraining test runs, we could use the following datasets:

MassiveText -> there are three high-quality options: MiniPile (6GB), C4 (16GB), and OpenWebText (54GB); a MiniPile loading sketch follows this list
M3W -> Fewer Faces Multimodal C4 Core (that's a mouthful; ~33GB, text only)
ALIGN + LTIP -> COYO-700M

MS-COCO (open source)
Conceptual Captions (open source)
OKVQA -> A-OKVQA (open source, an augmented version of OKVQA)
VQA-V2 (open source)
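MiniPile is small enough that it loads whole, which makes it handy for quick test runs. A minimal sketch, assuming the `JeanKaddour/minipile` mirror on the Hugging Face Hub and its `text` field:

```python
# Load the full MiniPile (~6GB) and peek at the first document.
from datasets import load_dataset

minipile = load_dataset("JeanKaddour/minipile", split="train")
print(len(minipile), "documents")
print(minipile[0]["text"][:200])
```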
