v0 Dataset and Benchmark Specification #58
Let's shoot to have a preliminary proposal ready before Sunday, so we can share it w/ the team for feedback.
Here is v0 for the language and images.

**Dataset proposal: Vision and Language**

We are trying to replicate the original Gato paper as closely as possible. For the vision and language tasks, there are a lot of datasets we can already use. The following is the context and the alternatives; I have left my personal choices at the end so as not to cloud judgment on what we should pick for the final run.

- MassiveText: a proprietary dataset made by Google that is not open source. The alternatives are the following:
- M3W: another proprietary dataset from Google, made of documents paired with images. There are two alternatives we could use:
- ALIGN: a proprietary image-caption dataset inside of Google. Three different options exist here:
- MS-COCO: open data.
- Conceptual Captions: open data.
- LTIP: closed-source from Google; an image-text-pair dataset of approximately 300M images. The alternatives are the ones from ALIGN.
- OKVQA: open data, though I would recommend using its successor, A-OKVQA, instead.
- VQA-V2: open data.

My choices:

- MassiveText -> RedPajama-V2

A last comment: currently this dataset would be for pretraining the large multimodal model. It would also be interesting to generate a dataset to align these models; we could use something like the LLaVA datasets (multimodal instruction-following data). A loading sketch for the open pieces follows below.
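A minimal sketch of how the open datasets above could be pulled via Hugging Face `datasets`. The hub IDs and config names (`togethercomputer/RedPajama-Data-V2` with the `sample` config, `conceptual_captions`, `HuggingFaceM4/A-OKVQA`) are my assumptions about where these corpora live and should be verified before a real run.

```python
# Hedged sketch: loading a few of the open datasets mentioned above with
# Hugging Face `datasets`. Hub IDs and config names are assumptions to verify.
from datasets import load_dataset

# Open replacement for MassiveText (my pick): RedPajama-V2, streamed so we
# don't have to download the full corpus up front.
redpajama = load_dataset(
    "togethercomputer/RedPajama-Data-V2", "sample",
    split="train", streaming=True,
)

# Open image-caption data (also a plausible stand-in for ALIGN-style pairs).
conceptual_captions = load_dataset(
    "conceptual_captions", split="train", streaming=True
)

# VQA-style data: A-OKVQA as the suggested successor to OKVQA.
aokvqa = load_dataset("HuggingFaceM4/A-OKVQA", split="train")

# Peek at one text record and one caption record.
print(next(iter(redpajama)))
print(next(iter(conceptual_captions)))
```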
For the control datasets, we can use this preprocessing: Torch-X-Embodiment, to convert the TensorFlow datasets into PyTorch-compatible ones. A rough sketch of that kind of conversion is below.
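This is a generic sketch of the TFDS/RLDS-to-PyTorch idea, not the Torch-X-Embodiment API itself; the builder path is a hypothetical Open X-Embodiment location and the field handling is simplified.

```python
# Generic sketch (not the Torch-X-Embodiment API): streaming an RLDS/TFDS
# robotics dataset and exposing its steps to PyTorch.
import numpy as np
import tensorflow_datasets as tfds
import torch
from torch.utils.data import DataLoader, IterableDataset


class RLDSTorchDataset(IterableDataset):
    """Yields individual steps from an RLDS episode dataset as torch tensors."""

    def __init__(self, builder_dir: str, split: str = "train"):
        self.builder = tfds.builder_from_directory(builder_dir=builder_dir)
        self.split = split

    def __iter__(self):
        episodes = self.builder.as_dataset(split=self.split)
        # RLDS nests a `steps` sub-dataset inside every episode; flatten it.
        steps = episodes.flat_map(lambda episode: episode["steps"])
        for step in steps.as_numpy_iterator():
            # Convert numeric observation fields; string fields (e.g. language
            # instructions) would need separate handling in a real pipeline.
            yield {
                k: torch.as_tensor(v)
                for k, v in step["observation"].items()
                if np.asarray(v).dtype.kind not in ("U", "S", "O")
            }


# Hypothetical usage; the path must point at a real RLDS builder directory.
# ds = RLDSTorchDataset("gs://gresearch/robotics/bridge/0.1.0")
# loader = DataLoader(ds, batch_size=None)
```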
For a subset / mini version that we could use for pretraining test runs, we could use the following datasets (a loading sketch follows below):

- MassiveText -> there are three high-quality options: MiniPile (6 GB), C4 (16 GB), and OpenWebText (54 GB)
- MS-COCO (open source)
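A small sketch of pulling those text options through Hugging Face `datasets`; the hub IDs (`JeanKaddour/minipile`, `allenai/c4`, `Skylion007/openwebtext`) are my best guesses and should be double-checked before the test runs.

```python
# Hedged sketch: the three small-scale text options for pretraining test runs.
# Hub IDs are assumptions; streaming avoids downloading everything up front.
from datasets import load_dataset

minipile = load_dataset("JeanKaddour/minipile", split="train")
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
openwebtext = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

# Quick sanity check on one streamed record.
print(next(iter(c4))["text"][:200])
```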
Context:
@snat-s has done great work w/ the analysis of various data that may be relevant to Neko.
We should now, w/ the input of the team, finalize our proposed v0 dataset and justify its composition.
Output: document w/ v0 dataset proposal, justifications, and experimental plan to benchmark (and ultimately train Neko v0)