Update dataset.md (#314)
* Update datasets.md

wanng-ide authored Apr 25, 2024
1 parent 39bd128 commit 3155119
Showing 1 changed file with 14 additions and 0 deletions.
14 changes: 14 additions & 0 deletions docs/datasets.md
@@ -22,3 +22,17 @@ The dataset is proposed for super-resolution tasks. We use the dataset for HQ fi
[HD-VG-130M](https://github.com/daooshee/HD-VG-130M?tab=readme-ov-file) comprises 130M text-video pairs.
The captions are generated by BLIP-2.
We find that both the scene quality and the text quality are relatively poor. For OpenSora 1.0, we use only ~350K samples from this dataset.

## Midjourney-v5-1.7M
[Midjourney-v5-1.7M](https://huggingface.co/datasets/wanng/midjourney-v5-202304-clean) includes 1.7M image-text pairs.
The dataset is divided into two subsets: original and upscale.
It is intended for exploring the relationship between prompts and high-quality images.

## Midjourney-kaggle-clean
[Midjourney-kaggle-clean](https://huggingface.co/datasets/wanng/midjourney-kaggle-clean) is a reconstructed version of [Midjourney User Prompts & Generated Images (250k)](https://www.kaggle.com/datasets/succinctlyai/midjourney-texttoimage?select=general-01_2022_06_20.json%5D) that has been cleaned using rule-based filtering.
The dataset is divided into two subsets: original and upscale.
It is intended to enable research on prompting text-to-image models.

## Unsplash-lite
The [Unsplash-lite](https://github.com/unsplash/datasets) dataset comprises 25k nature-themed Unsplash photos, 25k keywords, and 1M searches.
The searches span a wide range of uses and contexts, so the dataset covers broad intent and semantics, which makes it useful for research and learning.
