Vista


"An open-source dataset of 700,000 Vietnamese vision-language samples"

Overview

This dataset contains over 700,000 Vietnamese vision-language samples generated with Gemini Pro. We employed several prompt-engineering techniques: few-shot learning, caption-based prompting, and image-based prompting.

  • For the COCO dataset, we generated data using LLaVA-style prompts.

  • For the ShareGPT4V dataset, we used translation prompts.

  • Caption-based prompting: involves using accurate captions and bounding boxes from the original dataset.

  • Image-based prompting: uses images to create captions and conversations.

The curation process removed any Han, Japanese, and Korean characters. The data was then refined by filtering out samples with high perplexity.
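
The character-based curation step can be sketched roughly as follows. This is an illustration only, not the code used to build Vista: the Unicode ranges, the names `contains_cjk` and `drop_cjk_samples`, the `text` field, and the choice to drop whole samples (rather than strip the characters) are all assumptions.

```python
import re

# Assumed Unicode ranges (not taken from the Vista code base):
# Han ideographs, Japanese kana, and Korean Hangul.
CJK_PATTERN = re.compile(
    "["
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs (Han)
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u1100-\u11ff"  # Hangul Jamo
    "\uac00-\ud7af"  # Hangul Syllables
    "]"
)


def contains_cjk(text: str) -> bool:
    """Return True if the text has any Han, Japanese, or Korean character."""
    return CJK_PATTERN.search(text) is not None


def drop_cjk_samples(samples: list[dict], field: str = "text") -> list[dict]:
    """Keep only the samples whose `field` is free of the characters above."""
    return [s for s in samples if not contains_cjk(s[field])]
```

Vietnamese diacritics live in the Latin blocks, so ordinary Vietnamese text passes this check unchanged. If stripping characters is preferred over dropping samples, `CJK_PATTERN.sub("", text)` would do that instead.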


HuggingFace Dataset

Report: Coming Soon

Dataset Structure

The dataset is structured into 5 subsets:

| Subset | Split | Method | Size |
|---|---|---|---|
| Vi-LLAVA conversation | train | caption-based | 107,052 |
| | validation | | 4,550 |
| Vi-LLAVA complex reasoning | train | caption-based | 112,650 |
| | validation | | 4,771 |
| Vi-LLAVA detail description | train | caption-based | 111,153 |
| | validation | | 4,714 |
| Vi-ShareGPT4V | | translation | 96,913 |
| Vi-WIT | | caption-based, image-based | 264,831 |
| Total | | | 706,634 |

Data process

Vi-LLAVA

Follow the instructions in the Vi-LLAVA/ folder.

Translate ShareGPT4V

bash scripts/translate_shareGPT4V.sh

WIT

Follow the instructions in the WIT/ folder.

Filtering perplexity


import os

from datasets import load_dataset
from perplexity.filtering import FilteringPerplexity

# Specify your own dataset
dataset = load_dataset("Specify your dataset", split="train")

# Set up perplexity filtering
perplexity_filtering = FilteringPerplexity(
    sentencepiece_model_path=os.path.join("path to sentencepiece model"),
    kenlm_model_path=os.path.join("path to kenlm model"),
)

# Compute the perplexity of each sample
data_contains_perplex = perplexity_filtering.compute(dataset)

# Filter by perplexity
threshold = 100  # Set your own threshold if needed
data_filtered = perplexity_filtering.filter(data_contains_perplex, threshold=threshold)
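
For intuition about what the threshold means, the sketch below computes perplexity from a language-model log-probability and applies a cutoff. It does not reproduce `FilteringPerplexity`: the helper names, the `log10_prob` and `num_tokens` fields, the example scores, and the choice to keep samples at or below the threshold are assumptions. KenLM reports base-10 log-probabilities, which is why the formula uses base 10.

```python
def perplexity(log10_prob: float, num_tokens: int) -> float:
    """Perplexity = 10 ** (-log10 P(sentence) / number of scored tokens)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return 10.0 ** (-log10_prob / num_tokens)


def filter_by_perplexity(samples, threshold=100.0):
    """Keep samples whose perplexity does not exceed the threshold."""
    kept = []
    for sample in samples:
        ppl = perplexity(sample["log10_prob"], sample["num_tokens"])
        if ppl <= threshold:
            # Attach the score so it can be inspected later.
            kept.append({**sample, "perplexity": ppl})
    return kept


# Hypothetical scores: one likely sentence and one unlikely sentence.
samples = [
    {"text": "fluent", "log10_prob": -8.0, "num_tokens": 8},
    {"text": "garbled", "log10_prob": -24.0, "num_tokens": 8},
]
filtered = filter_by_perplexity(samples, threshold=100.0)
```

A lower threshold keeps only text the language model finds very predictable, so it is worth inspecting the perplexity distribution of your own data before fixing a value.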

Personal and Sensitive Information

  • The dataset does not contain any personal or sensitive information.

Bias, Risks, and Limitations

  • The dataset may contain biases due to the sources from which the data was collected.
  • Users should be aware of these potential biases when using the dataset.

Authors

Licensing Information

The dataset is released under the MIT license.

Additional Information

Citation Information

BibTeX:

@article{ViVLM_Vista_2024,
  title={Vista},
  author={Tran, Oanh Ngoc and Bui, Hop Van and Ha, Hoang Huy and Phan, Phuc Van},
  year={2024},
  month={May},
  url={https://huggingface.co/datasets/Vi-VLM/Vista}
}