Section 11: Datasets for LLM Training

  • LLM-generated datasets:
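    1. Self-Instruct: Starts from a seed task pool of human-written instructions and has the LLM generate new instruction data from it. [20 Dec 2022]
    2. Self-Alignment with Instruction Backtranslation: Starting from a model fine-tuned on a small seed set, use the LLM to produce instruction-response pairs from unlabeled documents. The process involves two steps: self-augmentation (generate an instruction for each document) and self-curation (keep only the high-quality pairs). [11 Aug 2023]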
  • LLMDataHub: Awesome Datasets for LLM Training: A quick guide (especially) for trending instruction fine-tuning datasets.
  • Open LLMs and Datasets: A list of open LLMs available for commercial use.
  • SQuAD: The Stanford Question Answering Dataset (SQuAD): 100,000+ question-answer pairs posed on 500+ Wikipedia articles. [16 Jun 2016]
  • RedPajama: An open reproduction of the LLaMA training dataset, with over 1.2 trillion tokens. git [17 Apr 2023]
  • FineWeb: A Hugging Face dataset of 15 trillion tokens of high-quality web data, crawled from the summer of 2013 to March 2024. [Apr 2024]
  • MS MARCO Web Search: A large-scale, information-rich web dataset featuring millions of real clicked query-document labels. [Apr 2024]
  • Synthetic Data of LLMs: A reading list on LLM-based synthetic data generation. [Oct 2024]
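
The two-step process described for Instruction Backtranslation (self-augmentation, then self-curation) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate_instruction`, `score_pair`, and the `min_score` threshold are hypothetical stand-ins for LLM calls and a quality cutoff, and the toy functions at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, Iterable


def backtranslate(
    documents: Iterable[str],
    generate_instruction: Callable[[str], str],  # hypothetical: an LLM call
    score_pair: Callable[[str, str], float],     # hypothetical: an LLM quality rating
    min_score: float = 4.0,                      # assumed cutoff, not from the source
) -> list:
    # Self-augmentation: treat each document as a response and ask the
    # model which instruction it could be answering.
    candidates = [
        {"instruction": generate_instruction(doc), "response": doc}
        for doc in documents
    ]
    # Self-curation: keep only the pairs the model rates highly enough.
    return [
        pair
        for pair in candidates
        if score_pair(pair["instruction"], pair["response"]) >= min_score
    ]


# Toy stand-ins so the sketch runs without a model.
docs = ["Paris is the capital of France.", "asdf qwer"]
toy_generate = lambda doc: f"Write a passage that begins: {doc[:20]}"
toy_score = lambda instruction, response: 5.0 if response.endswith(".") else 1.0

pairs = backtranslate(docs, toy_generate, toy_score)
```

In practice both callables would wrap the same model, which is what makes the approach "self"-aligned.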

RedPajama: Pretraining for a base model - raw text with metadata

{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
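
As a rough sketch of how records in this shape might be consumed, the snippet below reads JSON Lines and keeps only the text from selected subsets. It assumes one JSON object per line; the two sample records are made up for illustration, and `iter_texts` is a hypothetical helper, not part of any RedPajama tooling.

```python
import json

# Made-up sample records that follow the schema above.
SAMPLE_JSONL = """\
{"text": "def add(a, b): return a + b", "meta": {"source": "github"}, "red_pajama_subset": "github"}
{"text": "Paris is the capital of France.", "meta": {"source": "wikipedia", "language": "en"}, "red_pajama_subset": "wikipedia"}
"""


def iter_texts(lines, subsets):
    """Yield the `text` field of every record whose subset is in `subsets`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["red_pajama_subset"] in subsets:
            yield record["text"]


texts = list(iter_texts(SAMPLE_JSONL.splitlines(), {"wikipedia"}))
```

For pretraining only the `text` field is fed to the model; `meta` and the subset label are mainly useful for filtering and for mixing sources.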

databricks-dolly-15k: Instruction-tuning dataset git: SFT training - QA pairs or dialog

[
  {
    "prompt": "What is the capital of France?",
    "response": "The capital of France is Paris."
  },
  {
    "prompt": "Can you give me a recipe for chocolate chip cookies?",
    "response": "Sure! ..."
  }
]
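
For SFT, each prompt/response pair is usually rendered into a single training string. The sketch below shows one way to do that; the `### Instruction` / `### Response` template is an assumption chosen for illustration (real projects use whatever template their model expects), and `build_example` is a hypothetical helper.

```python
# Assumed prompt template; not prescribed by the dataset itself.
PROMPT_TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:\n"


def build_example(pair):
    """Render one prompt/response pair into a single training string."""
    prompt_text = PROMPT_TEMPLATE.format(prompt=pair["prompt"])
    return {
        "text": prompt_text + pair["response"],
        # Length of the prompt part, so a trainer could exclude it from the loss.
        "prompt_chars": len(prompt_text),
    }


pair = {
    "prompt": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}
example = build_example(pair)
```

Recording where the prompt ends makes it possible to compute the loss on response tokens only, which is a common (though not universal) choice in instruction tuning.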

Anthropic human-feedback: RLHF training - Chosen and Rejected pairs

{
  "chosen": "I'm sorry to hear that. Is there anything I can do to help?",
  "rejected": "That's too bad. You should just get over it."
}
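
Chosen/rejected pairs like this are commonly used to train a reward model with a pairwise loss that pushes the score of the chosen response above the rejected one. The snippet below is a minimal sketch of that loss, `-log(sigmoid(r_chosen - r_rejected))`, on plain floats; the reward values passed in are hypothetical, and a real setup would get them from a model.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    x = -(reward_chosen - reward_rejected)
    # softplus(x) = log(1 + exp(x)), rewritten to avoid overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical reward scores for the pair above.
loss_good = preference_loss(2.0, 0.0)  # chosen scored higher: small loss
loss_bad = preference_loss(0.0, 2.0)   # rejected scored higher: large loss
```

The loss depends only on the margin between the two scores, so the dataset needs no absolute ratings, just the preference ordering.
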
  • Dataset example

    cite

    SFT Dataset

    | | Category | Instruction | Context | Response |
    |---|---|---|---|---|
    | 0 | Open QA | How do I get rid of mosquitos in my house? | | You can get rid of mosquitos in your house by ... |
    | 1 | Classification | Classify each country as "African" or "European" | | Nigeria: African; Rwanda: African; Portugal: European |
    | 2 | Information Extraction | Extract the unique names of composers from the text. | To some extent, European and the US traditions... | Pierre Boulez, Luigi Nono, Karlheinz Stockhausen |
    | 3 | General QA | Should investors time the market? | | Timing the market is based on predictions of t... |

    RLHF Dataset

    | Instruction | Chosen Response | Rejected Response |
    |---|---|---|
    | What is Depreciation | Depreciation is the drop in value of an asset ... | What is Depreciation – 10 Important Facts to K... |
    | What do you know about the city of Aberdeen in Scotland? | Aberdeen is a city located in the North East of Scotland. It is known for its granite architecture and its offshore oil industry. | As an AI language model, I don't have personal knowledge or experiences about Aberdeen. |
    | Describe thunderstorm season in the United States and Canada. | Thunderstorm season in the United States and Canada typically occurs during the spring and summer months, when warm, moist air collides with cooler, drier air, creating the conditions for thunderstorms to form. | Describe thunderstorm season in the United States and Canada. |