- Repository: https://github.com/JonasGeiping/cramming
- Paper: https://arxiv.org/abs/2212.14034
- Raw Data Source Paper: The Pile: An 800GB Dataset of Diverse Text for Language Modeling (https://arxiv.org/abs/2101.00027)
- Raw Data Source Datasheet: Datasheet for the Pile (https://arxiv.org/abs/2201.07311)
This is a preprocessed, tokenized dataset for the cramming project. Use it only with the tokenizer uploaded here.
This version is `97b8e776baafb99c3892e6572a9f51b3`, which corresponds to the specific dataset construction setup described below.
The raw data source is the Pile, an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.
This dataset is in English (EN).
This preprocessed subset contains only a train split.
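A minimal loading sketch using the Hugging Face `datasets` and `transformers` libraries. The repository id below is an assumption pieced together from the version hash above, not confirmed by this card; substitute the actual id of this dataset repository:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical repo id, assumed from the version hash above; replace it
# with the id of the repository this card belongs to.
repo_id = "JonasGeiping/the_pile_WordPiecex32768_97b8e776baafb99c3892e6572a9f51b3"

dataset = load_dataset(repo_id, split="train")      # only a train split exists
tokenizer = AutoTokenizer.from_pretrained(repo_id)  # the matching WordPiece tokenizer

# Entries are already tokenized into fixed-length sequences (seq_length: 128),
# so a row can be decoded directly for inspection.
print(tokenizer.decode(dataset[0]["input_ids"]))
```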
The configuration used to create this dataset with the cramming project code (https://github.com/JonasGeiping/cramming) is:

```yaml
# This is a slice of the pile, loaded from a local source
name: the_pile
defaults:
  - sources:
      - the_pile

#
# Preprocessing
normalizer:
  force_lowercase: True
  strip_accents: True
  force_english_keyboard: True
  whitespace_escape: False
tokenizer: WordPiece
vocab_size: 32768

# Dataset Formation
seq_length: 128
include_cls_token_in_corpus: False
include_sep_token_in_corpus: True
use_type_ids: False
max_entries_in_raw_dataset: 16e6  # about 40 million sequences of length 128
max_seq_in_tokenized_dataset: 85e6  # select only this many tokenized sequences
# max_seq_in_tokenized_dataset should be just slightly more than
# budget * 60 * 60 * expected tokens/sec for the single epoch of training
# (see the worked example after this block)

# Data Cleaning
named_entity_simplification: False
remove_whitespaces: False
remove_trash: True
trash_cutoff: 0.25
deduplicate_entries: False
deduplication_threshold: 75

# Data Order
ordering: sentence-length-curriculum  # could be a curriculum
```
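The normalizer settings above can be approximated with the Hugging Face `tokenizers` library. This is a minimal sketch of the lowercasing and accent-stripping steps only, not the cramming implementation; `force_english_keyboard` and `whitespace_escape` are cramming-specific options with no one-line equivalent here:

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# Lowercase and strip accents, mirroring force_lowercase: True and
# strip_accents: True. NFD decomposition runs first so that StripAccents
# can remove the combining marks it produces.
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

print(normalizer.normalize_str("Héllo Wörld"))  # -> "hello world"
```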
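The sizing comment on `max_seq_in_tokenized_dataset` can be made concrete with a back-of-the-envelope calculation. The throughput figure below is an illustrative assumption, not a number from the paper, and the comment is read here as sequences ≈ budget × 3600 × tokens/sec ÷ seq_length:

```python
# Back-of-the-envelope sizing for max_seq_in_tokenized_dataset, using an
# assumed sustained throughput (illustrative, not measured).
budget_hours = 24          # total training budget
tokens_per_sec = 120_000   # assumed training throughput
seq_length = 128           # matches seq_length in the config above

tokens_in_budget = budget_hours * 60 * 60 * tokens_per_sec
sequences_in_budget = tokens_in_budget / seq_length

print(f"{tokens_in_budget:.3g} tokens ≈ {sequences_in_budget:.3g} sequences")
# -> 1.04e+10 tokens ≈ 8.1e+07 sequences, so a cap of 85e6 sequences is
#    indeed "just slightly more" than one epoch at this throughput.
```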
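The `sentence-length-curriculum` ordering can be illustrated with a toy example: entries are sorted so that training sees shorter items first. This is only a sketch of the sorting idea, not the cramming implementation, which applies the curriculum during dataset construction:

```python
from datasets import Dataset

# Toy corpus of already-tokenized entries with different lengths.
data = Dataset.from_dict({"input_ids": [[1, 2, 3, 4], [1, 2], [1, 2, 3]]})

# Record each entry's length, then sort ascending: short sequences first.
data = data.map(lambda ex: {"length": len(ex["input_ids"])})
curriculum = data.sort("length")

print(curriculum["input_ids"])  # -> [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
```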
Limitations and bias: This training data was further filtered and sorted beyond the normal preprocessing. These modifications were not tested for unintended consequences.
This dataset is a filtered, sorted, and preprocessed subset of the Pile, made by Jonas Geiping. The original dataset was primarily curated by Leo Gao and Stella Biderman, with assistance from other authors of the Pile paper.
Licensing varies by subset; please refer to the specific license for the subsets you use at https://huggingface.co/datasets/EleutherAI/pile.
```bibtex
@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

@article{biderman2022datasheet,
  title={Datasheet for the pile},
  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
  journal={arXiv preprint arXiv:2201.07311},
  year={2022}
}
```