The Dolma toolkit enables dataset curation for pretraining AI models. Reasons to use the Dolma toolkit include:
- High performance ⚡️ Dolma toolkit is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
- Portable 🧳 Dolma toolkit can be run on a single machine, a cluster, or a cloud computing environment.
- Built-in taggers 🏷 Dolma toolkit comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create Gopher and C4.
- Fast deduplication 🗑 Dolma toolkit can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
- Extensible 🧩 Dolma toolkit is designed to be extensible, and can be extended with custom taggers.
- Cloud support ☁️ Dolma toolkit supports reading and writing data from local disk and AWS S3-compatible locations.
Dataset curation with the Dolma toolkit usually happens in four steps (sketched in the example after this list):
- Using taggers, spans of documents in a dataset are tagged with properties (e.g., the language they are in, toxicity, etc.);
- Documents are optionally deduplicated based on their content or metadata;
- Using the mixer, documents are removed or filtered depending on the value of their attributes;
- Finally, documents can be tokenized using any HuggingFace-compatible tokenizer.
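For illustration, here is a rough sketch of those four steps as shell commands. The subcommand names (`tag`, `dedupe`, `mix`, `tokens`), the flags, the tagger name, and the config file names below follow the pattern of the Dolma getting-started guide but are assumptions for this example; check `dolma --help` and the documentation for the exact options and config schemas.

```shell
# 1. Tag documents with one or more built-in taggers; attribute files are
#    written alongside the documents (paths and tagger name are placeholders).
dolma tag \
    --documents "data/example/documents/*.json.gz" \
    --taggers random_number_v1 \
    --processes 8

# 2. Deduplicate documents with the Bloom-filter-based deduper; the Bloom
#    filter file, size, and other options are supplied via a config file.
dolma -c dedupe.yaml dedupe

# 3. Mix: keep, trim, or drop documents based on the attributes produced by
#    the taggers and the deduper, as described in a mixer config.
dolma -c mixer.yaml mix

# 4. Tokenize the remaining documents with a HuggingFace-compatible tokenizer.
dolma -c tokenizer.yaml tokens
```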
Dolma toolkit can be installed using pip:

```shell
pip install dolma
```
Dolma toolkit can be used either as a Python library or as a command line tool. The command line tool can be accessed using the `dolma` command. To see the available commands, use the `--help` flag:

```shell
dolma --help
```
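Each subcommand also accepts its own `--help` flag listing its options, for example (assuming the `tag` subcommand from the sketch above):

```shell
dolma tag --help
```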
To read Dolma toolkit's documentation, visit the following pages: