DISCLAIMER: This package contains research code. APIs may change.
Megatron Energon is the multi-modal data loader of Megatron (you can also use it independently).
It's best at
- loading large training data to train large multi-modal models
- blending many different datasets together
- distributing the work across many nodes and processes of a cluster
- ensuring reproducibility and resumability
- adapting easily to various types of data samples and processing
Try using it together with Megatron Core.
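The dataset-blending idea above can be illustrated with plain Python iterators. This is a conceptual sketch only, not energon's implementation or API: energon handles blending internally when you configure multiple datasets.

```python
import random
from typing import Iterator, Sequence

def blend(datasets: Sequence[Iterator], weights: Sequence[float],
          rng: random.Random) -> Iterator:
    """Yield samples by picking a source dataset with the given probabilities."""
    assert len(datasets) == len(weights)
    while True:
        # Choose one dataset proportionally to its weight, then take one sample.
        (source,) = rng.choices(datasets, weights=weights, k=1)
        try:
            yield next(source)
        except StopIteration:
            return  # One simple policy: stop when any source runs dry.

# Blend two toy "datasets" with a 3:1 ratio; seeding the RNG keeps it reproducible.
rng = random.Random(0)
mixed = blend([iter(range(100)), iter("abcdefgh")], [3, 1], rng)
first = [next(mixed) for _ in range(8)]
```

Each source still yields its samples in order; only the interleaving between sources is randomized, which is the essence of weighted blending.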
Megatron Energon is a pip-installable python package that offers
- dataset-related classes that you can import in your project
- a command line utility for data preprocessing and conversion
This document is just a quick start. Please also check out the documentation.
```
pip install megatron-energon
```

Or install directly from GitHub:

```
pip install git+https://github.com/NVIDIA/Megatron-Energon.git
```
NOTE: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.
For more details on installing this package, see here.
After installation, the command `energon` will be available.
Here are some examples for things you can do:
| Command | Description |
|---|---|
| `energon prepare DATASET_ROOT` | Take an existing WebDataset and add the required yaml files to turn it into an energon-compatible dataset |
| `energon lint DATASET_ROOT` | Verify that the dataset complies with the energon dataset format and that all samples are loadable |
To get started, pick a WebDataset-compliant dataset and run `energon prepare DATASET_ROOT` on it. This launches the interactive assistant and creates the `.nv-meta` folder.
Once done, try to load it from your Python program:
```python
from megatron.energon import get_train_dataset, get_loader

train_loader = get_loader(get_train_dataset(
    '/my/dataset/path',
    batch_size=32,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
))

for batch in train_loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
```
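The `shuffle_buffer_size` argument refers to the common shuffle-buffer technique: keep a fixed-size buffer of samples in memory and emit a random buffered sample as each new one arrives, giving an approximate shuffle of an arbitrarily large stream. A minimal, self-contained sketch of that idea (not energon's code):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def shuffle_buffer(samples: Iterable[T], buffer_size: int,
                   rng: random.Random) -> Iterator[T]:
    """Approximately shuffle a stream using only buffer_size samples of memory."""
    buffer: list[T] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Swap a random buffered sample to the end and yield it.
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    # Drain the remaining buffered samples in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(shuffle_buffer(range(10), buffer_size=4, rng=random.Random(42)))
```

A larger buffer gives a better shuffle at the cost of memory; with `buffer_size` equal to the dataset size it degenerates into a full in-memory shuffle.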
For more details, read the documentation.
Most likely, you'll need your own task encoder.
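Conceptually, a task encoder turns raw dataset samples into model-ready, batched inputs. The following is a purely illustrative, framework-free sketch of that pattern; the class and method names (`MyTaskEncoder`, `encode_sample`, `batch`) are hypothetical here, so consult the energon documentation for the actual task-encoder interface:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RawSample:
    # A toy "raw" sample, e.g. as it might come out of a WebDataset shard.
    text: str

@dataclass
class EncodedSample:
    token_ids: List[int]

class MyTaskEncoder:
    """Hypothetical task encoder: per-sample encoding plus batching."""

    def encode_sample(self, sample: RawSample) -> EncodedSample:
        # Stand-in "tokenizer": one id per character.
        return EncodedSample(token_ids=[ord(c) for c in sample.text])

    def batch(self, samples: List[EncodedSample]) -> List[List[int]]:
        # Pad every sample in the batch to the longest sequence.
        max_len = max(len(s.token_ids) for s in samples)
        return [s.token_ids + [0] * (max_len - len(s.token_ids)) for s in samples]

encoder = MyTaskEncoder()
encoded = [encoder.encode_sample(RawSample(t)) for t in ("hi", "hello")]
padded = encoder.batch(encoded)
```

The split between per-sample encoding and batch assembly is what lets a loader parallelize the expensive per-sample work across workers while keeping batching logic in one place.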