
MLPerf™ Storage V1.0 Benchmark Rules

——————————————————————————————————————————

1. Overview

MLPerf™ Storage is a benchmark suite to characterize the performance of storage systems that support machine learning workloads. MLPerf Storage does not require running the actual training jobs.

Thus, submitters do not need to use hardware accelerators (e.g., GPUs, TPUs, and other ASICs) when running MLPerf Storage.

Instead, our benchmark tool replaces the training on the accelerator for a single batch of data with a sleep() call. The sleep() interval depends on the batch size and accelerator type and has been determined through measurement on a system running the actual training workload. The rest of the data ingestion pipeline (data loading, caching, checkpointing) is unchanged and runs in the same way as when the actual training is performed.
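As a purely illustrative sketch of this idea (this is not the DLIO implementation, and the function and parameter names are made up), the emulation can be pictured as follows:

```python
import time

def emulated_epoch(dataloader, sleep_per_batch_s):
    """Illustrative only: the data pipeline (loading, caching, checkpointing)
    runs unchanged, while the accelerator compute for each batch is replaced
    by a sleep() whose duration was measured on real hardware for a given
    accelerator type and batch size."""
    for batch in dataloader:           # real I/O happens here
        time.sleep(sleep_per_batch_s)  # stands in for the accelerator compute
```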

There are two main advantages to accelerator emulation. First, MLPerf Storage allows testing different storage systems with different types of accelerators. To change the type of accelerator that the benchmark emulates (e.g., to switch to a system with NVIDIA H100 GPUs instead of A100 GPUs), it is enough to adjust the value of the sleep() parameter. The second advantage is that MLPerf Storage can put a high load on the storage system simply by increasing the number of emulated accelerators. This effectively allows for testing the behavior of the storage system in large-scale scenarios without purchasing/renting the commensurate compute infrastructure.

This version of the benchmark does not include offline or online data pre-processing. We are aware that data pre-processing is an important part of the ML data pipeline and we will include it in a future version of the benchmark.

This benchmark attempts to balance two goals. First, we aim for comparability between benchmark submissions to enable decision making by the AI/ML Community. Second, we aim for flexibility to enable experimentation and to show off unique storage system features that will benefit the AI/ML Community. To that end we have defined two classes of submissions: CLOSED and OPEN. The MLPerf name and logo are trademarks of the MLCommons® Association ("MLCommons"). In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. MLCommons reserves the right to solely determine if a use of its name or logos is acceptable.

Timeline

| Date | Description |
|------|-------------|
| Jun 26, 2024 | Freeze rules & benchmark code. |
| Aug 7, 2024 | Open benchmark for submissions. |
| Aug 21, 2024 | Submissions due. |
| Aug 21, 2024 - Sep 11, 2024 | Review period. |
| Sep 11, 2024 | Benchmark competition results are published. |

Benchmarks

The benchmark suite provides workload configurations that simulate the I/O patterns of selected workloads listed in Table 1. The I/O patterns for each MLPerf Storage benchmark correspond to the I/O patterns of the MLPerf Training and MLPerf HPC benchmarks (i.e., the I/O generated by our tool for 3D U-Net closely follows the I/O generated by actually running the 3D U-Net training workload). The benchmark suite can also generate synthetic datasets which show the same I/O load as the actual datasets listed in Table 1.

| Area | Problem | Model | Data Loader | Dataset seed | Minimum AU% |
|------|---------|-------|-------------|--------------|-------------|
| Vision | Image segmentation (medical) | 3D U-Net | PyTorch | KiTS 19 (140MB/sample) | 90% |
| Vision | Image classification | ResNet-50 | TensorFlow | ImageNet (150KB/sample) | 90% |
| Scientific | Cosmology parameter prediction | CosmoFlow | TensorFlow | CosmoFlow N-body simulation (2MB/sample) | 70% |

Table 1: Benchmark description

  • Benchmark start point: The dataset is in shared persistent storage.
  • Benchmark end point: The measurement ends after a predetermined number of epochs. Note: data transfers from storage in this test terminate with the data in host DRAM; transferring data into the accelerator memory is not included in this benchmark.
  • Configuration files for the workloads and dataset content can be found here.

Definitions

The following definitions are used throughout this document:

  • A sample is the unit of data on which training is run, e.g., an image, or a sentence.

  • A step is defined to be a single batch of data loaded into and processed by the (emulated) accelerator.

  • Accelerator Utilization (AU) is defined as the percentage of time taken by the simulated accelerators, relative to the total benchmark running time. Higher is better.

  • A division is a set of rules for implementing benchmarks from a suite to produce a class of comparable results. MLPerf Storage allows CLOSED and OPEN divisions, detailed in Section 6.

  • DLIO (code link, paper link) is a benchmarking tool for deep learning applications. DLIO is the core of the MLPerf Storage benchmark and with specified configurations will emulate the I/O pattern for the workloads listed in Table 1. MLPerf Storage provides wrapper scripts to launch DLIO. There is no need to know the internals of DLIO to do a CLOSED submission, as the wrapper scripts provided by MLPerf Storage will suffice. However, for OPEN submissions changes to the DLIO code might be required (e.g., to add custom data loaders).

  • Dataset content refers to the data and the total capacity of the data, not the format of how the data is stored. Specific information on dataset content can be found here.

  • Dataset format refers to the format in which the training data is stored (e.g., npz, hdf5, csv, png, tfrecord, etc.), not the content or total capacity of the dataset.

    NOTE: we plan to add support for Object storage in a future version of the benchmark, so OPEN submissions that include benchmark application changes and a description of how the original MLPerf Training benchmark dataset was mapped into Objects will be appreciated.

  • A storage system consists of a defined set of hardware and software resources that provide storage services to one or more host nodes. Storage systems can be hardware based, software-defined, virtualized or cloud based, and must be capable of providing the minimum storage services required to run the benchmark.

  • A storage scaling unit is defined as the minimum unit by which the performance and scale of a storage system can be increased. Examples of storage scaling units are “nodes”, “controllers”, “virtual machines” or “shelves”. Benchmark runs with different numbers of storage scaling units allow a reviewer to evaluate how well a given storage solution is able to scale as more scaling units are added.

  • A host node is defined as the minimum unit by which the load upon the storage system under test can be increased. Every host node must run the same number of simulated accelerators. A host node can be instantiated by running the MLPerf Storage benchmark code within a Container or within a VM guest image or natively within an entire physical system. The number of Containers or VM guest images per physical system and the CPU resources per host node are up to the submitter. Note that the maximum DRAM available to any host node must be used when calculating the dataset size to be generated for the test.

  • An ML framework is a specific version of a software library or set of related libraries for training ML models using a system. Examples include specific versions of Caffe2, MXNet, PaddlePaddle, PyTorch, or TensorFlow.

  • A benchmark is an abstract problem that can be solved using ML by training a model based on a specific dataset or simulation environment to a target quality level.

  • A reference implementation is a specific implementation of a benchmark provided by the MLPerf organization.

  • A benchmark implementation is an implementation of a benchmark in a particular framework by a user under the rules of a specific division.

  • A run is a complete execution of a benchmark implementation on a system.

  • A benchmark result is the mean of 5 run results, executed consecutively. The dataset is generated only once for the 5 runs, prior to those runs. The 5 runs must be done on the same machine(s).

  • The storage system under test must be described via one of the following storage system access types. The overall solution might support more than one of the below types, but any given benchmark submission must be described by the access type that was actually used during that submission. Specifically, this is reflected in the system-name.json file, in the storage_system→solution_type, the storage_system→software_defined and storage_system→hyperconverged fields, and the networks→protocols fields. An optional vendor-specified qualifier may be specified. This will be displayed in the results table after the storage system access type, for example, “NAS - RDMA”.

    • Direct-attached media – any solution using local media on the host node(s); e.g., NVMe-attached storage with a local filesystem layered over it. This will be abbreviated “Local” in the results table.
    • Remotely-attached block device – any solution using remote block storage; e.g., a SAN using FibreChannel, iSCSI, NVMeoF, NVMeoF over RDMA, etc., with a local filesystem implementation layered over it. This will be abbreviated “Remote Block” in the results table.
    • Shared filesystem using a standards-defined access protocol – any solution using a version of standard NFS or CIFS/SMB to access storage. This will be abbreviated “NAS” in the results table.
    • Shared filesystem using a proprietary access protocol – any network-shared filesystem solution that requires a unique/proprietary protocol implementation to be installed on the host node(s) to access storage; e.g., an HPC parallel filesystem. This will be abbreviated “Proprietary” in the results table.
    • Object – any solution accessed using an object protocol such as S3, RADOS, etc. This will be abbreviated “Object” in the results table.
    • Other – any solution whose access is not sufficiently described by the above categories. This will be abbreviated “Other” in the results table.

Performance Metrics

The benchmark performance metric is samples per second, subject to a minimum accelerator utilization (AU) defined for that workload. Higher samples per second is better.

To pass a benchmark run, the AU should be equal to or greater than the minimum value, and is computed as follows:

AU (percentage) = (total_compute_time/total_benchmark_running_time) * 100

All the I/O operations from the first step are excluded from the AU calculation in order to avoid skewing the average with the startup costs of the data processing pipeline, allowing the AU to converge more quickly on the steady-state performance of the pipeline. The I/O operations that are excluded from the AU calculation are, however, still included in the samples/second reported by the benchmark.

If all I/O operations are hidden by compute time, then the total_compute_time will equal the total_benchmark_running_time and the AU will be 100%.

The total compute time can be derived from the batch size, total dataset size, number of simulated accelerators, and sleep time:

total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs.

NOTE: The sleep time has been determined by running the actual MLPerf training workloads including the compute step on real hardware and is dependent on the accelerator type. In this version of the benchmark we include sleep times for NVIDIA A100 and H100 GPUs. We plan on expanding the measurements to different accelerator types in future releases.
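Putting the two formulas together, the pass/fail check can be sketched in Python as follows (the variable names mirror the formulas above; the numeric values in the example are placeholders, not measured sleep times):

```python
def compute_au(records_per_file, total_files, simulated_accelerators,
               batch_size, computation_time, epochs,
               total_benchmark_running_time):
    """AU (percentage) = total_compute_time / total_benchmark_running_time * 100."""
    total_compute_time = ((records_per_file * total_files)
                          / simulated_accelerators / batch_size
                          * computation_time * epochs)
    return total_compute_time / total_benchmark_running_time * 100.0

# Placeholder example: check a run against a 90% minimum AU.
au = compute_au(records_per_file=1, total_files=42_000,
                simulated_accelerators=8, batch_size=7,
                computation_time=1.36, epochs=5,
                total_benchmark_running_time=5_500.0)
print(f"AU = {au:.2f}% -> {'PASS' if au >= 90.0 else 'FAIL'}")
```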

Benchmark Code

The MLPerf Storage working group provides a benchmark implementation which includes:

  • Scripts to determine the minimum dataset size required for your system, for a given benchmark.
  • Scripts for data generation.
  • Benchmark tool, based on DLIO, with configuration files for the benchmarks.
  • A script for running the benchmark on one host (additional setup is required if you are running a distributed training benchmark – see Section 5).
  • A script for generating the results report (additional scripting and setup may be required if you are running a distributed training benchmark – see Section 5), and potentially additional supporting scripts.

More details on installing and running the benchmark can be found in the GitHub repo.

2. General Rules

The following apply to all results submitted for this benchmark.

2.1. Strive to be fair

Benchmarking should be conducted to measure the framework and storage system performance as fairly as possible. Ethics and reputation matter.

2.2. System and framework must be available

  • Available Systems. If you are measuring the performance of a publicly available and widely-used system or framework, you must use publicly available and widely-used versions of the system or framework. This class of systems will be called Available Systems, and availability here means the system is a publicly available commercial storage system. If you are measuring the performance of a system that is not available at the time of the benchmark results submission, the system must become commercially available within 6 months from results publication. Otherwise, the results for that submission will be retracted from the MLCommons results dashboard.
  • RDI Systems. If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication by MLCommons. This class of systems will be called RDI (research, development, internal).

2.3 Non-determinism

The data generator in DLIO uses a fixed random seed that must not be changed, to ensure that all submissions are working with the same dataset. Random number generators may be seeded from the following sources:

  • Clock
  • System source of randomness, e.g. /dev/random or /dev/urandom
  • Another random number generator initialized with an allowed seed

Random number generators may be initialized repeatedly in multiple processes or threads. For a single run, the same seed may be shared across multiple processes or threads.

2.4. Result rounding

Public results should be rounded normally, to two decimal places.

2.5. Stable storage must be used

The MLPerf Storage benchmark will create the dataset on the storage system, in the desired dataset format, before the start of the benchmark run. The data must reside on stable storage before the actual benchmark testing can run.

2.6. Caching

Under all circumstances, caching of training data on the host node(s) running MLPerf Storage before the benchmark begins is DISALLOWED. Caches in the host node(s) must be cleared between two consecutive benchmark runs.

On the one hand, we have sized the benchmark dataset to be 5x the size of DRAM on the host node(s) running the benchmark code so that the randomness of the access pattern can defeat any significant levels of caching in local DRAM. In that sense cache invalidation in the benchmark nodes should not be required, but out of an abundance of caution we do require it. On the other hand, we believed that repeated real-world training runs with a given storage system might benefit from caching in the storage system under normal circumstances, so we have not required cache invalidation there.

2.7. Replicability is mandatory

Results that cannot be replicated are not valid results. Replicated results should fall within 5% of the original result within 5 tries.

3. Dataset Generation

MLPerf Storage uses DLIO to generate synthetic data. Instructions on how to generate the datasets for each benchmark are available here. The datasets are generated following the sample size distribution and structure of the dataset seeds (see Table 1) for each of the benchmarks.

Minimum dataset size. The MLPerf Storage benchmark script must be used to run the benchmarks since it calculates the minimum dataset size for each benchmark. It does so using the provided number of simulated accelerators and the total memory, in GB, across all of the host nodes. The minimum dataset size computation is as follows:

  • Calculate required minimum samples given number of steps per epoch (NB: num_steps_per_epoch is a minimum of 500):
   min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes
  • Calculate required minimum samples given host memory to eliminate client-side caching effects; (NB: HOST_MEMORY_MULTIPLIER = 5):
   min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length
  • Ensure we meet both constraints:
   min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)
  • Calculate minimum files to generate
   min_total_files = min_samples / num_samples_per_file
   min_files_size = min_samples * record_length / 1024 / 1024 / 1024

A minimum of min_total_files files is required, which will consume min_files_size GB of storage.
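For reference, the calculation above can be sketched in Python as follows (this mirrors the formulas; the provided benchmark script performs the authoritative computation, and rounding the file count up is an assumption made here for illustration):

```python
import math

HOST_MEMORY_MULTIPLIER = 5   # dataset must be 5x aggregate host DRAM
MIN_STEPS_PER_EPOCH = 500    # num_steps_per_epoch minimum

def minimum_dataset(batch_size, num_accelerators_across_all_nodes,
                    number_of_hosts, memory_per_host_in_GB,
                    record_length_bytes, num_samples_per_file):
    # Constraint 1: enough samples for at least 500 steps per epoch.
    min_samples_steps = (MIN_STEPS_PER_EPOCH * batch_size
                         * num_accelerators_across_all_nodes)
    # Constraint 2: enough samples to defeat client-side caching effects.
    min_samples_memory = (number_of_hosts * memory_per_host_in_GB
                          * HOST_MEMORY_MULTIPLIER * 1024**3
                          / record_length_bytes)
    min_samples = max(min_samples_steps, min_samples_memory)
    min_total_files = math.ceil(min_samples / num_samples_per_file)
    min_files_size_GB = min_samples * record_length_bytes / 1024**3
    return min_total_files, min_files_size_GB
```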

Running the benchmark on a subset of a larger dataset. We support running the benchmark on a subset of the synthetically generated dataset. One can generate a large dataset and then run the benchmark on a subset of that dataset by setting num_files_train or num_files_eval smaller than the number of files available in the dataset folder. Note that if the dataset is stored in multiple subfolders, the subset actually used by this run will be evenly selected from all the subfolders. In this case, num_subfolders_train and num_subfolders_eval need to be equal to the actual number of subfolders inside the dataset folder in order to generate valid results.

Please note that the log file(s) output during the generation step need to be included in the benchmark results submission package.

4. Single-host Submissions

Submitters can add load to the storage system in two orthogonal ways: (1) increase the number of simulated accelerators inside one host node (i.e., one machine), and/or (2) increase the number of host nodes connected to the storage system.

For single-host submissions, increase the number of simulated accelerators by changing the --num-accelerators parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator.

For single-host submissions, CLOSED and OPEN division results must include benchmark runs for the maximum number of simulated accelerators that can be run on ONE HOST NODE, in ONE MLPerf Storage job, without going below the 90% accelerator utilization threshold.
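As a rough, illustrative sanity check (the 16 GB OS reservation below is an assumption, not a rule), the per-accelerator memory requirement bounds how many accelerators one node can emulate; the reportable maximum is still determined by the 90% AU threshold, not by memory:

```python
GB_PER_SIMULATED_ACCELERATOR = 0.5   # approximate host memory per simulated accelerator

def max_accelerators_by_memory(host_dram_gb, reserved_gb=16):
    """Upper bound on --num-accelerators for one host node; reserved_gb is an
    assumed allowance for the OS and other processes."""
    return int((host_dram_gb - reserved_gb) / GB_PER_SIMULATED_ACCELERATOR)

print(max_accelerators_by_memory(256))  # e.g., a 256 GB node -> 480 (memory-wise)
```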

5. Distributed Training Submissions

This setup simulates distributed training of a single training task, spread across multiple host nodes, on a shared dataset. The current version of the benchmark only supports data parallelism, not model parallelism.

Submitters must respect the following for multi-host node submissions:

  • All the data must be accessible to all the host nodes.
  • The checkpoint location must reside in the same storage system that stores the dataset.
  • The number of simulated accelerators in each host node must be identical.

While it is recommended that all host nodes be as close to identical as possible, that is not required by these Rules. Because distributed training uses a pool-wide common barrier to synchronize the transition of all host nodes from one step to the next, the overall performance of the cluster is determined by the slowest host node.

Here are a few practical suggestions on how to leverage a set of non-identical hardware, but these are not requirements of these Rules. It is possible to leverage very large physical nodes by using multiple Containers or VM guest images per node, each with dedicated affinity to given CPU cores and with DRAM capacity and NUMA locality configured. Alternatively, physical nodes that have more cores or memory than the others may have those additional cores or memory disabled.

For distributed training submissions, CLOSED and OPEN division results must include benchmark runs for the maximum number of simulated accelerators across all host nodes that can be run in the distributed training setup, without going below the 90% accelerator utilization threshold. Each host node must run the same number of simulated accelerators for the submission to be valid.

6. CLOSED and OPEN Divisions

CLOSED: virtually all changes are disallowed

CLOSED represents a level playing field where all results are comparable across submissions. CLOSED explicitly forfeits flexibility in order to enable easy comparability.

In order to accomplish that, most of the optimizations and customizations to the AI/ML algorithms and framework that might typically be applied during benchmarking or even during production use must be disallowed. Optimizations and customizations to the storage system are allowed in CLOSED.

For CLOSED submissions of this benchmark, the MLPerf Storage codebase takes the place of the AI/ML algorithms and framework, and therefore cannot be changed.

A small number of parameters can be configured in CLOSED submissions; they are listed in the table below.

| Parameter | Description | Default |
|-----------|-------------|---------|
| **Dataset parameters** | | |
| dataset.num_files_train | Number of files for the training set | -- |
| dataset.num_subfolders_train | Number of subfolders that the training set is stored in | 0 |
| dataset.data_folder | The path where the dataset is stored | -- |
| **Reader parameters** | | |
| reader.read_threads | Number of threads to load the data | -- |
| reader.computation_threads | Number of threads to preprocess the data (only for BERT) | -- |
| reader.transfer_size | An int64 scalar representing the number of bytes in the read buffer (only supported for TensorFlow) | |
| reader.prefetch_size | An int64 scalar representing the amount of prefetching done, with values of 0, 1, or 2 | |
| **Checkpoint parameters** | | |
| checkpoint.checkpoint_folder | The folder to save the checkpoints | -- |
| **Storage parameters** | | |
| storage.storage_root | The storage root directory | ./ |
| storage.storage_type | The storage type | local_fs |

Table 2: Alterable parameters for CLOSED submissions

CLOSED division benchmarks must be referred to using the benchmark name plus the term CLOSED, e.g. “The system was able to support N ACME X100 accelerators running a CLOSED division 3D U-Net workload at only 8% less than optimal performance.”

OPEN: changes are allowed but must be disclosed

OPEN allows more flexibility to tune and change both the benchmark and the storage system configuration to show off new approaches or new features that will benefit the AI/ML Community. OPEN explicitly forfeits comparability to allow showcasing innovation.

The essence of OPEN division results is that for a given benchmark area, they are “best case” results if optimizations and customizations are allowed. The submitter has the opportunity to show the performance of the storage system if an arbitrary, but documented, set of changes are made to the data storage environment or algorithms.

Changes to DLIO itself are allowed in OPEN division submissions. Any changes to DLIO code or command line options must be disclosed.

While changes to DLIO are allowed, changing the workload itself is not. I.e., how the workload is processed can be changed, but those changes cannot fundamentally change the purpose and result of the training. For example, changing the workload imposed upon storage by a ResNet-50 training task into a 3D U-Net training task is not allowed.

In addition to what can be changed in the CLOSED submission, the following parameters can be changed in the benchmark.sh script:

| Parameter | Description | Default |
|-----------|-------------|---------|
| framework | The machine learning framework. | 3D U-Net: PyTorch; ResNet-50: TensorFlow; CosmoFlow: TensorFlow |
| **Dataset parameters** | | |
| dataset.format | Format of the dataset. | 3D U-Net: .npz; ResNet-50: .tfrecord; CosmoFlow: .tfrecord |
| dataset.num_samples_per_file | Changing this parameter is supported only with TensorFlow, using tfrecord datasets. Currently, the benchmark code only supports num_samples_per_file = 1 for the PyTorch data loader. To support other values, the data loader needs to be adjusted. | 3D U-Net: 1; ResNet-50: 1251; CosmoFlow: 1 |
| **Reader parameters** | | |
| reader.data_loader | Supported options: TensorFlow or PyTorch. OPEN submissions can have custom data loaders. If a new data loader is added, or an existing data loader is changed, the DLIO code will need to be modified. | 3D U-Net: PyTorch (Torch Data Loader); ResNet-50: TensorFlow (TensorFlow Data Loader); CosmoFlow: TensorFlow |

OPEN division benchmark submissions must be run through the benchmark.sh script. The .yaml files cannot be changed (the workload cannot be changed). The parameters can be changed only via the command line in order to more explicitly document what was changed.

OPEN division benchmarks must be referred to using the benchmark name plus the term OPEN, e.g. “The system was able to support N ACME X100 accelerators running an OPEN division 3D U-Net workload at only 8% less than optimal performance.”

7. Submission

A successful run result consists of a mean samples/second measurement (train_throughput_mean_samples_per_second) for a complete benchmark run that achieves mean accelerator utilization (train_au_mean_percentage) equal to or higher than the minimum defined for that workload.
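As an illustration, a submitter could sanity-check a report before packaging it with something like the sketch below. The two field names come from this section; the assumption that they appear as top-level keys in mlperf_storage_report.json, and the workload keys used here, are hypothetical:

```python
import json

MIN_AU_PERCENT = {"unet3d": 90.0, "resnet50": 90.0, "cosmoflow": 70.0}

def check_report(report_path, workload):
    """Hedged sketch: verify a run's mean AU meets the workload minimum."""
    with open(report_path) as f:
        report = json.load(f)
    au = report["train_au_mean_percentage"]
    throughput = report["train_throughput_mean_samples_per_second"]
    ok = au >= MIN_AU_PERCENT[workload]
    print(f"{workload}: {throughput:.2f} samples/s, AU {au:.2f}% -> "
          f"{'valid' if ok else 'below minimum AU'}")
    return ok
```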

Submissions are made via a git push into a private MLCommons repository at github.com. The link to the repo and the required authentication (e.g., user ID, password) to access that repo will only be given to people who have registered their intent to submit results in this round (see below for the link to the form).

Many git push operations can be made using that link, but only the last one before the window closes will be considered. Each git push operation should include all of the individual result submissions that you want to be included. E.g., if you want to submit results for A100 and H100, that would be two submissions but only one git push operation.

Several agreements between the submitter and MLCommons must be completed and signed before the submission due date in order for benchmark results to be submitted. Note: since these are legal agreements, it can take significant time to get them signed, so please plan ahead.

The Intention to submit form is required of everyone who intends to submit results. We collect the email addresses of submitters so we can contact them if needed, to know how many git push authentication credentials to create, and to know who to give those credentials to.

Submitters who are not members of MLCommons need to have signed:

If an organization has already signed these agreements, they do not need to sign them again unless there have been changes to those agreements by MLCommons. Please look at each document for clarification.

What to submit - CLOSED submissions

A complete submission for one workload (3D U-Net, ResNet-50, or CosmoFlow) contains 3 folders:

  1. results folder, containing, for each system:
    • The entire output folder generated by running MLPerf Storage.
    • Final submission JSON summary file mlperf_storage_report.json. The JSON file must be generated using the ./benchmark.sh reportgen script. The ./benchmark.sh reportgen command must be run on the rank0 machine in order to collect the correct set of files for the submission.
    • Structure the output as shown in this example
    • The logs from the dataset generation step that built the files that this benchmark run read from.
  2. systems folder, containing:
    • <system-name>.json
    • <system-name>.pdf
    • For system naming examples look here
  3. code folder, containing:
    • Source code of the benchmark implementation. The submission source code and logs must be made available to other submitters for auditing purposes during the review period.

What to submit - OPEN submissions

  • Everything that is required for a CLOSED submission, following the same structure.
  • Additionally, the source code used for the OPEN Submission benchmark implementations must be available under a license that permits MLCommons to use the implementation for benchmarking.

Directory Structure for CLOSED or OPEN Submissions

```
root_folder (or any name you prefer)
├── Closed
│   └── <submitter_org>
│       ├── code
│       ├── generation_logs
│       ├── results
│       │   ├── system-name-1
│       │   │   ├── unet3d-a100
│       │   │   │   └── ..
│       │   │   ├── unet3d-h100
│       │   │   │   └── ..
│       │   │   ├── resnet-a100
│       │   │   │   └── ..
│       │   │   ├── resnet-h100
│       │   │   │   └── ..
│       │   │   ├── cosmoflow-a100
│       │   │   │   └── ..
│       │   │   └── cosmoflow-h100
│       │   │       └── ..
│       │   └── system-name-2
│       │       ├── unet3d-a100
│       │       │   └── ..
│       │       ├── unet3d-h100
│       │       │   └── ..
│       │       ├── resnet-a100
│       │       │   └── ..
│       │       ├── resnet-h100
│       │       │   └── ..
│       │       ├── cosmoflow-a100
│       │       │   └── ..
│       │       └── cosmoflow-h100
│       │           └── ..
│       └── systems
│           ├── system-name-1.json
│           ├── system-name-1.pdf
│           ├── system-name-2.json
│           └── system-name-2.pdf
│
└── Open
    └── <submitter_org>
        ├── code
        ├── generation_logs
        ├── results
        │   ├── system-name-1
        │   │   ├── unet3d-a100
        │   │   │   └── ..
        │   │   ├── unet3d-h100
        │   │   │   └── ..
        │   │   ├── resnet-a100
        │   │   │   └── ..
        │   │   ├── resnet-h100
        │   │   │   └── ..
        │   │   ├── cosmoflow-a100
        │   │   │   └── ..
        │   │   └── cosmoflow-h100
        │   │       └── ..
        │   └── system-name-2
        │       ├── unet3d-a100
        │       │   └── ..
        │       ├── unet3d-h100
        │       │   └── ..
        │       ├── resnet-a100
        │       │   └── ..
        │       ├── resnet-h100
        │       │   └── ..
        │       ├── cosmoflow-a100
        │       │   └── ..
        │       └── cosmoflow-h100
        │           └── ..
        └── systems
            ├── system-name-1.json
            ├── system-name-1.pdf
            ├── system-name-2.json
            └── system-name-2.pdf
```

System Description

The purpose of the system description is to provide sufficient detail on the storage system under test, and the host nodes running the test, plus the network connecting them, to enable full reproduction of the benchmark results by a third party.

Each submission must contain a <system-name>.json file and a <system-name>.pdf file. If you submit more than one benchmark result, each submission must have a unique <system-name>.json file and a <system-name>.pdf file that documents the system under test and the environment that generated that result, including any configuration options in effect.

Note that, during the review period, submitters may be asked to include additional details in the JSON and pdf to enable reproducibility by a third party.

System Description JSON

The <system-name>.json file must pass a validation check against the JSON schema in use for V1.0. The schema and two examples of its use are provided. For example, check-jsonschema is a convenient tool that is present in many Linux distributions, but other tools may be used.
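For instance, the same check can be done in Python with the jsonschema package (a sketch; the file names passed in are placeholders for the published schema and your system description):

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

def validate_system_description(schema_path, system_json_path):
    """Sketch: validate <system-name>.json against the V1.0 schema."""
    with open(schema_path) as f:
        schema = json.load(f)
    with open(system_json_path) as f:
        system = json.load(f)
    try:
        validate(instance=system, schema=schema)
        print(f"{system_json_path}: passes schema validation")
        return True
    except ValidationError as err:
        print(f"{system_json_path}: failed validation: {err.message}")
        return False
```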

System Description PDF

The goal of the pdf is to complement the JSON file, providing additional detail on the system to enable full reproduction by a third party. We encourage submitters to add details that are more easily captured by diagrams and text descriptions than by JSON.

This file is supposed to include everything that a third party would need in order to recreate the results in the submission, including product model numbers or hardware config details, unit counts of drives and/or components, system and network topologies, software used with version numbers, and any non-default configuration options used by any of the above.

The following recommended structure of the <system-name>.pdf provides a starting point and is optional. Submitters are free to adjust this structure as they see fit.

A great example of a system description pdf can be found here.

If the submission is for a commercial system, a pdf of the product spec document can add significant value. If it is a system that does not have a spec document (e.g., a research system, an HPC system, etc.), or the product spec pdf doesn’t include all the required detail, the document can contain (all of these are optional):

  • Recommended: High-level system diagram, e.g., showing the host node(s), the storage system’s main components, the network topology used when connecting everything (e.g., spine-and-leaf, butterfly, etc.), and any non-default configuration options that were set during the benchmark run.
  • Optional: Additional text description of the system, if the information is not captured in the JSON, e.g., the storage system’s components (make and model, optional features, capabilities, etc.) and all configuration settings that are relevant to ML/AI benchmarks. If the make/model doesn’t specify all the components of the hardware platform it is running on (e.g., it is a software-defined storage product), then those should be included here (just like the client component list).
  • Optional: power requirements – If the system requires the physical deployment of hardware, consider including the “not to exceed” power requirements for the system to run the MLCommons storage benchmark workload. Additional information can include the total nameplate power rating and the peak power consumption during the benchmark.
  • Optional: physical requirements – If the system requires the physical deployment of hardware, consider including the number of rack units, required supporting equipment, and any physical constraints on how the equipment must be installed into an industry-standard rack, such as required spacing, weight constraints, etc. We recommend the following three categories for the text description:
    1. Software,
    2. Hardware, and
    3. Settings.

8. Review

Visibility of results and code during review

During the review process, only certain groups are allowed to inspect results and code.

| Group | Can Inspect |
|-------|-------------|
| Review committee | All results, all code |
| Submitters | All results, all code |
| Public | No results, no code |

Filing objections

Submitters must officially file objections to other submitters’ code by creating a GitHub issue prior to the “Filing objections” deadline that cites the offending lines, the rules section violated, and, if pertinent, corresponding lines of the reference implementation that are not equivalent. Each objection must be filed with a “by <org>” tag and an “against <org>” tag. Multiple organizations may append their “by <org>” to an existing objection if desired. If an objector comes to believe the objection is in error, they may remove their “by <org>” tag. All objections with no “by <org>” tags at the end of the filing deadline will be closed. Submitters should file an objection, then discuss with the other submitter to verify whether the objection is correct. Following the filing of an issue but before resolution, both the objecting submitter and the owning submitter may add comments to help the review committee understand the problem. If the owning submitter acknowledges the problem, they may append the “fix_required” tag and begin to fix the issue.

Resolving objections

The review committee will review each objection, and either establish consensus or vote. If the committee votes to support an objection, it will provide some basic guidance on an acceptable fix and append the “fix_required” tag. If the committee votes against an objection, it will close the issue.

Fixing objections

Code should be updated via a pull request prior to the “fixing objections” deadline. Following submission of all fixes, the owning submitter should confirm with the objector(s) that the objection has been addressed and ask them to remove their “by <org>” tags. If the objector is not satisfied by the fix, then the review committee will decide the issue at its final review meeting. The review committee may vote to accept a fix and close the issue, or reject a fix and request that the submission be moved to open or withdrawn.

Withdrawing results / changing division

Anytime up until the final human readable deadline (typically within 2-3 business days before the press call), an entry may be withdrawn by amending the pull request. Alternatively, an entry may be voluntarily moved from the CLOSED division to the OPEN division. Each benchmark results submission is treated separately for reporting in the results table and in terms of withdrawing it. For example, submitting a 3D U-Net run with 20 clients and 80 A100 accelerators is separate from submitting a 3D U-Net run with 19 clients and 76 accelerators.

9. Roadmap for future MLPerf Storage releases

The Working Group is very interested in your feedback. Please contact [email protected] with any suggestions.

Our working group aims to add the following features in a future version of the benchmark:

  • We plan to add support for the “data pre-processing” phase of AI/ML workload as we are aware that this is a significant load on a storage system and is not well represented by existing AI/ML benchmarks.
  • Add support for other types of storage systems (e.g., Object Stores) in the CLOSED division.
  • Expand the number of workloads in the benchmark suite, e.g., add a large language model (GPT3) and a diffusion model (Stable Diffusion).
  • Add support for PyTorch and TensorFlow in the CLOSED division for all workloads.
  • Continue adding support for more types of accelerators.
  • We plan to add support for benchmarking a storage system while running more than one MLPerf Storage benchmark at the same time (i.e., more than one training job type, such as 3D U-Net and a recommender at the same time), but the current version requires that a submission only include one such job type per submission.