Skip to content

Execution Layer

Supun Nakandala edited this page Mar 2, 2021 · 5 revisions

All code related to the backend layer are located inside cerebro.backend package. Cerebro currently support only Spark as the execution backend. The execution backend integrated with one of the storage layers and provides the following features:

  • Materializing the training data: e.g., Writing the (pre-proceesed) training data in Spark DataFrames into HDFS storage.
  • Implement model hopper parallelism.

Details on how training data materialization happens is provided under the storage layer details. We next provide more details on the model hopper parallelism implementation in the Spark backend.

Model Hopper Parallelism Implementation in the Spark Backend

Model hopper parallelism requires running multiple workers that perform sub-epoch training and a driver program that orchestrates the scheduling of sub-epochs on workers. In the spark backend, workers are implemented as long running services that gets invoked inside long-running map paritition functions in Spark. We create a dummy RDD with number of partions set to the number of workers we need for Cerebro and use that to invoke the map partition function. After the initiation, these services register with the Cerebro driver by providing service host and port number. Cerebro driver also runs a service for this registration purpose which runs in the same machine where Spark driver runs. Cerbro services can be intialized by calling the backend.initialize_workers() method.

Cerebro worker service in Spark is cable of performing the following tasks:

  1. Initialize data loaders: This method initializes the Petastorm training and validation data loaders for the model selection workload. It should be done once upfrom for every model selection workload. After the intialization, data loaders will be cached and reused for all model trainings.
  2. Execute sub-epoch: This method invokes the training of a single sub-epoch. The training procedure is wrapped inside serialized a function and send to the service as a parameter. The service deserializes the function and will run the function. All model training aspects including model initialization, checkpoint restoring, and checkpoint creation are handled by this function. The Cerebro service is agnostic to the details of how model training works. It just invokes the provided trainer function by passing the initialized data loaders.
  3. Teardown service: Stops the running service and finishes the encapsulating map partition function.

Spark backend implementation uses the above tasks provided by the worker service to implement model hopper paralleism. It provides a per-epoch based scheduling functionality (backend.train_for_one_epoch(...)) to the higher-level model selection APIs. Essentially, it takes in a set of model configurations to be trained and completes one epoch of training for all models by invoking the required sub-epochs. Finally, it returns the training metrics of the models back.

Adding Support for a New Execution Backend

In order to add support for a new execution backend, one has to implement a new class extending the cerebro.backend.Backend class.

Clone this wiki locally