scnakandala edited this page Feb 25, 2021 · 9 revisions

Welcome to the Cerebro wiki! This wiki is primarily intended for those who wish to contribute to cerebro-system.


Package Diagram

The Cerebro architecture is expected to continue evolving based on research and development needs. Architecture evolution is incremental within minor releases and significant across major releases. The current high-level architecture is as follows:

Execution Layer

Storage Layer

Cerebro currently supports two different storage media: the local file system/NFS (cerebro.storage.LocalStore) and HDFS (cerebro.storage.HDFSStore). The storage medium is used for the following tasks:

  • Storing training data
  • Storing model checkpoints
  • Storing model training metrics in TensorBoard format.

In order to create a Storage object, the user has to specify the prefix path to a directory. Cerebro will create the following sub-directory structure inside that directory to organize the above data artifacts.

  • <prefix_path>/train_data : Contains all the training data
  • <prefix_path>/val_data : Contains all the validation data
  • <prefix_path>/runs : Contains the latest checkpoint for every model organized in its own sub-directory named by the model ID.
  • <prefix_path>/logs : Contains the model training metrics logged in TensorBoard format. Each model has its own sub-directory named by the model ID.
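The default layout above can be sketched in plain Python. This is only an illustration of the directory convention using the standard library, not code from the cerebro package; the helper name `make_storage_layout` is hypothetical.

```python
import tempfile
from pathlib import Path

def make_storage_layout(prefix_path):
    """Create the sub-directories Cerebro's storage convention describes.

    Illustrative only: mirrors the default naming convention documented
    above; Cerebro's Storage classes create these directories themselves.
    """
    prefix = Path(prefix_path)
    for sub in ("train_data", "val_data", "runs", "logs"):
        (prefix / sub).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in prefix.iterdir())

layout = make_storage_layout(tempfile.mkdtemp())
print(layout)  # ['logs', 'runs', 'train_data', 'val_data']
```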

It is also possible to override the above default naming convention when creating the Storage object. For more details see here.

Generating Training and Validation Data

Training and validation data can be generated in two ways:

  1. As part of the model selection invocation : In this approach, users simply pass the input data (e.g., in the form of a Spark DataFrame) directly into the model selection object's .fit(df) method. Behind the scenes, Cerebro will materialize the data into the storage medium before invoking the model selection process. Training and validation splits are generated based on either a user-specified fraction or an indicator column in the input DataFrame, which has to be set when initializing the model selection object.

  2. As a separate step without any model selection invocation : In this approach, users can materialize the training data as a separate step. To do this, they have to create a Cerebro backend object (e.g., cerebro.backend.SparkBackend) and invoke its .prepare_data(..) method, providing a storage object and the input data (e.g., a Spark DataFrame) as follows:

     backend = SparkBackend(spark_context=...)
     storage = HDFSStore('hdfs://host:port/exp_data')
     backend.prepare_data(storage, df, validation=0.25, feature_columns=['features'], label_columns=['label'])
    

    Once the above step has been done, users can create a Storage object pointing to the same storage directory and use it when performing model selection. Instead of the .fit(df) method, they then need to invoke the .fit_on_prepared_data() method.
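The two split strategies described in approach 1 — holding out a user-specified fraction versus honoring an indicator column in the input data — can be sketched in plain Python. This is an illustrative sketch, not Cerebro's actual implementation; the function names and the `is_val` column are hypothetical.

```python
import random

def split_by_fraction(rows, validation=0.25, seed=42):
    """Randomly hold out a user-specified fraction of rows for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation)
    return shuffled[n_val:], shuffled[:n_val]

def split_by_indicator(rows, indicator_col="is_val"):
    """Split based on a boolean indicator column present in each row."""
    train = [r for r in rows if not r[indicator_col]]
    val = [r for r in rows if r[indicator_col]]
    return train, val

rows = [{"features": [i], "label": i % 2, "is_val": i >= 8} for i in range(10)]
train, val = split_by_fraction(rows, validation=0.2)
print(len(train), len(val))  # 8 2
train2, val2 = split_by_indicator(rows)
print(len(train2), len(val2))  # 8 2
```

In Cerebro the same choice surfaces as the `validation` argument seen in the snippet above, which can take either a fraction or a column name.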

For both approaches, Cerebro uses the Petastorm library under the hood to materialize the data and subsequently read it during model training. Petastorm enables the use of Parquet storage from TensorFlow, PyTorch, and other Python-based ML training frameworks. One could also generate training and validation data using Petastorm outside of Cerebro and use them in Cerebro for model training.

All code related to the storage layer is located inside the cerebro.storage package.
