scnakandala edited this page Feb 25, 2021 · 9 revisions

Welcome to the Cerebro wiki! This wiki is primarily intended for those who wish to contribute to cerebro-system.


Package Diagram

The Cerebro architecture is expected to continue evolving based on research and development needs. Architecture evolution is incremental within minor releases and significant across major releases. The current high-level architecture is as follows:

Execution Layer

Storage Layer

Cerebro currently supports two different storage media: the local file system/NFS (cerebro.storage.LocalStore) and HDFS (cerebro.storage.HDFSStore). The storage medium is used for the following tasks:

  • Storing training data
  • Storing model checkpoints
  • Storing model training metrics in TensorBoard format.

In order to create a Storage object, the user has to specify the prefix path to a directory. Cerebro will create the following sub-directory structure inside that directory to organize the above data artifacts.

  • <prefix_path>/train_data : Contains all the training data
  • <prefix_path>/val_data : Contains all the validation data
  • <prefix_path>/runs : Contains the latest checkpoint for every model organized in its own sub-directory named by the model ID.
  • <prefix_path>/logs : Contains the model training metrics logged in TensorBoard format. Each model has its own sub-directory named by the model ID.
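The default layout above can be sketched in plain Python. This is only an illustration of the directory convention using the standard library, not code from the cerebro package; the helper name `make_storage_layout` is hypothetical.

```python
import tempfile
from pathlib import Path

def make_storage_layout(prefix_path):
    """Create the sub-directories Cerebro's storage convention describes.

    Illustrative only: mirrors the default naming convention documented
    above; Cerebro's Storage classes create these directories themselves.
    """
    prefix = Path(prefix_path)
    for sub in ("train_data", "val_data", "runs", "logs"):
        (prefix / sub).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in prefix.iterdir())

layout = make_storage_layout(tempfile.mkdtemp())
print(layout)  # ['logs', 'runs', 'train_data', 'val_data']
```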

It is also possible to override the above default naming convention when creating the Storage object. For more details see here.

Generating Training and Validation Data

Training and validation data can be generated in two ways:

  1. As part of the model selection invocation : In this approach, users simply pass the input data (e.g., in the form of a Spark DataFrame) directly into the model selection object's .fit(df) method. Behind the scenes, Cerebro will materialize the data into the storage medium before invoking the model selection process. Training and validation splits are generated based on either a user-specified fraction or an indicator column in the input DataFrame, which has to be set when initializing the model selection object.

  2. As a separate step without any model selection invocation : In this approach, users can materialize the training data as a separate step. To do this, they have to create a Cerebro backend object (e.g., cerebro.backend.SparkBackend) and invoke its .prepare_data(..) method, providing a storage object and the input data (e.g., a Spark DataFrame) as follows:

     backend = SparkBackend(spark_context=...)
     storage = HDFSStore('hdfs://host:port/exp_data')
     backend.prepare_data(storage, df, validation=0.25, feature_columns=['features'], label_columns=['label'])
    

    Once the above step has been done, users can create a Storage object pointing to the same storage directory and use it when performing model selection. Instead of the .fit(df) method, they then need to invoke the .fit_on_prepared_data() method.
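The two split strategies described in approach 1 — holding out a user-specified fraction versus honoring an indicator column in the input data — can be sketched in plain Python. This is an illustrative sketch, not Cerebro's actual implementation; the function names and the `is_val` column are hypothetical.

```python
import random

def split_by_fraction(rows, validation=0.25, seed=42):
    """Randomly hold out a user-specified fraction of rows for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation)
    return shuffled[n_val:], shuffled[:n_val]

def split_by_indicator(rows, indicator_col="is_val"):
    """Split based on a boolean indicator column present in each row."""
    train = [r for r in rows if not r[indicator_col]]
    val = [r for r in rows if r[indicator_col]]
    return train, val

rows = [{"features": [i], "label": i % 2, "is_val": i >= 8} for i in range(10)]
train, val = split_by_fraction(rows, validation=0.2)
print(len(train), len(val))  # 8 2
train2, val2 = split_by_indicator(rows)
print(len(train2), len(val2))  # 8 2
```

In Cerebro the same choice surfaces as the `validation` argument seen in the snippet above, which can take either a fraction or a column name.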

For both approaches, Cerebro uses the Petastorm library under the hood to materialize the data and subsequently read it during model training. Petastorm enables the use of Parquet storage from TensorFlow, PyTorch, and other Python-based ML training frameworks. One could also generate training and validation data using Petastorm outside of Cerebro and use them in Cerebro for model training.

All code related to the storage layer is located inside the cerebro.storage package.
