Redesigning the Cardea Class #85

sarahmish · 2021-04-01T23:21:37Z

This issue tracks the development of the Cardea class. Earlier updates are also mentioned in #73.

The Cardea class is responsible for handling and interacting with all the components (data_assembler, data_labeler, featurizer, and modeler).

Overall, the Cardea class:

- Provides simple user-facing abstractions:
  - label: generate label times
  - featurize: generate feature matrix
  - fit/predict
  - evaluate
  - save/load
- Hides away the interaction with other systems:
  - EntitySet
  - Featuretools Deep Feature Synthesis
  - ComposeML
  - MLBlocks Pipelines
  - Pipeline Selection and Tuning

Design choices:

- Remove load_entityset and assume that a Cardea instance deals with only one data source. The data is loaded upon instantiation.
- Rename generate_label_time -> label.
- Rename generate_feature_matrix -> featurize.
- Allow the user to inspect label_times and feature_matrix.
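
To make these choices concrete, here is a minimal, self-contained sketch. It is not the Cardea implementation: `CardeaSketch`, the list-of-dicts data, the `stay_days` column, and the one-argument labeler are all hypothetical stand-ins, and real label times would also carry cutoff times. It only illustrates that data arrives at instantiation and that the results of label and featurize stay inspectable.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]  # hypothetical row format, for illustration only


class CardeaSketch:
    """Toy stand-in; real labeling and featurization are far richer."""

    def __init__(self, data: List[Record]) -> None:
        # A single data source, loaded once at instantiation
        # (instead of a separate load_entityset call).
        self.data = list(data)
        self.label_times = None
        self.feature_matrix = None

    def label(self, labeler: Callable[[Record], int]) -> List[Record]:
        # Formerly generate_label_time; the result is kept for inspection.
        self.label_times = [
            {"id": row["id"], "label": labeler(row)} for row in self.data
        ]
        return self.label_times

    def featurize(self, label_times: List[Record]) -> List[Record]:
        # Formerly generate_feature_matrix; one toy feature per labeled row.
        by_id = {row["id"]: row for row in self.data}
        self.feature_matrix = [
            {"id": lt["id"], "stay_days": by_id[lt["id"]]["stay_days"]}
            for lt in label_times
        ]
        return self.feature_matrix


cardea = CardeaSketch([{"id": 1, "stay_days": 2}, {"id": 2, "stay_days": 9}])
label_times = cardea.label(lambda row: int(row["stay_days"] > 7))
feature_matrix = cardea.featurize(label_times)
```

After these calls, `cardea.label_times` and `cardea.feature_matrix` hold the same objects that were returned, so the user can look at them before training.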

This should be the class public interface:

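from types import FunctionType
from typing import List, Union

import numpy as np
import pandas as pd
from mlblocks import MLPipeline


class Cardea: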

    def __init__(self, 
                 data: str = DEFAULT_DATA, 
                 labeler: FunctionType = DEFAULT_LABELER,
                 pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE, 
                 hyperparameters: dict = None):
        pass

    def label(self, 
              labeler: FunctionType = None,
              parameter: dict = None) -> pd.DataFrame:
        """Create label times using the data labeler.

        Args:
            labeler (function):
                Function that defines the prediction task. It should return a
                tuple of the labeling function, the dataframe, and the name
                of the target entity.
            parameter (dict):
                Values that override the labeler's default parameters, if any.

        Returns:
            pandas.DataFrame:
                A dataframe of cutoff times and their target labels.
        """
        pass

    def featurize(self, 
                  label_times: pd.DataFrame,
                  verbose: bool = False) -> pd.DataFrame:
        """Return the calculated feature matrix.

        Args:
            label_times (pandas.DataFrame):
                A dataframe that indicates cutoff time for each instance.
            verbose (bool):
                Indicate verbosity of the featurization.

        Returns:
            pandas.DataFrame:
                Generated feature matrix.
        """
        pass

    def fit(self, 
            X: Union[np.ndarray, pd.DataFrame], 
            y: Union[np.ndarray, pd.Series, list],
            tune: bool = False, 
            max_evals: int = 10, 
            scoring: str = None,
            verbose: bool = False) -> None:
        """Train the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.
        """
        pass

    def predict(self, X: Union[np.ndarray, pd.DataFrame]) -> Union[np.ndarray, list]:
        """Get predictions from the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.

        Returns:
            numpy.ndarray or list:
                Predictions to the input data.
        """
        pass

    def fit_predict(self, 
                    X: Union[np.ndarray, pd.DataFrame],
                    y: Union[np.ndarray, pd.Series, list], 
                    tune: bool = False,
                    max_evals: int = 10, 
                    scoring: str = None,
                    verbose: bool = False) -> Union[np.ndarray, list]:
        """Train a cardea pipeline then make predictions.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            numpy.ndarray or list:
                Predictions to the input data.
        """
        pass

    def evaluate(self, 
                 X: Union[np.ndarray, pd.DataFrame], 
                 y: Union[np.ndarray, pd.Series, list],
                 test_size: float = 0.2, 
                 shuffle: bool = True,
                 fit: bool = False,
                 tune: bool = False, 
                 max_evals: int = 10, 
                 scoring: str = None,
                 metrics: List[str] = DEFAULT_METRICS, 
                 verbose: bool = False) -> pd.Series:
        """Evaluate the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            test_size (float):
                The proportion of the dataset to include in the test dataset.
            shuffle (bool):
                Whether or not to shuffle the data before splitting.
            fit (bool):
                Whether to fit the pipeline before evaluating it.
                Defaults to ``False``.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            metrics (list):
                A list of scoring function names. The scoring functions should be consistent
                with the problem type.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            pandas.Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

    def save(self, path: str) -> None:
        """Save this object using pickle.

        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass

    @classmethod
    def load(cls, path: str) -> 'Cardea':
        """Load a Cardea instance from a pickle file.

        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.

        Returns:
            Cardea:
                A Cardea instance

        Raises:
            ValueError:
                If the serialized object is not a Cardea instance.
        """
        pass
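
The save/load contract in the docstrings above can be sketched with the standard library alone. `PersistableCardea` and its `pipeline` attribute are hypothetical placeholders rather than the real class; the point is only the round trip through pickle and the `ValueError` raised when the file holds something else.

```python
import os
import pickle
import tempfile


class PersistableCardea:
    """Hypothetical stand-in; only save/load are sketched here."""

    def __init__(self, pipeline: str = "hypothetical_pipeline") -> None:
        self.pipeline = pipeline

    def save(self, path: str) -> None:
        # Serialize the whole instance into a single pickle file.
        with open(path, "wb") as pickle_file:
            pickle.dump(self, pickle_file)

    @classmethod
    def load(cls, path: str) -> "PersistableCardea":
        # Unpickling can run arbitrary code: only load trusted files.
        with open(path, "rb") as pickle_file:
            instance = pickle.load(pickle_file)
        if not isinstance(instance, cls):
            raise ValueError(
                "Serialized object is not a {} instance".format(cls.__name__)
            )
        return instance


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cardea.pkl")
    PersistableCardea(pipeline="demo").save(path)
    restored = PersistableCardea.load(path)
```

Making load a classmethod lets callers write `PersistableCardea.load(path)` without first building an instance, which is why the interface takes `cls` rather than `self`.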

In addition to the main API, helper functions such as set_pipeline and train_test_split will give users some flexibility in modifying the MLPipeline and its hyperparameters.
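
The issue does not spell out the signature of train_test_split. As a rough sketch only, a hold-out split that honors the `test_size` and `shuffle` arguments of evaluate could look like the pure-Python function below. The `seed` parameter and the list-based return values are assumptions made for illustration, and the real helper may well delegate to an existing library instead.

```python
import random
from typing import List, Optional, Sequence, Tuple


def train_test_split(X: Sequence, y: Sequence, test_size: float = 0.2,
                     shuffle: bool = True,
                     seed: Optional[int] = None) -> Tuple[List, List, List, List]:
    """Split X and y into train and test parts (illustrative sketch)."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")

    indices = list(range(len(X)))
    if shuffle:
        # `seed` is a hypothetical extra argument for reproducibility.
        random.Random(seed).shuffle(indices)

    # The last `n_test` positions of the (possibly shuffled) order
    # become the test set.
    n_test = int(round(len(indices) * test_size))
    split = len(indices) - n_test
    train_idx, test_idx = indices[:split], indices[split:]

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```

Because the same index order is applied to both `X` and `y`, each input stays paired with its target whether or not the data is shuffled.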


sarahmish · 2021-04-07T05:02:29Z

Additional requirement:

- Add additional data: a new function that lets users specify another data path containing additional tables and/or columns.

Proposed changes:

- Make the load_entityset method create an entityset from scratch.
- Create a new function, add_entities, that expects the path of the new data.
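
The intended difference between the two methods can be sketched as follows. This is a toy model, not the proposed implementation: `EntitySetSketch` is a hypothetical name, tables are plain in-memory dicts of columns rather than files read from a data path, and overwriting same-named columns is an assumption the issue does not settle.

```python
from typing import Dict, List

Table = Dict[str, List]  # column name -> column values (illustrative format)


class EntitySetSketch:
    """Toy container contrasting load_entityset with add_entities."""

    def __init__(self) -> None:
        self.tables: Dict[str, Table] = {}

    def load_entityset(self, tables: Dict[str, Table]) -> None:
        # Start from scratch: anything loaded earlier is discarded.
        self.tables = {name: dict(columns) for name, columns in tables.items()}

    def add_entities(self, tables: Dict[str, Table]) -> None:
        # Extend what is already loaded: unknown tables are added,
        # known tables gain the new columns (same-named columns are
        # overwritten, which is an assumption of this sketch).
        for name, columns in tables.items():
            self.tables.setdefault(name, {}).update(columns)


entityset = EntitySetSketch()
entityset.load_entityset({"patients": {"id": [1, 2]}})
entityset.add_entities({"patients": {"age": [34, 51]}, "visits": {"id": [10]}})
```

Here the `patients` table ends up with both `id` and `age`, and `visits` appears as a new table; a second call to `load_entityset` would drop all of it and rebuild from the tables passed in.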

sarahmish added the "enhancement" (New feature or request) label on Apr 2, 2021

This was referenced Apr 14, 2021

Cardea class predict functionality #90

Open

Cardea class, functional api, and compose #92

Open
