Redesigning the Cardea Class #85

Open
sarahmish opened this issue Apr 1, 2021 · 1 comment · May be fixed by #92
Labels
enhancement New feature or request

sarahmish commented Apr 1, 2021

This issue is to track the development of the Cardea class. Previous updates were also mentioned in #73.

The Cardea class is responsible for handling and interacting with all the components (data_assembler, data_labeler, featurizer, and modeler).

Overall, the Cardea class:

  • Provides simple user-facing abstractions
    • label: generate label times
    • featurize: generate feature matrix
    • fit/predict
    • evaluate
    • save/load
  • Hides away the interaction with other systems
    • Entityset
    • Featuretools DeepFeatureSynthesis
    • ComposeML
    • MLBlocks Pipelines
    • Pipeline Selection and Tuning

Design choices:

  1. Remove load_entityset and assume that a Cardea instance deals with only one data source. The data is loaded upon instantiation.
  2. Rename generate_label_time -> label.
  3. Rename generate_feature_matrix -> featurize.
  4. Allow the user to inspect label_times and feature_matrix.
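
To make the design choices above concrete, here is a minimal sketch of how they could fit together. It is a toy stand-in, not the Cardea implementation: `CardeaSketch`, the list-of-dicts data format, and the `id`, `time`, `age`, and `missed` fields are all made up for illustration.

```python
from typing import Callable, List, Optional


class CardeaSketch:
    """Toy stand-in for the proposed class (hypothetical, not the real Cardea)."""

    def __init__(self, data: List[dict], labeler: Callable[[dict], int]):
        # Design choice 1: a single data source, loaded at instantiation.
        self._data = list(data)
        self._labeler = labeler
        # Design choice 4: intermediate results stay available for inspection.
        self.label_times: Optional[List[dict]] = None
        self.feature_matrix: Optional[List[dict]] = None

    def label(self, labeler: Optional[Callable[[dict], int]] = None) -> List[dict]:
        """Design choice 2: takes the place of generate_label_time."""
        labeler = labeler or self._labeler
        self.label_times = [
            {"id": row["id"], "cutoff_time": row["time"], "label": labeler(row)}
            for row in self._data
        ]
        return self.label_times

    def featurize(self, label_times: List[dict]) -> List[dict]:
        """Design choice 3: takes the place of generate_feature_matrix."""
        by_id = {row["id"]: row for row in self._data}
        self.feature_matrix = [
            {"id": entry["id"], "age": by_id[entry["id"]]["age"]}
            for entry in label_times
        ]
        return self.feature_matrix


# Made-up records standing in for a loaded dataset.
records = [
    {"id": 1, "time": "2021-01-01", "age": 34, "missed": True},
    {"id": 2, "time": "2021-01-02", "age": 58, "missed": False},
]
cardea = CardeaSketch(records, labeler=lambda row: int(row["missed"]))
label_times = cardea.label()
feature_matrix = cardea.featurize(label_times)
```

After both calls, `cardea.label_times` and `cardea.feature_matrix` hold the same objects that were returned, which is one way to satisfy the inspection requirement.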

This should be the class public interface:

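from types import FunctionType
from typing import List, Union

import numpy as np
import pandas as pd
from mlblocks import MLPipeline


class Cardea: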

    def __init__(self, 
                 data: str = DEFAULT_DATA, 
                 labeler: FunctionType = DEFAULT_LABELER,
                 pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE, 
                 hyperparameters: dict = None):
        pass

    def label(self, 
              labeler: FunctionType = None,
              parameter: dict = None) -> pd.DataFrame:
        """Create label times using the data labeler.

        Args:
            labeler (function):
                Function that defines the prediction task. It should return a
                tuple of the labeling function, the dataframe, and the name of
                the target entity.
            parameter (dict):
                Variables to change the default parameters, if any.

        Returns:
            pandas.DataFrame:
                A dataframe of cutoff times and their target labels.
        """
        pass

    def featurize(self, 
                  label_times: pd.DataFrame,
                  verbose: bool = False) -> pd.DataFrame:
        """Return the calculated feature matrix.

        Args:
            label_times (pandas.DataFrame):
                A dataframe that indicates cutoff time for each instance.
            verbose (bool):
                Indicate verbosity of the featurization.

        Returns:
            pandas.DataFrame:
                Generated feature matrix.
        """
        pass

    def fit(self, 
            X: Union[np.ndarray, pd.DataFrame], 
            y: Union[np.ndarray, pd.Series, list],
            tune: bool = False, 
            max_evals: int = 10, 
            scoring: str = None,
            verbose: bool = False) -> None:
        """Train the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.
        """
        pass

    def predict(self, X: Union[np.ndarray, pd.DataFrame]) -> Union[np.ndarray, list]:
        """Get predictions from the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.

        Returns:
            numpy.ndarray or list:
                Predictions to the input data.
        """
        pass

    def fit_predict(self, 
                    X: Union[np.ndarray, pd.DataFrame],
                    y: Union[np.ndarray, pd.Series, list], 
                    tune: bool = False,
                    max_evals: int = 10, 
                    scoring: str = None,
                    verbose: bool = False) -> Union[np.ndarray, list]:
        """Train a cardea pipeline then make predictions.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            numpy.ndarray or list:
                Predictions to the input data.
        """
        pass

    def evaluate(self, 
                 X: Union[np.ndarray, pd.DataFrame], 
                 y: Union[np.ndarray, pd.Series, list],
                 test_size: float = 0.2, 
                 shuffle: bool = True,
                 fit: bool = False,
                 tune: bool = False, 
                 max_evals: int = 10, 
                 scoring: str = None,
                 metrics: List[str] = DEFAULT_METRICS, 
                 verbose: bool = False) -> pd.Series:
        """Evaluate the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            test_size (float):
                The proportion of the dataset to include in the test dataset.
            shuffle (bool):
                Whether or not to shuffle the data before splitting.
            fit (bool):
                Whether to fit the pipeline before evaluating it.
                Defaults to ``False``.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            metrics (list):
                A list of scoring function names. The scoring functions should be consistent
                with the problem type.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            pandas.Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

    def save(self, path: str) -> None:
        """Save this object using pickle.

        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass

    @classmethod
    def load(cls, path: str) -> 'Cardea':
        """Load a Cardea instance from a pickle file.

        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.

        Returns:
            Cardea:
                A Cardea instance

        Raises:
            ValueError:
                If the serialized object is not a Cardea instance.
        """
        pass
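
The save/load contract described in the docstrings (pickle serialization, and a ValueError when the file holds something else) could look roughly like the sketch below. `CardeaSketch` and its `pipeline` attribute are placeholders, not the real class.

```python
import os
import pickle
import tempfile


class CardeaSketch:
    """Hypothetical stand-in used only to illustrate save/load."""

    def __init__(self, pipeline: str = "demo_pipeline"):
        self.pipeline = pipeline

    def save(self, path: str) -> None:
        # Serialize the whole instance to the given file.
        with open(path, "wb") as pickle_file:
            pickle.dump(self, pickle_file)

    @classmethod
    def load(cls, path: str) -> "CardeaSketch":
        with open(path, "rb") as pickle_file:
            instance = pickle.load(pickle_file)
        # Reject files that contain anything other than this class.
        if not isinstance(instance, cls):
            raise ValueError("Serialized object is not a CardeaSketch instance")
        return instance


with tempfile.TemporaryDirectory() as tmp_dir:
    # Round trip: save an instance and load it back.
    good_path = os.path.join(tmp_dir, "cardea.pkl")
    CardeaSketch(pipeline="some_pipeline").save(good_path)
    restored = CardeaSketch.load(good_path)

    # A pickle file holding a plain dict should be rejected.
    bad_path = os.path.join(tmp_dir, "other.pkl")
    with open(bad_path, "wb") as pickle_file:
        pickle.dump({"not": "cardea"}, pickle_file)
    try:
        CardeaSketch.load(bad_path)
        rejected = False
    except ValueError:
        rejected = True
```

Note that the type check runs only after `pickle.load` has already executed, so it guards against loading the wrong kind of object, not against untrusted files.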

In addition to the main APIs, there will be helper functions such as set_pipeline and train_test_split, which give users some flexibility in modifying the MLPipeline and its hyperparameters.
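
As a rough illustration of what a train_test_split helper might do, here is a standard-library sketch that reuses the `test_size` and `shuffle` parameters from `evaluate`. The signature, the `seed` argument, and the choice to take the test rows from the end of the data are assumptions, not the planned implementation.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split(X: Sequence, y: Sequence, test_size: float = 0.2,
                     shuffle: bool = True,
                     seed: int = 0) -> Tuple[List, List, List, List]:
    """Split X and y into train and test parts (hypothetical helper)."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")

    indices = list(range(len(X)))
    if shuffle:
        # A seeded generator keeps the split reproducible (assumed behavior).
        random.Random(seed).shuffle(indices)

    # Keep at least one row in the test part.
    n_test = max(1, round(len(indices) * test_size))
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]

    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

# Without shuffling, the last 20% of the rows become the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

# With shuffling, the same rows are present but in a permuted order.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X, y, test_size=0.2)
```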

sarahmish added the enhancement label Apr 2, 2021

sarahmish commented

Additional requirement:

  • Add additional data: a new function that lets users specify another data path containing additional tables and/or columns.

Proposing the following change:

  1. Make the load_entityset method create an entityset from scratch.
  2. Create a new function, add_entities, that expects the path of the new data.
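
The two proposed changes can be sketched with plain CSV folders. Everything here is an assumption made for illustration: `EntitySetSketch`, the one-CSV-per-table layout, the `key` argument, and the rule that new columns are matched to existing rows on an `id` column. The real implementation would work with a Featuretools entityset instead.

```python
import csv
import os
import tempfile
from typing import Dict, List

Table = List[Dict[str, str]]


def read_tables(path: str) -> Dict[str, Table]:
    """Read every CSV file in ``path`` into a table keyed by file name."""
    tables = {}
    for file_name in sorted(os.listdir(path)):
        if file_name.endswith(".csv"):
            with open(os.path.join(path, file_name), newline="") as csv_file:
                tables[file_name[:-4]] = list(csv.DictReader(csv_file))
    return tables


class EntitySetSketch:
    """Hypothetical container standing in for an entityset."""

    def __init__(self) -> None:
        self.tables: Dict[str, Table] = {}

    def load_entityset(self, path: str) -> None:
        # Change 1: build from scratch, discarding any earlier tables.
        self.tables = read_tables(path)

    def add_entities(self, path: str, key: str = "id") -> None:
        # Change 2: add new tables, and new columns for known tables.
        for name, rows in read_tables(path).items():
            if name not in self.tables:
                self.tables[name] = rows
                continue
            extra = {row[key]: row for row in rows}
            for row in self.tables[name]:
                row.update(extra.get(row[key], {}))


def write_csv(path: str, name: str, header: list, rows: list) -> None:
    with open(os.path.join(path, name + ".csv"), "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as base, tempfile.TemporaryDirectory() as extra:
    # Made-up tables: a base folder plus a folder of additional data.
    write_csv(base, "patients", ["id", "age"], [["1", "34"], ["2", "58"]])
    write_csv(extra, "patients", ["id", "gender"], [["1", "F"]])
    write_csv(extra, "visits", ["id", "patient_id"], [["10", "1"]])

    entityset = EntitySetSketch()
    entityset.load_entityset(base)
    entityset.add_entities(extra)
    tables = entityset.tables
```

In this sketch, rows without a match in the additional data are left unchanged rather than filled with empty values.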
