Feature names unseen at fit time #722

Open
yiannis-gkoufas opened this issue Apr 20, 2024 · 7 comments

Comments

@yiannis-gkoufas

Hi!

I want to use mljar for binary classification with two classes (category1 and category2).
I pass the following parameters to AutoML:

automl = AutoML(
    results_path=str(model_directory),
    mode="Compete",
    total_time_limit=600 * 600,
    golden_features=True,
    features_selection=True,
    ml_task="binary_classification",
)

In params.json I see "best_model": "Ensemble_Stacked".
When I try to run a prediction, I get:

Feature names unseen at fit time:
- 100_LightGBM_GoldenFeatures_prediction
- 101_LightGBM_GoldenFeatures_prediction
- 103_LightGBM_GoldenFeatures_prediction
- 105_Xgboost_prediction
- 108_CatBoost_prediction
- ...
Feature names seen at fit time, yet now missing:
- 100_LightGBM_GoldenFeatures_prediction_0_for_category1_1_for_category2
- 101_LightGBM_GoldenFeatures_prediction_0_for_category1_1_for_category2
- 103_LightGBM_GoldenFeatures_prediction_0_for_category1_1_for_category2
- 105_Xgboost_prediction_0_for_category1_1_for_category2
- 108_CatBoost_prediction_0_for_category1_1_for_category2

Any help would be appreciated!
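
For anyone skimming the error above: each name listed as missing looks like one of the unseen names with a class-mapping suffix appended. Below is a minimal plain-Python sketch for confirming that pattern from the two lists in the message. The helper name `extra_suffixes` is hypothetical and not part of mljar-supervised; the names are copied from the error output.

```python
# Names copied from the error message above.
unseen = [
    "100_LightGBM_GoldenFeatures_prediction",
    "101_LightGBM_GoldenFeatures_prediction",
    "103_LightGBM_GoldenFeatures_prediction",
    "105_Xgboost_prediction",
    "108_CatBoost_prediction",
]
missing = [
    "100_LightGBM_GoldenFeatures_prediction_0_for_category1_1_for_category2",
    "101_LightGBM_GoldenFeatures_prediction_0_for_category1_1_for_category2",
    "103_LightGBM_GoldenFeatures_prediction_0_for_category1_1_for_category2",
    "105_Xgboost_prediction_0_for_category1_1_for_category2",
    "108_CatBoost_prediction_0_for_category1_1_for_category2",
]


def extra_suffixes(unseen_names, missing_names):
    """Collect whatever trails an unseen name inside a matching missing name.

    Hypothetical helper for illustration only.
    """
    suffixes = set()
    for short in unseen_names:
        for long in missing_names:
            if long.startswith(short):
                suffixes.add(long[len(short):])
    return suffixes


print(extra_suffixes(unseen, missing))
```

If a single suffix comes back, the two sides differ only in how the prediction columns are named, not in which models contributed them.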

@pplonski
Contributor

Hi @yiannis-gkoufas,

I understand that you were able to train ML models with AutoML, and that the problem occurs only when computing predictions. Could you please provide the code that you are using to compute predictions?

@yiannis-gkoufas
Author

Hi @pplonski!

I use the same constructor for AutoML and pass a dataframe.

automl = AutoML(
    results_path=str(model_directory),
    mode="Compete",
    total_time_limit=600 * 600,
    golden_features=True,
    features_selection=True,
    ml_task="binary_classification",
)

Could it be an issue with the ensemble model?
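
One way to check which model a saved run selected is to read the "best_model" key from params.json in the results directory, as mentioned earlier in the thread. A small sketch follows. The helper names `read_best_model` and `is_stacked` are hypothetical, and everything about the file beyond the "best_model" key is an assumption.

```python
import json
from pathlib import Path


def read_best_model(results_path):
    """Return the "best_model" entry from params.json, or None if the key is absent.

    Hypothetical helper; assumes params.json sits directly in results_path.
    """
    params_file = Path(results_path) / "params.json"
    with params_file.open(encoding="utf-8") as handle:
        params = json.load(handle)
    return params.get("best_model")


def is_stacked(model_name):
    """Heuristic only: treat a model name as stacked when it contains 'Stacked'."""
    return model_name is not None and "Stacked" in model_name
```

Calling `read_best_model("./model")` on the results directory from this thread would be expected to return "Ensemble_Stacked", which matches the models named in the error.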

@pplonski
Contributor

Thank you for the response, @yiannis-gkoufas. It looks like a bug in computing predictions for the Stacked Ensemble. Could you share the full code and data needed to reproduce the issue?

@yiannis-gkoufas
Author

This code:

from sklearn.model_selection import train_test_split
from supervised import AutoML
import pandas as pd

if __name__ == '__main__':
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
        skipinitialspace=True,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        df[df.columns[:-1]], df["income"], test_size=0.25
    )

    automl = AutoML(results_path="./model",
                    mode="Compete",
                    total_time_limit=600 * 600,
                    golden_features=True,
                    features_selection=True,
                    ml_task="binary_classification")
    automl.fit(X_train, y_train)

    predictions = automl.predict(X_test)
    print(predictions)

reproduced the issue for me, because the stacked ensemble is selected as the best model.
It takes a while to run, of course. The message I got:

Traceback (most recent call last):
  File "/Users/prezi/Code/mljar_issue/mljar_issue/main.py", line 23, in <module>
    predictions = automl.predict(X_test)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/supervised/automl.py", line 451, in predict
    return self._predict(X)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/supervised/base_automl.py", line 1503, in _predict
    predictions = self._base_predict(X)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/supervised/base_automl.py", line 1465, in _base_predict
    predictions = model.predict(X, X_stacked)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/supervised/ensemble.py", line 434, in predict
    y_predicted_from_model = model.predict(X_stacked)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/supervised/model_framework.py", line 448, in predict
    y_p = learner.predict(X_data)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/supervised/algorithms/sklearn.py", line 66, in predict
    return self.model.predict_proba(X)[:, 1]
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 947, in predict_proba
    X = self._validate_X_predict(X)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/sklearn/ensemble/_forest.py", line 641, in _validate_X_predict
    X = self._validate_data(
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/sklearn/base.py", line 608, in _validate_data
    self._check_feature_names(X, reset=reset)
  File "/Users/prezi/Library/Caches/pypoetry/virtualenvs/mljar-issue-kQcsGfQC-py3.10/lib/python3.10/site-packages/sklearn/base.py", line 535, in _check_feature_names
    raise ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- 100_NearestNeighbors_prediction
- 101_NearestNeighbors_prediction
- 102_Xgboost_BoostOnErrors_prediction
- 102_Xgboost_prediction
- 103_Xgboost_prediction
- ...
Feature names seen at fit time, yet now missing:
- 100_NearestNeighbors_prediction_0_for_<=50K_1_for_>50K
- 101_NearestNeighbors_prediction_0_for_<=50K_1_for_>50K
- 102_Xgboost_BoostOnErrors_prediction_0_for_<=50K_1_for_>50K
- 102_Xgboost_prediction_0_for_<=50K_1_for_>50K
- 103_Xgboost_prediction_0_for_<=50K_1_for_>50K
- ...
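
The last frames of the traceback come from scikit-learn's feature-name validation, which compares the column names seen during fit with those passed at prediction time. Below is a simplified pure-Python approximation of that comparison, written to show why both lists appear in the message. It is not scikit-learn's actual code, the function name `check_feature_names` is made up for illustration, and the two column names are taken from the output above.

```python
def check_feature_names(fitted_names, given_names, max_shown=5):
    """Raise ValueError when the names given at predict time differ from fit time.

    Simplified approximation for illustration; not scikit-learn's implementation.
    """
    if list(fitted_names) == list(given_names):
        return

    unseen = sorted(set(given_names) - set(fitted_names))
    missing = sorted(set(fitted_names) - set(given_names))

    def bullet_list(names):
        # Show at most max_shown names, then an ellipsis entry.
        lines = [f"- {name}" for name in names[:max_shown]]
        if len(names) > max_shown:
            lines.append("- ...")
        return "\n".join(lines)

    message = "The feature names should match those that were passed during fit."
    if unseen:
        message += "\nFeature names unseen at fit time:\n" + bullet_list(unseen)
    if missing:
        message += (
            "\nFeature names seen at fit time, yet now missing:\n"
            + bullet_list(missing)
        )
    if not unseen and not missing:
        message += "\nFeature names must be in the same order as they were in fit."
    raise ValueError(message)


# Column names as they appear on each side of the error above.
suffix = "_0_for_<=50K_1_for_>50K"
given = ["100_NearestNeighbors_prediction", "102_Xgboost_prediction"]
fitted = [name + suffix for name in given]

try:
    check_feature_names(fitted, given)
except ValueError as err:
    print(err)
```

Because the suffixed and unsuffixed names share no exact matches, every stacked prediction column ends up in both lists, once as unseen and once as missing.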

@tuomassiren

I have the same issue.

@minari1505

I have the same issue.
The same error occurs on Linux and macOS (Windows has not been tested).
It still occurs when I change the dataset or the model used.

@pplonski
Contributor

It is related to #719. @Marchlak please take a look at this issue.
