Replace spaces with underscores in column names also for the predict function #689
Closed
GoldenGoldy started this conversation in Ideas
Replies: 1 comment
I think this is a bug. I transferred to #690.
I found that when the data passed to the .fit function has spaces in its column names, PySR replaces the spaces with underscores and prints a warning about this. You can then proceed with fitting the data as normal.
The .predict function, however, does not make the same replacement of spaces with underscores when it is called later.
So, if we have a fitted model and pass data to the .predict function in the same format that we used for the .fit function, we can run into the following issue:
The predict function (in sr.py) contains the line "X = X.reindex(columns=self.feature_names_in_)". If the column names contain spaces, this produces NaN values: reindex tries to match the column names (with spaces) against the feature names of the model, in which the spaces were replaced by underscores, and fills every column it cannot match with NaN.
We then get the somewhat confusing message "ValueError: Input X contains NaN.", which suggests that the data contains NaN values when it does not. They are only introduced by the reindex, which cannot match the column names.
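The mismatch can be reproduced with pandas alone, without fitting a model. The following is only a minimal sketch: the column names are made up, and `feature_names_in` is a stand-in for the fitted model's `feature_names_in_`, assuming the names were stored with spaces replaced by underscores as described above.

```python
import pandas as pd

# Made-up data with a space in one column name, in the same format
# that would be passed to both .fit and .predict.
X = pd.DataFrame({"feature one": [1.0, 2.0], "x2": [3.0, 4.0]})

# Stand-in for the model's stored feature names (assumption: spaces
# were replaced by underscores during fitting).
feature_names_in = [name.replace(" ", "_") for name in X.columns]

# Same kind of call as the line quoted from predict.
X_reindexed = X.reindex(columns=feature_names_in)

# The original data has no NaN values; the reindexed data does.
has_nan_before = bool(X.isna().any().any())
has_nan_after = bool(X_reindexed.isna().any().any())

# Columns that could not be matched and were filled entirely with NaN.
nan_columns = [c for c in X_reindexed.columns if X_reindexed[c].isna().all()]

print(has_nan_before, has_nan_after, nan_columns)
# prints: False True ['feature_one']
```

The column "x2" survives because its name is unchanged, while "feature one" has no counterpart among the underscored names and comes back as an all-NaN column named "feature_one".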
All of this can of course be avoided once you are aware of the problem, by not using spaces in the column names from the beginning. However, it might be more consistent, and give a better user experience, if the .predict function also replaced spaces in the column names with underscores.
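Until then, one possible user-side workaround is to apply the same renaming yourself before calling .predict. This is a sketch only: `underscore_columns` is a hypothetical helper, not part of PySR, and it assumes that replacing spaces with underscores is the only change made to the column names during fitting.

```python
import pandas as pd


def underscore_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with spaces in column names replaced by underscores.

    Hypothetical helper for illustration; not part of PySR.
    """
    return X.rename(columns=lambda name: str(name).replace(" ", "_"))


# Made-up data in the same format that was used for .fit.
X_new = pd.DataFrame({"feature one": [1.0, 2.0], "x2": [3.0, 4.0]})
X_clean = underscore_columns(X_new)

# X_clean can now be passed to the fitted model, for example:
# model.predict(X_clean)  # `model` is an assumed, already fitted PySRRegressor
```

Because rename returns a new DataFrame, the original data keeps its column names and only the copy passed to .predict is changed.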