Rework TSDataset.train_test_split to pass all features to train and test parts #545
🚀 Deployed on https://deploy-preview-545--etna-docs.netlify.app
I suppose, the current failing tests are connected to Explanation:
Possible solutions
This problem could be related to #440.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #545 +/- ##
==========================================
- Coverage 90.42% 90.40% -0.02%
==========================================
Files 262 262
Lines 18234 18244 +10
==========================================
+ Hits 16488 16494 +6
- Misses 1746 1750 +4
☔ View full report in Codecov by Sentry.
etna/datasets/tsdataset.py
Outdated
test._regressors = deepcopy(self.regressors)
test._target_components_names = deepcopy(self.target_components_names)
test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
train = deepcopy(self)
I think we should come up with a more economical solution in terms of resource usage. Currently we might occupy up to 3x the memory of the original dataset. We are also unnecessarily copying df_exog when a reference to the original could be used. We could try to utilize pandas copy/view semantics instead of explicitly copying dataframes.
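A minimal sketch of the slice-then-copy idea, assuming a plain pandas DataFrame as a stand-in for the dataset's df. The frame, the dates and the `split_once` helper are illustrative and are not ETNA code. Each part is copied once from a slice, instead of deep-copying the whole object and slicing afterwards.

```python
import numpy as np
import pandas as pd


def split_once(df, train_end, test_start):
    """Hypothetical helper: slice first, then copy each part exactly once."""
    train = df.loc[:train_end].copy()
    test = df.loc[test_start:].copy()
    return train, test


# Stand-in for the dataset's main frame: 1000 daily rows, 4 float columns.
index = pd.date_range("2020-01-01", periods=1000, freq="D")
df = pd.DataFrame(np.zeros((1000, 4)), index=index, columns=list("abcd"))

train, test = split_once(df, train_end="2021-12-31", test_start="2022-01-01")

# Together the two parts hold as many rows as the original frame,
# so the extra memory is roughly one more copy of the data, not two.
total_rows = len(train) + len(test)

# Writing into a part does not touch the original frame.
train.iloc[0, 0] = 1.0
original_value = df.iloc[0, 0]
```

Frames that are not modified after the split, such as df_exog in the comment above, could be shared by plain assignment instead of being copied.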
I added a new version of the code, but it should be discussed: there are some possible problems.
etna/datasets/tsdataset.py
Outdated
test._regressors = deepcopy(self.regressors)
test._target_components_names = deepcopy(self.target_components_names)
test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
# TODO: there is a risk that some methods with inplace=True dropping of columns will affect train and test dataframes,
We should discuss this part.
Shouldn't such operations modify only df, which is copied?
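The risk named in the TODO can be illustrated with plain pandas, independent of ETNA. This is a sketch with hypothetical names (`exog`, `train_exog`, `test_exog`): two holders of the same DataFrame object both see an in-place column drop, while a copied frame does not.

```python
import pandas as pd

exog = pd.DataFrame({"regressor_1": [1, 2, 3], "regressor_2": [4, 5, 6]})

# Two hypothetical holders that keep a reference to the same object.
train_exog = exog
test_exog = exog

# An in-place drop through one reference is visible through the other.
train_exog.drop(columns=["regressor_2"], inplace=True)
shared_columns = list(test_exog.columns)

# A copied frame is unaffected by later in-place changes to the source.
df_copy = exog.copy()
exog.drop(columns=["regressor_1"], inplace=True)
copied_columns = list(df_copy.columns)
```

Whether this matters for the split depends on which frames are shared by reference and which are copied, which is the question raised in this thread.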
etna/datasets/tsdataset.py
Outdated
test = deepcopy(self)
test.df = self_df.loc[test_start_defined:test_end_defined]
test.raw_df = self_raw_df.loc[train_start_defined:test_end_defined]
# we do this to optimize memory consumption
This line could be moved to 1231.
I don't think that it is necessary to repeat this comment
assert sorted(train.target_components_names) == sorted(ts_with_target_components.target_components_names)
assert sorted(test.target_components_names) == sorted(ts_with_target_components.target_components_names)
assert set(train_target_components.columns.get_level_values("feature")) == set(train.target_components_names)
assert set(test_target_components.columns.get_level_values("feature")) == set(test.target_components_names)


def test_train_test_split_pass_prediction_intervals_to_output(ts_with_prediction_intervals):
It would be nice to add a similar test for hierarchy, testing that structure is copied and current levels are preserved. Seems like we lack this test in general.
test._target_components_names = deepcopy(self.target_components_names)
test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
self_df = self.df
self_raw_df = self.raw_df
I think we shouldn't do the same for df_exog, because deepcopy(self) should handle it properly.
etna/datasets/tsdataset.py
Outdated
# we want to make sure it makes only one copy
train_df = self_df.loc[train_start_defined:train_end_defined]
if train_df._is_view:
On my machine this kind of slice gives _is_view, but I don't really know if it is always the case for all OS/pandas versions.
Let's do
if train_df._is_view or train_df._is_copy is not None
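A hedged sketch of how the suggested condition could be wrapped, using plain pandas. The helper name `slice_with_single_copy` is hypothetical. `_is_view` and `_is_copy` are private pandas attributes, so they are read with `getattr` and a default here, on the assumption that they may be missing or may behave differently in some pandas versions.

```python
import numpy as np
import pandas as pd


def slice_with_single_copy(df, start, end):
    """Hypothetical helper: copy the slice only if pandas reports it may share data."""
    part = df.loc[start:end]
    # Private attributes: fall back to "not a view" if they are absent.
    is_view = getattr(part, "_is_view", False)
    has_copy_ref = getattr(part, "_is_copy", None) is not None
    if is_view or has_copy_ref:
        part = part.copy()
    return part


# Single-dtype stand-in frame with a daily index.
index = pd.date_range("2021-01-01", periods=10, freq="D")
df = pd.DataFrame(np.zeros((10, 2)), index=index, columns=["a", "b"])

train_df = slice_with_single_copy(df, "2021-01-01", "2021-01-05")

# Writing into the returned part should leave the source frame untouched.
train_df.iloc[0, 0] = -1.0
source_value = df.iloc[0, 0]
```

Because these flags are implementation details, an unconditional `.copy()` on the slice is a simpler alternative when the possible cost of an extra copy is acceptable.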
etna/datasets/tsdataset.py
Outdated
# we want to make sure it makes only one copy
train_df = self_df.loc[train_start_defined:train_end_defined]
if train_df._is_view:
I'm not sure if we should use the private _is_view attribute. Tests will show if it is available on all pandas versions.
Before submitting (must do checklist)
Proposed Changes
Rework TSDataset.train_test_split to pass all features to train and test parts
Closing issues
Closes #272.