Rework `TSDataset.train_test_split` to pass all features to train and test parts #545

d-a-bunin · 2024-12-25T12:26:47Z

Before submitting (must do checklist)

Did you read the contribution guide?
Did you update the docs? We use Numpy format for all the methods and classes.
Did you write any new necessary tests?
Did you update the CHANGELOG?

Proposed Changes

Rework TSDataset.train_test_split to pass all features to train and test parts

Closing issues

Closes #272.

etna/datasets/tsdataset.py

tests/test_datasets/test_dataset.py

github-actions · 2024-12-25T12:31:00Z

🚀 Deployed on https://deploy-preview-545--etna-docs.netlify.app

d-a-bunin · 2024-12-25T14:29:31Z

I suppose the current failing tests are connected to the following.

Explanation:

Inside Pipeline.fit we don't fully revert the dataset to its initial state and keep some of the columns. As a result, after fit there can be additional columns, such as date flags.
Inside backtest we call train_test_split. Previously, train_test_split kept only the columns from raw_df, so it removed the columns created in Pipeline.fit. Now this no longer happens, and these columns are kept in the train and test parts.
We apply the transforms to the train part, which creates duplicate date-flag columns.
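
The sequence above can be reproduced outside etna with plain pandas. This is only a sketch under assumptions: `add_date_flags` is a hypothetical stand-in for a date-flag transform, and attaching new columns by concatenation is assumed for illustration, not taken from etna's code.

```python
import pandas as pd


def add_date_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a transform that appends a date-flag column."""
    flags = pd.DataFrame({"day_of_week": df.index.dayofweek.to_numpy()}, index=df.index)
    return pd.concat([df, flags], axis=1)


index = pd.date_range("2024-01-01", periods=6, freq="D")
raw = pd.DataFrame({"target": range(6)}, index=index)

# "fit" leaves the generated column in the dataset instead of reverting it
after_fit = add_date_flags(raw)

# a split that keeps all columns passes the leftover flag to the train part
train = after_fit.loc[: index[3]]

# applying the same transform to the train part duplicates the column
train_transformed = add_date_flags(train)
duplicated = train_transformed.columns[train_transformed.columns.duplicated()].tolist()
```

A split that kept only the `raw` columns would drop `day_of_week` before the second application, which is why the duplication did not show up with the previous logic.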

Possible solutions

Change the logic of Pipeline.fit to make a copy before applying transforms.
- It guarantees that we don't break anything inside transforms.
- It increases memory consumption.
Revert TSDataset.train_test_split to the current logic.
- The current logic is contradictory, because it sets _regressors, target_components_names and _prediction_intervals_names to columns that don't exist in the resulting train and test datasets.
Rework train_test_split to give "empty" datasets with only the columns from raw_df.
- In that case we shouldn't set _regressors, target_components_names and _prediction_intervals_names.
- It is strange that we take columns from raw_df but values from df; it could lead to strange results.
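
A rough illustration of the first option (copy before applying transforms), written against a toy wrapper rather than etna itself. `ToyDataset`, `fit_on_copy` and `add_flag` are hypothetical names introduced only for this sketch.

```python
from copy import deepcopy

import pandas as pd


class ToyDataset:
    """Hypothetical minimal dataset wrapper; not etna's TSDataset."""

    def __init__(self, df: pd.DataFrame):
        self.df = df


def add_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform that adds a weekend flag column."""
    return df.assign(is_weekend=df.index.dayofweek >= 5)


def fit_on_copy(ts: ToyDataset, transforms) -> ToyDataset:
    """Apply transforms to a deep copy so the caller's dataset stays unchanged."""
    working = deepcopy(ts)
    for transform in transforms:
        working.df = transform(working.df)
    return working


index = pd.date_range("2024-01-01", periods=7, freq="D")
ts = ToyDataset(pd.DataFrame({"target": range(7)}, index=index))
fitted = fit_on_copy(ts, [add_flag])
```

The original `ts` keeps its initial columns, at the cost of holding a second copy of the data while the transformed version is alive.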

This problem could be related to #440.

etna/datasets/tsdataset.py

codecov · 2024-12-26T16:06:29Z

Codecov Report

Attention: Patch coverage is 87.87879% with 4 lines in your changes missing coverage. Please review.

Project coverage is 90.40%. Comparing base (d192f84) to head (49e022d).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
etna/datasets/tsdataset.py	84.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
- Coverage   90.42%   90.40%   -0.02%     
==========================================
  Files         262      262              
  Lines       18234    18244      +10     
==========================================
+ Hits        16488    16494       +6     
- Misses       1746     1750       +4


brsnw250 · 2024-12-27T13:05:40Z

etna/datasets/tsdataset.py

-        test._regressors = deepcopy(self.regressors)
-        test._target_components_names = deepcopy(self.target_components_names)
-        test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
+        train = deepcopy(self)


I think we should come up with a more economical solution in terms of resource usage. Currently we might occupy up to 3x the memory of the original dataset. We are also unnecessarily copying df_exog when a reference to the original could be used. We could try to use pandas copy/view semantics instead of explicitly copying dataframes.

I added a new version of the code, but it should be discussed: there are some possible problems.
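
To make the memory argument concrete, here is a small pandas-only sketch (not the code from this PR) that slices first and copies each part once, so the two parts together hold about as much data as the original frame. The frame and the boundary names are made up for the example.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"target": np.arange(10.0)}, index=index)

train_end = index[6]
test_start = index[7]

# slice first, then copy: each part owns only its own rows
train_df = df.loc[:train_end].copy()
test_df = df.loc[test_start:].copy()

# compare the data size of the original with the two parts combined
full_bytes = df.memory_usage(index=False, deep=True).sum()
parts_bytes = (
    train_df.memory_usage(index=False, deep=True).sum()
    + test_df.memory_usage(index=False, deep=True).sum()
)

# the parts are independent of the original frame
train_df.iloc[0, 0] = -1.0
```

By contrast, deep-copying the whole object once per part and slicing afterwards would temporarily hold the full data for each part in addition to the original.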

tests/test_datasets/test_dataset.py

d-a-bunin · 2024-12-27T15:17:33Z

etna/datasets/tsdataset.py

-        test._regressors = deepcopy(self.regressors)
-        test._target_components_names = deepcopy(self.target_components_names)
-        test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
+        # TODO: there is a risk that some methods with inplace=True dropping of columns will affect train and test dataframes,


We should discuss this part.

Shouldn't such operations modify only df, which is copied?
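
Whether an in-place drop leaks depends on whether the two datasets hold the same DataFrame object or separate copies. The following pandas-only sketch shows both cases; the variable names are hypothetical and it does not model TSDataset itself.

```python
import pandas as pd

original_exog = pd.DataFrame({"price": [1.0, 2.0], "promo": [0, 1]})

# same object: no extra memory, but mutations are visible through both names
train_exog_ref = original_exog

# separate object: extra memory, but isolated from in-place changes
test_exog_copy = original_exog.copy()

# an in-place drop through the shared reference changes the original as well
train_exog_ref.drop(columns=["promo"], inplace=True)

original_columns = list(original_exog.columns)
copy_columns = list(test_exog_copy.columns)
```

So the risk in the TODO applies to any frame that is shared by reference; frames that were copied are not affected.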

d-a-bunin · 2024-12-27T15:18:00Z

etna/datasets/tsdataset.py

+            test = deepcopy(self)
+            test.df = self_df.loc[test_start_defined:test_end_defined]
+            test.raw_df = self_raw_df.loc[train_start_defined:test_end_defined]
+            # we do this to optimize memory consumption


This line could be moved to 1231.

I don't think that it is necessary to repeat this comment

brsnw250 · 2025-01-10T08:29:52Z

etna/datasets/tsdataset.py

+            test = deepcopy(self)
+            test.df = self_df.loc[test_start_defined:test_end_defined]
+            test.raw_df = self_raw_df.loc[train_start_defined:test_end_defined]
+            # we do this to optimize memory consumption


I don't think that it is necessary to repeat this comment

brsnw250 · 2025-01-10T08:33:11Z

etna/datasets/tsdataset.py

-        test._regressors = deepcopy(self.regressors)
-        test._target_components_names = deepcopy(self.target_components_names)
-        test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
+        # TODO: there is a risk that some methods with inplace=True dropping of columns will affect train and test dataframes,


Shouldn't such operations modify only df, which is copied?

brsnw250 · 2025-01-10T08:43:26Z

tests/test_datasets/test_dataset.py

    assert sorted(train.target_components_names) == sorted(ts_with_target_components.target_components_names)
    assert sorted(test.target_components_names) == sorted(ts_with_target_components.target_components_names)
+    assert set(train_target_components.columns.get_level_values("feature")) == set(train.target_components_names)
+    assert set(test_target_components.columns.get_level_values("feature")) == set(test.target_components_names)


 def test_train_test_split_pass_prediction_intervals_to_output(ts_with_prediction_intervals):


It would be nice to add a similar test for hierarchy, testing that the structure is copied and the current levels are preserved. It seems we lack this test in general.

d-a-bunin · 2025-01-10T12:47:42Z

etna/datasets/tsdataset.py

-        test._target_components_names = deepcopy(self.target_components_names)
-        test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
+        self_df = self.df
+        self_raw_df = self.raw_df


I think we shouldn't do the same for df_exog, because deepcopy(self) should handle it properly.

d-a-bunin · 2025-01-10T12:48:39Z

etna/datasets/tsdataset.py

+
+            # we want to make sure it makes only one copy
+            train_df = self_df.loc[train_start_defined:train_end_defined]
+            if train_df._is_view:


On my machine this kind of slice gives _is_view, but I don't really know if it is always the case for all OS/pandas versions.

Let's do

if train_df._is_view or train_df._is_copy is not None

d-a-bunin · 2025-01-10T12:50:08Z

etna/datasets/tsdataset.py

+
+            # we want to make sure it makes only one copy
+            train_df = self_df.loc[train_start_defined:train_end_defined]
+            if train_df._is_view:


I'm not sure if we should use the private _is_view. Tests will show whether it is available on all pandas versions.
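
One possible alternative that avoids private attributes is to compare the underlying buffers with `numpy.shares_memory`. This is a sketch, not what the PR does: `shares_memory_with` is a hypothetical helper, and it assumes NumPy-backed columns, for which `to_numpy()` does not need to copy.

```python
import numpy as np
import pandas as pd


def shares_memory_with(child: pd.DataFrame, parent: pd.DataFrame) -> bool:
    """Return True if any common column of `child` and `parent` shares a buffer.

    Assumes NumPy-backed dtypes; other dtypes may be copied by `to_numpy()`,
    in which case sharing would go undetected.
    """
    return any(
        np.shares_memory(child[col].to_numpy(), parent[col].to_numpy())
        for col in child.columns
        if col in parent.columns
    )


index = pd.date_range("2024-01-01", periods=5, freq="D")
df = pd.DataFrame({"target": np.arange(5.0)}, index=index)

explicit_copy = df.copy()
sliced = df.loc[index[1] : index[3]]

copy_shares = shares_memory_with(explicit_copy, df)
# the result for a slice depends on the pandas version and copy-on-write settings
slice_shares = shares_memory_with(sliced, df)
```

An explicit copy never shares memory with its source, while the result for a `.loc` slice is intentionally left unasserted here because it can differ between pandas versions.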

fix: rework train_test_split and tests for it

13b375b

d-a-bunin self-assigned this Dec 25, 2024