Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework TSDataset.train_test_split to pass all features to train and test parts #545

Merged
merged 10 commits into from
Jan 13, 2025

Conversation

d-a-bunin
Copy link
Collaborator

@d-a-bunin d-a-bunin commented Dec 25, 2024

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

Rework TSDataset.train_test_split to pass all features to train and test parts

Closing issues

Closes #272.

@d-a-bunin d-a-bunin self-assigned this Dec 25, 2024
etna/datasets/tsdataset.py Outdated Show resolved Hide resolved
etna/datasets/tsdataset.py Outdated Show resolved Hide resolved
etna/datasets/tsdataset.py Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Dec 25, 2024

🚀 Deployed on https://deploy-preview-545--etna-docs.netlify.app

@github-actions github-actions bot temporarily deployed to pull request December 25, 2024 12:31 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 25, 2024 13:04 Inactive
@d-a-bunin
Copy link
Collaborator Author

I suppose, the current failing tests are connected to Explanation:

  • Inside of Pipeline.fit we don't revert the dataset fully to the initial state and keep some of the columns. As a result, after fit there could be additional columns like date flags
  • Inside of backtest we are making train_test_split. Previously, train_test_split kept only columns from raw_df, and it removed columns created in Pipeline.fit, now it isn't happening and these columns are kept in train and test parts.
  • We apply transforms on train part and it creates duplicate columns with date flags

Possible solutions

  • Change logic of Pipeline.fit and just make a copy before applying transforms.
    • It guarantees that we don't break anything inside transforms
    • It increases memory consumption
  • Revert TSDataset.train_test_split to current logic
    • Current logic is contradictory, because it puts _regressors, target_components_names and _prediction_intervals_names columns that don't exist in result train and test datasets
  • Rework train_test_split to give "empty" datasets only with columns from raw_df
    • In that case we shouldn't set _regressors, target_components_names and _prediction_intervals_names
    • It is strange that we take columns from raw_df, but values from df, it could lead to strange results

This problem could be related to #440.

@d-a-bunin d-a-bunin requested a review from brsnw250 December 26, 2024 07:40
etna/datasets/tsdataset.py Outdated Show resolved Hide resolved
@github-actions github-actions bot temporarily deployed to pull request December 26, 2024 15:31 Inactive
Copy link

codecov bot commented Dec 26, 2024

Codecov Report

Attention: Patch coverage is 87.87879% with 4 lines in your changes missing coverage. Please review.

Project coverage is 90.40%. Comparing base (d192f84) to head (49e022d).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
etna/datasets/tsdataset.py 84.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
- Coverage   90.42%   90.40%   -0.02%     
==========================================
  Files         262      262              
  Lines       18234    18244      +10     
==========================================
+ Hits        16488    16494       +6     
- Misses       1746     1750       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot temporarily deployed to pull request December 27, 2024 07:36 Inactive
test._regressors = deepcopy(self.regressors)
test._target_components_names = deepcopy(self.target_components_names)
test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
train = deepcopy(self)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should come up with a more economical solution in terms of resource usage. Currently here we might occupy up to 3x of memory compared to the original size. Also here we are unnecessarily copying df_exog  when reference to the original could be used. Here we could try to utilize pandas copy/view semantics instead of explicitly copying dataframes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added new version of code, but it should be discussed: there are some possible problems.

tests/test_datasets/test_dataset.py Show resolved Hide resolved
tests/test_datasets/test_dataset.py Show resolved Hide resolved
@github-actions github-actions bot temporarily deployed to pull request December 27, 2024 15:12 Inactive
test._regressors = deepcopy(self.regressors)
test._target_components_names = deepcopy(self.target_components_names)
test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
# TODO: there is a risk that some methods with inplace=True dropping of columns will affect train and test dataframes,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss this part.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't such operations modify only df, which is copied ?

test = deepcopy(self)
test.df = self_df.loc[test_start_defined:test_end_defined]
test.raw_df = self_raw_df.loc[train_start_defined:test_end_defined]
# we do this to optimize memory consumption
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line could be moved to 1231.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it is necessary to repeat this comment

test = deepcopy(self)
test.df = self_df.loc[test_start_defined:test_end_defined]
test.raw_df = self_raw_df.loc[train_start_defined:test_end_defined]
# we do this to optimize memory consumption
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it is necessary to repeat this comment

test._regressors = deepcopy(self.regressors)
test._target_components_names = deepcopy(self.target_components_names)
test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
# TODO: there is a risk that some methods with inplace=True dropping of columns will affect train and test dataframes,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't such operations modify only df, which is copied ?

assert sorted(train.target_components_names) == sorted(ts_with_target_components.target_components_names)
assert sorted(test.target_components_names) == sorted(ts_with_target_components.target_components_names)
assert set(train_target_components.columns.get_level_values("feature")) == set(train.target_components_names)
assert set(test_target_components.columns.get_level_values("feature")) == set(test.target_components_names)


def test_train_test_split_pass_prediction_intervals_to_output(ts_with_prediction_intervals):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a similar test for hierarchy, testing that structure is copied and current levels are preserved. Seems like we lack this test in general.

@github-actions github-actions bot temporarily deployed to pull request January 10, 2025 12:42 Inactive
@d-a-bunin d-a-bunin requested a review from brsnw250 January 10, 2025 12:47
test._target_components_names = deepcopy(self.target_components_names)
test._prediction_intervals_names = deepcopy(self._prediction_intervals_names)
self_df = self.df
self_raw_df = self.raw_df
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't do the same for df_exog, because deepcopy(self) should handle it properly.


# we want to make sure it makes only one copy
train_df = self_df.loc[train_start_defined:train_end_defined]
if train_df._is_view:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine this kind of slice gives _is_view, but I don't really know if it always the case for all OS/pandas versions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do

if train_df._is_view or train_df._is_copy is not None


# we want to make sure it makes only one copy
train_df = self_df.loc[train_start_defined:train_end_defined]
if train_df._is_view:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should use private is_view. Tests will show if it is available on all pandas versions.

@github-actions github-actions bot temporarily deployed to pull request January 13, 2025 09:10 Inactive
@github-actions github-actions bot temporarily deployed to pull request January 13, 2025 10:39 Inactive
@d-a-bunin d-a-bunin merged commit a9fcedf into master Jan 13, 2025
17 checks passed
@d-a-bunin d-a-bunin deleted the issue-272 branch January 13, 2025 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Inconsistent state after TSDataset.train_test_split
2 participants