Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend STLDecomposer to Support Multiseries #4253

Merged
merged 57 commits into from
Aug 31, 2023

Conversation

remyogasawara
Copy link
Collaborator

Resolves #4244

@codecov
Copy link

codecov bot commented Jul 31, 2023

Codecov Report

Patch coverage: 100.0% and project coverage change: +0.1% 🎉

Comparison is base (90033c5) 99.7% compared to head (3cc6cf3) 99.7%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4253     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        355     355             
  Lines      39155   39458    +303     
=======================================
+ Hits       39035   39338    +303     
  Misses       120     120             
Files Changed Coverage Δ
...omponents/transformers/preprocessing/decomposer.py 99.4% <100.0%> (+0.1%) ⬆️
...nents/transformers/preprocessing/stl_decomposer.py 100.0% <100.0%> (ø)
...omponent_tests/decomposer_tests/test_decomposer.py 100.0% <100.0%> (ø)
...sts/decomposer_tests/test_polynomial_decomposer.py 100.0% <100.0%> (ø)
...nent_tests/decomposer_tests/test_stl_decomposer.py 100.0% <100.0%> (ø)
evalml/tests/conftest.py 98.4% <100.0%> (+0.1%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

eccabay and others added 3 commits August 2, 2023 16:32
* Add unstacking function

* Add stacking function

* Add tests for both functions
* Squashed changes

* Ignored index

* Disabled column checking

* Reverted deleted code

* Updated pyproject.toml

* Replaced version check code
@remyogasawara remyogasawara force-pushed the 4244_extend_stldecomp_for_multiseries branch from 1faf23d to 4a8cc0f Compare August 2, 2023 23:37
Copy link
Contributor

@christopherbunn christopherbunn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some smaller nits. Only reviewed implemetation so far but will review tests later.

@remyogasawara remyogasawara requested a review from eccabay August 18, 2023 21:30
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making awesome progress! I left a few more comments, mostly just condensing code

series_y = y[id]

# Determine the period of the seasonal component
if id not in self.periods or self.period is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very small use case, but users should be able to pass in self.periods when initializing the decomposer, rather than necessarily defining it here (our detection is decent, but if a user knows what the periods should be, they should be able to pass that in for better results)

And unless I'm missing something, it doesn't look like we actually use self.period anywhere any more, because even in the single series case, we're still indexing into self.periods. Am I correct? If so, I think we should just swap out one for the other entirely.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I should have taken self.period out. I think I kept it to keep the functionality of set_period the same, but is the function used in the PolynomialDecomposer at all because I see that it is tested on both decomposers in test_decomposer_set_period. Also would the periods parameter be same for the single series case or would it be fine for period to be an integer for single series and then take in a dictionary for multiseries?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_period is tested on both decomposers because it's implemented in the parent class, since we have to test its implementation on one subclass, why not test it on both? Feel free to bastardize whatever preexisting functions you need to - I could totally see the case for eliminating set_period entirely and just calling _determine_periodicity directly, or having set_period calculate the periods for all series and save them in self.periods instead of self.period. It's up to you, IMO

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to allow users to still pass in a period for both multivariate and univariate cases, I was thinking of keeping both self.period and self.periods as parameters, but that probably isn't the most efficient implementation so I'm open to other suggestions 😅

evalml/tests/conftest.py Outdated Show resolved Hide resolved
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few final comments, but nothing blocking. Awesome work!

Comment on lines 218 to 222
if self.period is None and len(y.columns) == 1:
self.period = period
self.update_parameters({"period": self.period})
elif self.period is not None and len(y.columns) == 1:
period = self.period
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it's ok if we call self.update_parameters an extra time - we can collapse these into a single if len(y.columns)==1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since everything is accessing the period through self.periods I'm thinking we might not even need to update self.period at all, instead just use it to set period if its given by a user

y = y.to_frame()
series_results = {}
# Iterate through each series id
for id in y.columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so late in the process to realize this, but do we actually need to be calling _decompose_target in a loop? Since the decomposer handles multiseries generally, it should be able to handle multiseries here too, we can pass self.periods through instead of period=period, and switch around the logic to return the data in the format we need? That will prevent us from calling Decomposer.fit() too many times, which can be slow

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified this so that get_trend_dataframe returns a dictionary for multiseries, but for single series it is still returning a list of dataframes. In the PolynomialDecomposer, get_trend_dataframe returns a list of dataframes. I think this is because there is some multivariate implementation written here. Since they're both using plot_decomposition for single series, I wanted to keep the indexing the same so I left it as a list for STLDecomposer for now (even though it can be modified later to just return a dataframe). This is pretty inconsistent so I wanted to see what others think about it and if I should consider changing the return type of get_trend_dataframe for the multiseries and/or single series STLDecomposer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated get_trend_dataframe() in the STLDecomposer to return list(pd.DataFrame) for single series and dict(list(pd.DataFrame)) for multiseries, but it should be updated in the future to no longer be in a list #4294

Copy link
Contributor

@christopherbunn christopherbunn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the test case Becca mentioned is covered!

@remyogasawara remyogasawara merged commit 69344b2 into main Aug 31, 2023
@remyogasawara remyogasawara deleted the 4244_extend_stldecomp_for_multiseries branch August 31, 2023 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend STLDecomposer to support Multiseries
4 participants