-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Edge case where a resampled column was too-long-by-one #365
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -448,12 +448,31 @@ def resample(self, sampling_rate, inplace=False, kind='linear'): | |
self.index = self._build_entity_index(self.run_info, sampling_rate) | ||
|
||
x = np.arange(n) | ||
num = int(np.ceil(n * sampling_rate / old_sr)) | ||
# In Index we trust! | ||
num = len(self.index) | ||
# _build_entity_index computation above does it per run, | ||
# which possibly provides a more stable solution and yadadada | ||
num_computed = int(np.ceil(n * sampling_rate / old_sr)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want to change this to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would say that with any resampling it should be |
||
num_diff = abs(num - num_computed) | ||
if num_diff > 1: | ||
raise RuntimeError( | ||
"Our internal assumptions about resampling are too faulty " | ||
"to continue.") | ||
elif num_diff: | ||
assert num_diff == 1 # the only way to get here | ||
effigies marked this conversation as resolved.
Show resolved
Hide resolved
|
||
import warnings | ||
warnings.warn( | ||
"Probably due to multiple runs we ran into a slight " | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This appears with a single run when TR=2.000001. I don't think this is a great error message. It's not clear that it's a warning that will do anything but cause confusion. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we added it because there were no other way to inform user (e.g. logging) and we see how silence could be problematic to read down the road ;) again, with aforementioned suggested fix by just relying on |
||
"divergence between obtained number of samples in the index (computed " | ||
"per each run) and overall estimate down/up-sampled samples. " | ||
"But that is ok") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we looked at the code more, and our verdict is that we just need to get rid of num_computed here altogether and use the length of the index, estimated (by going through runs and using their actual duration which is not a subject of up/down sampling effects of "ints"). The reason is -- the But also what is important probably is that |
||
|
||
|
||
from scipy.interpolate import interp1d | ||
f = interp1d(x, self.values.values.ravel(), kind=kind) | ||
x_new = np.linspace(0, n - 1, num=num) | ||
self.values = pd.DataFrame(f(x_new)) | ||
assert len(self.values) == len(self.index) | ||
|
||
self.sampling_rate = sampling_rate | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that this will hide other potential edge cases where there's a bigger mismatch... I suggest checking for a discrepancy and issuing a warning either if any is found, or if a large one is found (like you did in one of the other PRs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that was an earlier version of this PR. I'm curious what kind of edge case you're talking about?
As I understand it, we have a run duration, which is the original TR * nvols.
For a given sampling rate, we find the nearest n such that
n / sampling_rate == duration
. So this line was a way of taking fromduration ~= n_old / sr_old ~= n_new / sr_new
, and calculatingn_new
fromn_old
,sr_old
, andsr_new
. Index instead calculates directly fromduration
andsr_new
. So the discrepancy we were finding is when you get an off-by-one between the two ways of calculatingn_new
.So it's not really clear where another discrepancy can arise.