ENH: Improve convolution performance for Sparse variables #411
Conversation
Just had an urge to whine, I guess ;-) Will check in more detail when I'm home with a laptop.
```python
        source=var.source, sampling_rate=var.sampling_rate)
    if isinstance(var, SparseRunVariable):
        return SparseRunVariable(
            name=var.name, values=convolved[0], onset=onsets,
```
Unrelated to the PR, but unchecked assumptions like `[0]` are a recipe for trouble or a cryptic error. I believe we addressed some of these already, but it would be nice if new code came with checks/proper errors, or at least assert statements.
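For instance, a minimal sketch of the kind of check being suggested, using the PR's variable names (the `(values, names)` return shape of `compute_regressor` is an assumption here):

```python
# Sketch only: make the "[0] is the values array" assumption explicit,
# so a changed return type fails loudly instead of cryptically downstream.
convolved = hrf.compute_regressor(vals, model, onsets,
                                  fir_delays=fir_delays, min_onset=0)
if not (isinstance(convolved, tuple) and len(convolved) == 2):
    raise ValueError("Expected a (values, names) tuple from "
                     "compute_regressor, got %r" % type(convolved))
values, names = convolved
```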
You mean `convolved[0]`? I blame someone else! Maybe if it returned a dictionary?

Ah, and please teach me how to get such a neat profile figure.
Codecov Report
@@ Coverage Diff @@
## master #411 +/- ##
==========================================
- Coverage 62.3% 62.25% -0.06%
==========================================
Files 27 27
Lines 4555 4554 -1
Branches 1173 1173
==========================================
- Hits 2838 2835 -3
- Misses 1433 1434 +1
- Partials 284 285 +1
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #411 +/- ##
==========================================
+ Coverage 62.3% 62.39% +0.08%
==========================================
Files 27 27
Lines 4555 4560 +5
Branches 1173 1173
==========================================
+ Hits 2838 2845 +7
+ Misses 1433 1432 -1
+ Partials 284 283 -1
Continue to review full report at Codecov.
A few more profiling/timing results. I timed how long it would take to run the convolution and return the result at various sampling rates; at 1hz it took 1.2s. Now the question is whether it's reasonable to downsample to TR (or a small factor above that).
I think if we want to be smart about it, for no information loss to occur, you'd want to set the right combination of oversampling and sampling rate. I tested out using adaptive oversampling, set from the minimum distance between events and the sampling rate of the frame times. In my testing, this seemed to work quite well, even when returning events at 10hz (which is what the previous behavior was). This also took only about 1 second to run.
Cool, glad that works. But we should probably use the minimum event duration rather than the distance between events (or the minimum of both). In the naturalistic context, these will generally coincide (e.g., for uniformly sampled measurements, duration will generally match the distance to the next sample), but in many other contexts you can have widely spaced but very short events.
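For concreteness, a sketch of the adaptive scheme combining both quantities (the diff below ends up computing essentially this; the function wrapper and example values are mine):

```python
import numpy as np

def adaptive_oversampling(onsets, durations, sampling_rate):
    """Oversample enough to resolve the finest structure in the variable."""
    # The smaller of the minimum inter-onset gap and the minimum duration
    # bounds the period of the highest-frequency feature in the signal.
    min_interval = min(np.ediff1d(np.sort(onsets)).min(), durations.min())
    return np.ceil(1.0 / (min_interval * sampling_rate))

# Widely spaced but very short events still force fine sampling:
onsets = np.array([0.0, 10.0, 20.0])
durations = np.array([0.1, 0.1, 0.1])
print(adaptive_oversampling(onsets, durations, sampling_rate=1.0))  # 10.0
```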
I think this makes sense... I definitely think scaling the oversampling inversely with the actual sampling rate is sensible.
This is going to make refactoring #376 pretty awful, but that's my fault for not getting that merged first.
This looks almost ready. Is there a hold-up I'm not aware of that's being signaled by [WIP]?
Co-Authored-By: adelavega <[email protected]>
Ready to merge as soon as tests pass. Oddly, I can't edit the title...

Yeah, GitHub's UI seems to be glitchy. I was able to edit it in a tab I had open from earlier.

Reviewing this now.
Two minor comments, otherwise LGTM.
```
@@ -49,11 +54,18 @@ def _transform(self, var, model='spm', derivative=False, dispersion=False,
    elif model != 'fir':
        raise ValueError("Model must be one of 'spm', 'glover', or 'fir'.")

    convolved = hrf.compute_regressor(vals, model, onsets,
                                      fir_delays=fir_delays, min_onset=0)
    min_interval = min(np.ediff1d(np.sort(var.onset)).min(),
```
Maybe add a comment here for posterity explaining `min_interval`; it will reduce the maintenance burden a year or two down the line.
```python
                                      fir_delays=fir_delays, min_onset=0)
    # min_interval: the shortest gap between onsets or the shortest event
    # duration, i.e. the period of the finest structure in the variable.
    min_interval = min(np.ediff1d(np.sort(var.onset)).min(),
                       var.duration.min())
    # Oversample the frame times enough to resolve that interval.
    oversampling = np.ceil(1 / (min_interval * sampling_rate))
```
We should probably add a test to make sure that `oversampling` is computed properly from the input parameters (which might require mocking `compute_regressor`).
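A hedged sketch of what such a test might look like; the patch target, fixture, and invocation below are hypothetical placeholders, and only the mock-and-inspect pattern is the point:

```python
import numpy as np
from unittest import mock  # on Python 2, the backported `mock` package instead

# Hypothetical dotted path: patch compute_regressor where the Convolve
# transformation actually imports it.
TARGET = 'bids.analysis.transformations.compute.hrf.compute_regressor'

def test_adaptive_oversampling(sparse_collection):  # hypothetical fixture
    with mock.patch(TARGET) as cr:
        # Dummy (values, names) return so the transformation can finish.
        cr.return_value = (np.zeros((100, 1)), ['cond'])
        run_convolve(sparse_collection, 'trial_type')  # hypothetical helper
        _, kwargs = cr.call_args
        # Events >= 1 s apart with durations >= 1 s at a 1 Hz sampling
        # rate should need no extra oversampling.
        assert kwargs['oversampling'] == 1.0
```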
Okay @tyarkoni, I added tests with dense and sparse variables, and mocked `compute_regressor`. Two notes:
Ugh, go away Python 2. I guess we need to add the `mock` library.
Codecov Report
@@ Coverage Diff @@
## master #411 +/- ##
==========================================
+ Coverage 62.3% 62.37% +0.07%
==========================================
Files 27 27
Lines 4555 4564 +9
Branches 1173 1174 +1
==========================================
+ Hits 2838 2847 +9
Misses 1433 1433
Partials 284 284
Continue to review full report at Codecov.
It's worth experimenting with the convolution code to make sure small oversampling values don't do anything wonky. That issue aside, we should probably also always double the value you're currently using. The minimum of event durations and onset deltas seems like a reasonable approximation of the period of the highest-frequency signal in the timeseries, but we want to make sure we're above the Nyquist rate (i.e., 2 * the highest frequency). This probably won't make a big difference in most cases.
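In code, the suggested tweak to the formula from the diff would be roughly (sketch with example values):

```python
import numpy as np

min_interval = 0.5   # shortest inter-onset gap or event duration, in seconds
sampling_rate = 1.0  # frame-time sampling rate, in Hz

# min_interval approximates the period of the highest-frequency signal;
# staying above the Nyquist rate means at least two samples per period,
# hence the extra factor of 2 on the existing formula.
oversampling = 2 * np.ceil(1.0 / (min_interval * sampling_rate))
print(oversampling)  # 4.0
```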
In my short look at it, it didn't seem to make a difference, since oversampling is essentially already done by requesting high-frequency frame times, but it's worth testing in more detail and keeping an eye on it. I'm fine with doubling the oversampling rate.
Turns out this line is (potentially) inaccurate given a float duration, or at least it's potentially inconsistent with how it's computed elsewhere. I think #361 deserves its own fix (throw an error if index and values don't match).
Anybody want to give this a final review? If not, I will merge soon, as it's already been reviewed and seems to be working well for me.
LGTM
```python
    sampling_rate = self.collection.sampling_rate
    dur = var.get_duration()
    resample_frames = np.linspace(
        0, dur, int(math.ceil(dur * sampling_rate)), endpoint=False)
```
I'm wondering if we should center the sampled frames within the timeseries... e.g., suppose we have a 2-second dense timeseries, and we want to downsample to 2 Hz. Currently we would sample at 0, 0.5, 1, and 1.5. Probably we should do 0.25, 0.75, 1.25, and 1.75. Let's not change anything here, because if we were going to do this, we'd need to do it throughout the codebase for consistency. Mostly just making a mental note.
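For concreteness, the two grids from that example (sketch only, since the comment explicitly proposes not changing the code yet):

```python
import math
import numpy as np

dur, sampling_rate = 2.0, 2.0
n = int(math.ceil(dur * sampling_rate))

# Current behavior: left-aligned sample times.
left = np.linspace(0, dur, n, endpoint=False)    # [0.   0.5  1.   1.5 ]

# Proposed: shift by half a sampling interval so each sample sits in the
# middle of the window it represents.
centered = left + 1.0 / (2 * sampling_rate)      # [0.25 0.75 1.25 1.75]
```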
Fixes #354 and is related to #356.
Profiling indicates that the slow function is actually `np.convolve` itself. It calls `np.correlate`, which takes dramatically longer as the variable grows in length (see numpy/numpy#1858).
The good news is that by not densifying prior to convolution, things go much faster. For example, a predictor with ~850 events takes about 15ms as sparse, but 1.03s when upsampled to 10hz; 50hz takes that up to about a minute (which is what I profiled).
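An illustrative way to see that scaling (timings vary by machine; the point is that upsampling grows both the signal and the HRF kernel in samples, so direct convolution cost grows roughly quadratically with the sampling rate):

```python
import timeit
import numpy as np

for rate in (1, 10, 50):                  # effective sampling rates, in Hz
    signal = np.random.rand(450 * rate)   # e.g., a 450 s run
    kernel = np.random.rand(32 * rate)    # e.g., a 32 s HRF kernel
    t = timeit.timeit(lambda: np.convolve(signal, kernel), number=5)
    print("%2d Hz: %.3f s" % (rate, t))
```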
The question that remains is how to downsample at the end. As it is, it will use the original onsets as the `frame_times`; that is, it will resample only at those onsets. Does that make sense? Or would uniform resampling at the TR (or some factor above that) be better? Maybe we can even do 10hz resampling, although presumably this should be the final step in `Transformations`, and TR should be OK.
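To make the two options concrete (sketch; `frame_times` here is the argument eventually handed to `compute_regressor`, and the numbers are examples):

```python
import numpy as np

onsets = np.array([0.0, 3.5, 9.2, 17.0])  # example event onsets, in seconds
tr, run_length = 2.0, 20.0                # example TR and run length

# Option 1 (current behavior): evaluate the convolved regressor only at
# the original event onsets.
frame_times_sparse = onsets

# Option 2: uniform resampling on a TR-spaced grid.
frame_times_uniform = np.arange(0.0, run_length, tr)
```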