-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove mean/var/min/max #223
base: master
Are you sure you want to change the base?
Conversation
Seems potentially quite breaking, as I imagine that some packages rely on being able to access the precomputed summary statistics for continuous terms. |
Right, it should be in the |
Nope. GLM or MixedModels might but I don't know for certain, which is why I said "potentially." Perhaps nothing is using it and it would be safe to remove, but we should just be aware of the breakage potential. |
I have some internal code that uses this. It might not be strictly necessary, but ideally I'd like to make sure there's some kind of mechanism for custom term types to request summary stats to be extracted from the data table at schema creation time. That's currently only possible via the hints Dict which is fine for things where you've manually specified the special handling but wouldn't work for things like splines implemented as functions in the formula (although those don't work very well with the current system either since ideally you need more than just mean/var/min/max for that). It's good to know that the schema is a performance bottleneck though. |
Actually, another strategy that is actually used in #183 is to replace |
Would it be possible to consider this for the next release? Even if this not the default, there should be some way to avoid the computation of these summary statistics (esp. min/max) as computing them takes as much time as doing a regression. using StatsModels
N = 10_000_000
df = (y = rand(N), x1 = rand(N), x2 = rand(N))
f = @formula(y~x1+x2)
@time StatsModels.schema(f, df)
# 0.164134 seconds (52 allocations: 3.297 KiB)
@time [df.x1 df.x2] \ df.y
# 0.128481 seconds (36 allocations: 381.536 MiB, 6.13% gc time) |
Unfortunately not, it would create breakage for users of at least one downstream package (StandardizedPredictors, I think) without an alternative mechanism for "requesting" that data invariants be computed (which is not gonna happen on the timeline we've set out for 1.0). I'd really like for 0.7/1.0 to be a "snapshot" that users can rely on without breakage vs. 0.6 as much as possible (package maintainers will have to deal w/ some breakage but we know what we're in for more or less 😂 )
One way to avoid the overhead that #183 quietly added support for is that Another thing that MIGHT be possible is to provide a |
Sure, I understand. |
Hm actually when I look at StandardizedPredictors, I think it's basically only the run-time constructors that use that information the |
Bump. This slows down simple regression and it would be better to change it before other packages start relying on it. |
solves #222