Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make InterceptTerms generate Bools to avoid unnecessary promotion of other columns #294

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

kleinschmidt
Copy link
Member

@kleinschmidt kleinschmidt commented May 26, 2023

This is one possible fix for #293. As described in that thread, the bottleneck here is actually that InterceptTerms generate Array{Float64}s, which then get hcat with other columns and promote them to Float64. With this change, normal julia promotion rules apply, and values will only be promoted as needed.

First adding tests to confirm that this behavior is currently broken, then will push teh fix and open for review when CI is green.

I will add that this is potentially breaking: users may end up seeing eltypes that are not even floats come out of modelcols if none of the columns in their input tables are floats. I think a lot of this can be remediated by having downstream dependents make sure the y, x arrays returned by modelcols(::FormulaTerm) are actually float arrays with the appropriate eltype...

@kleinschmidt kleinschmidt marked this pull request as ready for review May 26, 2023 21:58
@kleinschmidt kleinschmidt requested review from nalimilan and palday May 26, 2023 21:58
@palday
Copy link
Member

palday commented May 26, 2023

I'm generally onboard with this, but we should probably do something like a pkgeval to see if we would actually break anything. But we're also not so near the bottom of the dependency tree as StatsBase so a formally breaking release wouldn't be as bad as watching e.g. the StatsBase 0.34 changes propagate.

Comment on lines 420 to 421
contr = DummyCoding(; base=4, levels=1:4)
d = (; y=rand(Float32, 4), x=rand(Float32, 4), z=[1:4; ])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't yet used below, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is used but it's redundant with the definition above...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait the contr part? yeah that's right.

@nalimilan
Copy link
Member

+1

We could tag 1.0, as it was the plan already (removing TableRegressionModel): #271.

Comment on lines +552 to +553
modelcols(t::InterceptTerm{true}, d::NamedTuple) = ones(Bool, size(first(d)))
modelcols(t::InterceptTerm{false}, d) = Matrix{Bool}(undef, size(first(d),1), 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to generate BitArrays here? I guess it doesn't matter much, but it would save some memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd kinda hoped to be able to just use a scalar but hcat doesn't work with that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants