Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow for large datasets #5

Open
tueda opened this issue Dec 5, 2023 · 2 comments
Open

Slow for large datasets #5

tueda opened this issue Dec 5, 2023 · 2 comments

Comments

@tueda
Copy link

tueda commented Dec 5, 2023

Thank you for creating a nice package. It is very handy to install such a functionality with pip.

However, I find this package to be rather slow for large datasets in comparison with the LinearRegDiagnostic class described in the Linear regression diagnostics example of statsmodels. This may indicate some inefficiency of the package.

Example:

df = sm.datasets.get_rdataset("ames", "openintro").data
res = smf.ols("np.log10(price) ~ Q('Overall.Qual') + np.log(area)", df).fit()

lmdiag

%%time
lmdiag.plot(res)
CPU times: user 15.1 s, sys: 215 ms, total: 15.3 s
Wall time: 16.1 s

LinearRegDiagnostic

%%time
LinearRegDiagnostic(res)()
CPU times: user 2.17 s, sys: 125 ms, total: 2.29 s
Wall time: 2.17 s
@dynobo
Copy link
Owner

dynobo commented Dec 11, 2023

Hi @tueda , thanks for your feedback!

Interesting to see LinearRegDiagnostic. Never seen it before, it certainly wasn't there when I implemented lmdiag 4 years ago.

Looking quickly over their code, they do some calculations smarter than I did!

To be honest, looking at lmdiag now, with 4 years more experience, I think its code quality is really poor. E.g. with the same calculations being repeated for the various plots etc...

Unfortunately, I don't have the sparetime for a rewrite (which would be best). I see some low effort optimizations, which could give ~20% speed-up, and which I might be able to do, but I'm not sure if it's worth it:

I wonder if it makes more sense to archive lmdiag and redirect people to the faster and more feature-rich LinearRegDiagnostic instead?

May I ask, why you would prefer to use lmdiag instead of LinearRegDiagnostic?

(It's a shame, that they implemented this only as an "example" (w/o license?) and not integrated it in statsmodels...)

@tueda
Copy link
Author

tueda commented Dec 16, 2023

Interesting to see LinearRegDiagnostic. Never seen it before, it certainly wasn't there when I implemented lmdiag 4 years ago.

Indeed, the example notebook has appeared in v0.14.0, which is rather new.

Unfortunately, I don't have the sparetime for a rewrite (which would be best). I see some low effort optimizations, which could give ~20% speed-up, and which I might be able to do, but I'm not sure if it's worth it:

I understand that you are busy and that it is not easy to devote time to maintaining and improving the library. This is not urgent.

May I ask, why you would prefer to use lmdiag instead of LinearRegDiagnostic?

Personally, I prefer the simple syntax lmdiag.plot(res) to LinearRegDiagnostic(res)(). Another important advantage of lmdiag is that it can be easily installed by pip. It is very handy for use with Google Colab etc.: much better than copying long code from the example notebook.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants