Conversation

ian-r-rose
Contributor

This adds a new test that measures shuffle performance for each of the string dtype options: "object", "string[python]", and "string[pyarrow]". In my initial testing, the pyarrow string dtype was significantly slower (!), though I haven't had the chance to chase down exactly what is going on there. It could be time spent converting dtypes, performance issues with hashing or serialization, or something else entirely. Something to fix, I suppose!
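
One way to start narrowing down the possible causes listed above is to time a single step in isolation. The following is a minimal pandas-only sketch, not the test added in this PR: it builds the same in-memory data under each string dtype and times a hash-based partitioning step as a rough stand-in for part of what a shuffle does. The helper names (make_frame, hash_partition, time_partitioning), the column names, and the row and partition counts are all hypothetical, and "string[pyarrow]" is skipped when it is not available locally.

```python
import time

import numpy as np
import pandas as pd

STRING_DTYPES = ["object", "string[python]", "string[pyarrow]"]


def make_frame(n_rows, dtype, seed=0):
    """Build an in-memory frame whose "key" column uses the given string dtype."""
    rng = np.random.default_rng(seed)
    keys = [str(k) for k in rng.integers(0, 1_000, size=n_rows)]
    return pd.DataFrame(
        {"key": pd.Series(keys, dtype=dtype), "value": rng.random(n_rows)}
    )


def hash_partition(df, column, n_partitions):
    """Split df into n_partitions pieces by hashing one column (a shuffle proxy)."""
    hashes = pd.util.hash_pandas_object(df[column], index=False)
    part_ids = (hashes % n_partitions).to_numpy()
    return [df[part_ids == i] for i in range(n_partitions)]


def time_partitioning(n_rows=100_000, n_partitions=8):
    """Return {dtype: seconds} for each string dtype available locally."""
    timings = {}
    for dtype in STRING_DTYPES:
        try:
            df = make_frame(n_rows, dtype)
        except (ImportError, TypeError):
            # Assumption: the dtype is unavailable (pyarrow missing or pandas too old).
            continue
        start = time.perf_counter()
        hash_partition(df, "key", n_partitions)
        timings[dtype] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for dtype, seconds in time_partitioning().items():
        print(f"{dtype}: {seconds:.4f} s")
```

Hashing is only one of the candidates mentioned above. Dtype conversion and serialization would each need their own measurement, and a single-process timing like this says nothing about distributed overhead.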

jrbourbeau self-assigned this Nov 15, 2022
@ncclementi
Contributor

@jrbourbeau I thought I could help here a bit. I suspect we should see all green now, between the skipif < 2022.10.1 and merging main. I did not review the rest of the test, though (the technical details and design).
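
For context, a version-gated skip of the kind mentioned here usually compares the installed version against a minimum numerically rather than as strings. The sketch below is an illustration only: the helper names are hypothetical, and which package the 2022.10.1 threshold applies to is not stated in this thread.

```python
import re


def release_tuple(version):
    """Turn the leading numeric part of a version string into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release found in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def older_than(installed, minimum):
    """True when installed is an older release than minimum."""
    return release_tuple(installed) < release_tuple(minimum)


# Plain string comparison gets calendar versions wrong, because "9" sorts after "1".
assert ("2022.9.0" < "2022.10.1") is False
assert older_than("2022.9.0", "2022.10.1") is True
```

With pytest, a boolean like this could be passed as the condition to pytest.mark.skipif together with a reason string.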

@hayesgb
Contributor

hayesgb commented Nov 29, 2022

I'm curious about the motivation for this test. Seems like it makes more sense to specify a string dtype for benchmarking and monitor behavior there. Thoughts?

@ncclementi
Contributor

> I'm curious about the motivation for this test. Seems like it makes more sense to specify a string dtype for benchmarking and monitor behavior there. Thoughts?

I'm not sure I follow what you mean by specifying a string dtype for benchmarking. Do you mean in the h2o benchmarks? If so, I think you'd want to avoid the S3 read so that the comparison isolates the effect of the dtype, which is what this test does.

4 participants