Add test for shuffling on different string dtypes #495

ian-r-rose · 2022-11-04T20:33:06Z

This adds a new test measuring the performance of shuffling based on the different options for string dtypes: "object", "string[python]", and "string[pyarrow]". In my initial testing, the pyarrow string dtype was significantly slower (!), though I haven't had the chance to chase down exactly what is going on there (possibly time spent converting dtypes, possibly performance issues with hashing or serialization, possibly something else entirely). Something to fix, I suppose!
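For readers who want to poke at the dtype difference locally, here is a minimal sketch that times only a hash-partitioning step (one ingredient of a shuffle) in plain pandas for each string dtype. It is not the test added in this PR: the function names, row counts, and partition count are all hypothetical, and it does not involve a cluster, serialization, or network transfer, so it can at best hint at where time goes.

```python
import time

import pandas as pd


def available_string_dtypes():
    """Return the string dtypes that can be constructed in this environment."""
    dtypes = ["object", "string[python]"]
    try:
        # "string[pyarrow]" needs a sufficiently recent pyarrow installed.
        pd.Series(["a"]).astype("string[pyarrow]")
        dtypes.append("string[pyarrow]")
    except ImportError:
        pass
    return dtypes


def time_hash_partitioning(dtype, n_rows=10_000, n_partitions=8):
    """Time hashing a string key column and splitting rows into partitions.

    Hypothetical helper for illustration; sizes are arbitrary assumptions.
    """
    keys = pd.Series([f"key-{i % 1000}" for i in range(n_rows)]).astype(dtype)
    df = pd.DataFrame({"key": keys, "value": range(n_rows)})

    start = time.perf_counter()
    # Assign each row to an output partition by hashing its string key.
    hashes = pd.util.hash_pandas_object(df["key"], index=False)
    partition = (hashes % n_partitions).to_numpy()
    parts = [group for _, group in df.groupby(partition)]
    elapsed = time.perf_counter() - start
    return elapsed, parts


if __name__ == "__main__":
    for dtype in available_string_dtypes():
        elapsed, parts = time_hash_partitioning(dtype)
        print(f"{dtype}: {elapsed:.4f}s across {len(parts)} partitions")
```

Timings from a sketch like this will vary by machine and library version, so they are best read as relative comparisons between dtypes rather than as benchmark results.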

ncclementi · 2022-11-16T22:06:32Z

@jrbourbeau I thought I could help here a bit. I suspect we should see all green now, between the skipif for versions < 2022.10.1 and merging main. I did not check the rest of the test, though, such as the technical details and design.
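As a rough illustration of the version gate mentioned above, the check behind such a skipif could look like the sketch below. The helper names are hypothetical, and the comment does not say which package's version is being compared, so the installed version is passed in as a plain string here.

```python
import re

# Minimum version named in the comment above; which package it refers to
# is not stated there, so this sketch stays package-agnostic.
MIN_VERSION = "2022.10.1"


def release_tuple(version):
    """Extract the leading numeric release, e.g. '2022.10.1+5.gabc' -> (2022, 10, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release found in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def should_skip(installed, minimum=MIN_VERSION):
    """Return True when the installed version predates the minimum."""
    return release_tuple(installed) < release_tuple(minimum)
```

In a pytest suite, the boolean from a check like `should_skip(...)` would typically be passed as the condition to `pytest.mark.skipif`, along with a `reason` string.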

hayesgb · 2022-11-29T15:35:44Z

I'm curious about the motivation for this test. Seems like it makes more sense to specify a string dtype for benchmarking and monitor behavior there. Thoughts?

ncclementi · 2022-12-02T18:34:56Z

> I'm curious about the motivation for this test. Seems like it makes more sense to specify a string dtype for benchmarking and monitor behavior there. Thoughts?

I'm not sure I follow what you mean by specifying a string dtype for benchmarking. Do you mean in the h2o benchmarks? If so, I think you'd want to avoid the S3 reads so that the study isolates only the dtype, which is what this test does.

Add test for shuffling on different string dtypes

fe6a8a0

jrbourbeau self-assigned this Nov 15, 2022

ncclementi added 2 commits November 16, 2022 14:43

pandas dtypes not supported in <2022.10.1

94d13e3

Merge branch 'main' into shuffle-string-dtypes

b50caba
