
Convert automatically to arrow strings #86

Merged: 7 commits from arrow_strings into main, Jul 25, 2024
Conversation

@phofl (Contributor) commented Jul 24, 2024:

closes #85

@jrbourbeau (Member) left a comment:

Thanks @phofl

dask_bigquery/core.py (outdated review thread, resolved)

Comment on lines +241 to +245:

    arrow_options_meta = arrow_options.copy()
    if pyarrow_strings_enabled():
        types_mapper = _get_types_mapper(arrow_options.get("types_mapper", {}.get))
        if types_mapper is not None:
            arrow_options_meta["types_mapper"] = types_mapper
@jrbourbeau (Member):
We have this twice, once here and once in bigquery_read. Thoughts on keeping this here, passing arrow_options (with the correct types_mapper) through to dd.from_map below? That way we could drop the convert_string= parameter.

@phofl (Contributor, author):

I thought about this too.

I'd like to avoid serialising that stuff in the graph; just passing a flag seems a lot easier.
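The trade-off being weighed: a boolean flag serializes to a few bytes with stdlib pickle, whereas callables are pickled by reference at best and often fail outright (lambdas, for instance), which is why shipping them through the task graph is unattractive. A small stdlib-only illustration:

```python
import pickle

# A boolean flag costs only a few bytes on the wire.
flag = pickle.dumps(True)

# A lambda cannot be pickled by the stdlib at all; dask works around this
# with cloudpickle, but embedded callables still bloat and complicate
# the serialized graph compared with a plain flag.
try:
    pickle.dumps(lambda t: None)
    lambda_pickled = True
except Exception:
    lambda_pickled = False

print(len(flag), lambda_pickled)
```

This is only meant to make the size/fragility argument concrete, not to describe dask's actual serialization path.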

@jrbourbeau (Member):
Ah, I see. Are the types_mapper entries we're adding particularly large?

@phofl (Contributor, author):
It adds a few callables to the graph, which doesn't seem like a good idea (I didn't do any profiling, though).

@@ -387,6 +388,21 @@ def test_arrow_options(table):
    assert ddf.dtypes["name"] == pd.StringDtype(storage="pyarrow")


@pytest.mark.parametrize("convert_string", [True, False])
def test_convert_string(table, convert_string):
@jrbourbeau (Member):

Overall this looks nice. It'd be good to include an assert_eq check too to make sure the values are as expected, not just the name column dtype.
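The point of such a check is that a dtype assertion alone would not catch mangled values. In plain pandas the analogous tool is `assert_frame_equal`, which compares values and dtypes together (dask's `assert_eq` plays the same role for Dask collections); a minimal sketch with made-up data:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

result = pd.DataFrame({"name": pd.array(["a", "b"], dtype="string")})
expected = pd.DataFrame({"name": ["a", "b"]}).astype({"name": "string"})

# Passes only if both the values and the dtypes match, so a silent
# value-mangling or object -> string regression would be caught here.
assert_frame_equal(result, expected)
print("ok")
```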

@phofl (Contributor, author):

We have this in a few other places already, and I can't run the tests locally, so I don't want to mess around with things too much.

@jrbourbeau (Member):

No problem. Just pushed a small commit to update this (and one other) test.

@phofl (Contributor, author):

Thx

phofl and others added 2 commits on July 25, 2024 at 17:06.

@jrbourbeau merged commit 1215a51 into main on Jul 25, 2024, with 13 checks passed, and deleted the arrow_strings branch on July 25, 2024 at 17:56.
Development

Successfully merging this pull request may close these issues:
read_gbq casts string columns to objects (#85)

2 participants