Drop Python 3.8, add Python 3.11 and 3.12, and use dd.from_map
#81
Conversation
Okay, so I've revived this PR and also expanded its scope a bit. This PR:

- Drops Python 3.8
- Adds Python 3.11 and 3.12
- Fixes the `pre-commit` CI build
- Gets tests passing
- Switches from using `DataFrameIOLayer` to the more modern `dd.from_map`
```python
@pytest.fixture(scope="module")
def project_id():
    project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID")
    if not project_id:
        _, project_id = google.auth.default()

    yield project_id


@pytest.fixture
def google_creds():
    if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        credentials = json.load(open(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")))
    elif os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"):
        credentials = json.loads(os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"))
    else:
        credentials, _ = google.auth.default()

    yield credentials
```
These changes keep the current testing setup behavior (i.e. support for the `DASK_BIGQUERY_GCP_CREDENTIALS`, `DASK_BIGQUERY_PROJECT_ID`, and `GOOGLE_APPLICATION_CREDENTIALS` environment variables), but now fall back to Google's default auth if those aren't set (a better experience IMO).

This made it more straightforward for me to run things locally, while also not rocking the boat too much with our current CI setup.
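For reference, a minimal sketch of what that fallback resolves to when none of the environment variables are set (this assumes the google-auth package is installed and Application Default Credentials are available, e.g. via `gcloud auth application-default login`):

```python
import google.auth

# Resolve Application Default Credentials, the same fallback the
# fixtures above use when no environment variables are set.
credentials, project_id = google.auth.default()
print(project_id)  # the project inferred from the ambient credentials
```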
```python
df = pd.DataFrame(records)
df["timestamp"] = df["timestamp"].astype("datetime64[us, UTC]")
yield df
```
I've added a `us` cast here. Previously this had `ns` time resolution. From what I can tell, BigQuery only stores timestamps up to `us` resolution (see the docs https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp_type and Stack Overflow https://stackoverflow.com/a/44307611). Without this cast, `assert_eq` starts raising due to timestamp resolution mismatches.

I'll admit I'm a bit stumped here. Clearly this used to work in the past somehow.
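For context, a minimal sketch of the resolution mismatch (assuming pandas 2.x, which added support for non-nanosecond resolutions):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-01-01T00:00:00Z"]))
print(s.dtype)  # datetime64[ns, UTC] -- pandas' default resolution
print(s.astype("datetime64[us, UTC]").dtype)  # datetime64[us, UTC] -- matches what BigQuery stores
```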
@tswast maybe you have a sense for any recent changes, or if I'm just wrong here
I started hitting similar issues at some point when I bumped pandas versions. Could you be running a different version here than in the past?
We noticed some weirdness around pandas 2.0 where we started getting microsecond precision back. BigQuery itself hasn't changed AFAIK. We should always respond with `us` precision in the Arrow we return. I think it's just what pandas does with that now that's changed.
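For illustration, roughly what that round trip looks like (hedged: the exact behavior depends on the pyarrow/pandas versions in play; recent pyarrow preserves the Arrow resolution under pandas 2.x instead of coercing to nanoseconds):

```python
import pyarrow as pa

# BigQuery responds with microsecond-precision timestamps in Arrow.
tbl = pa.table({"timestamp": pa.array([0], type=pa.timestamp("us", tz="UTC"))})

# Under pandas 2.x with recent pyarrow the resolution is preserved;
# older versions coerced this to nanoseconds.
print(tbl.to_pandas()["timestamp"].dtype)  # datetime64[us, UTC]
```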
Yeah, pandas started supporting non-nanosecond resolution with 2.0.
Hmm okay, thanks all for clarifying. I'm inclined to just go with the small test change here. We can always handle things in a follow-up PR as needed.
```diff
 def test_read_gbq(df, table, client):
     project_id, dataset_id, table_id = table
     ddf = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)

-    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
+    assert list(df.columns) == list(ddf.columns)
```
Just generalizing a bit to make things more resilient to changing column names in the test DataFrame.
```diff
@@ -181,7 +179,7 @@ def write_existing_dataset(google_creds):
     [
         ("name", pa.string()),
         ("number", pa.uint8()),
-        ("timestamp", pa.timestamp("ns")),
+        ("timestamp", pa.timestamp("us")),
```
Corresponding change given the `us` resolution.
```yaml
- pip:
    - git+https://github.com/dask/dask
```
This is because we need dask/dask#11233 for some tests to pass. A little unfortunate. Maybe we can change tests to avoid the timezone issue.
We can also remove it again after the release tomorrow, so it shouldn't be an issue.
```diff
-        output_name,
-        meta.columns,
-        [stream.name for stream in session.streams],
+    return dd.from_map(
```
Could you pass the label into `from_map` to make the task prefix more descriptive?
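Roughly what that looks like (a sketch, not the exact code in this PR; `_read_rows_from_stream` is a hypothetical per-stream reader standing in for the real one):

```python
import dask.dataframe as dd

ddf = dd.from_map(
    _read_rows_from_stream,                       # hypothetical per-stream reader
    [stream.name for stream in session.streams],  # one partition per BigQuery Storage read stream
    meta=meta,
    label="read-gbq",  # task names become read-gbq-<token> instead of the generic from_map-<token>
)
```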
Done
Let's see if CI is happy here
xref #80