
Drop Python 3.8, add Python 3.11 and 3.12, and use dd.from_map #81

Merged
jrbourbeau merged 8 commits into main from drop-py38 on Jul 18, 2024

Conversation

jrbourbeau (Member)

Let's see if CI is happy here

xref #80

jrbourbeau changed the title from "Drop Python 3.8 and add Python 3.11" to "Drop Python 3.8, add Python 3.11 and 3.12, and use dd.from_map" on Jul 17, 2024
jrbourbeau (Member Author) left a comment

Okay, so I've revived this PR and also expanded its scope a bit. This PR:

  • Drops Python 3.8
  • Adds Python 3.11 and 3.12
  • Fixes pre-commit CI build
  • Gets tests passing
  • Switches from using DataFrameIOLayer to the more modern dd.from_map (see the sketch below)
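
For context, here is a minimal, hedged sketch of the dd.from_map pattern this PR moves to. The read_stream helper and stream_names list are hypothetical stand-ins for the BigQuery read-session plumbing, not the actual dask-bigquery code:

import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-ins: each "stream" becomes one pandas DataFrame,
# i.e. one output partition of the resulting Dask DataFrame.
stream_names = ["stream-0", "stream-1"]

def read_stream(stream_name):
    # In dask-bigquery this would pull rows from a BigQuery Storage read
    # stream; here it just builds a toy frame.
    return pd.DataFrame({"name": [stream_name], "number": [1]})

# Empty frame describing the output schema, so Dask doesn't have to guess.
meta = pd.DataFrame({"name": pd.Series(dtype="object"),
                     "number": pd.Series(dtype="int64")})

# dd.from_map maps read_stream over the inputs, one partition per element.
ddf = dd.from_map(read_stream, stream_names, meta=meta)
print(ddf.compute())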

Comment on lines +112 to +130
@pytest.fixture(scope="module")
def project_id():
    project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID")
    if not project_id:
        _, project_id = google.auth.default()

    yield project_id


@pytest.fixture
def google_creds():
    env_creds_file = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if env_creds_file:
        credentials = json.load(open(env_creds_file))
    elif os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"):
        credentials = json.loads(os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"))
    else:
        credentials, _ = google.auth.default()

    yield credentials

jrbourbeau (Member Author)

These changes keep the current testing setup behavior (i.e. they still support the DASK_BIGQUERY_GCP_CREDENTIALS, DASK_BIGQUERY_PROJECT_ID, and GOOGLE_APPLICATION_CREDENTIALS environment variables), but now fall back to Google's default auth if those aren't set (a better experience IMO).

This made it more straightforward for me to run things locally, while also not rocking the boat too much with our current CI setup.

Comment on lines +38 to +40
    df = pd.DataFrame(records)
    df["timestamp"] = df["timestamp"].astype("datetime64[us, UTC]")
    yield df

jrbourbeau (Member Author)

I've added a us cast here. Previously this had ns time resolution. From what I can tell, BigQuery only stores timestamps at up to us resolution (see the docs https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp_type and this Stack Overflow answer https://stackoverflow.com/a/44307611). Without this cast, assert_eq starts raising due to timestamp resolution mismatches.

I'll admit I'm a bit stumped here. Clearly this used to work in the past somehow.
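
(A toy illustration of the mismatch, assuming pandas >= 2.0, which supports non-nanosecond resolutions; this is not code from the PR:)

import pandas as pd

# Default datetime columns are nanosecond resolution...
expected = pd.DataFrame({"timestamp": pd.to_datetime(["2024-01-01"], utc=True)})
print(expected["timestamp"].dtype)  # datetime64[ns, UTC]

# ...while the data read back from BigQuery is microsecond resolution,
# so strict dtype comparisons raise without an explicit cast.
expected["timestamp"] = expected["timestamp"].astype("datetime64[us, UTC]")
print(expected["timestamp"].dtype)  # datetime64[us, UTC]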

jrbourbeau (Member Author)

@tswast maybe you have a sense for any recent changes, or if I'm just wrong here

Contributor

I started hitting similar issues at some point when I bumped pandas versions. Could you be running a different version here than in the past?


We noticed some weirdness around Pandas 2.0 where we started getting microsecond precision back.

BigQuery itself hasn't changed AFAIK. We should always respond with us precision in the Arrow we return. I think it's just what pandas does with that data that's changed.

Contributor

Yeah, pandas started supporting non-nanosecond resolutions with 2.0.
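
(A quick way to see this behavior, assuming a reasonably recent pyarrow (>= 13) alongside pandas >= 2.0:)

import pyarrow as pa

# BigQuery returns Arrow data with microsecond-precision timestamps.
table = pa.table({"timestamp": pa.array([0], type=pa.timestamp("us", tz="UTC"))})

# With pandas >= 2.0 and recent pyarrow, the microsecond unit is
# preserved instead of being coerced to nanoseconds on conversion.
df = table.to_pandas()
print(df["timestamp"].dtype)  # datetime64[us, UTC]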

jrbourbeau (Member Author)

Hmm okay, thanks all for clarifying. I'm inclined to just go with the small test change here. We can always handle things in a follow-up PR as needed.



def test_read_gbq(df, table, client):
    project_id, dataset_id, table_id = table
    ddf = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)

-    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
+    assert list(df.columns) == list(ddf.columns)

jrbourbeau (Member Author)

Just generalizing a bit to make things more resilient to changing column names in the test DataFrame.

@@ -181,7 +179,7 @@ def write_existing_dataset(google_creds):
     [
         ("name", pa.string()),
         ("number", pa.uint8()),
-        ("timestamp", pa.timestamp("ns")),
+        ("timestamp", pa.timestamp("us")),

jrbourbeau (Member Author)

Corresponding change given the us resolution.

Comment on lines +16 to +17
  - pip:
      - git+https://github.com/dask/dask

jrbourbeau (Member Author)

This is because we need dask/dask#11233 for some tests to pass. A little unfortunate. Maybe we can change the tests to avoid the timezone issue.

Contributor

We can also remove it again after the release tomorrow, so it shouldn't be an issue.

-        output_name,
-        meta.columns,
-        [stream.name for stream in session.streams],
+    return dd.from_map(

Contributor

Could you pass the label into from_map to make the task prefix more descriptive?

jrbourbeau (Member Author)

Done
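
(For reference, a sketch of what that change might look like; label is a real dd.from_map keyword, and read_stream/stream_names/meta reuse the hypothetical names from the sketch earlier in this thread:)

# label sets a descriptive task-name prefix in the Dask graph,
# e.g. "read-gbq-..." instead of a generic "from_map-..." prefix.
ddf = dd.from_map(read_stream, stream_names, meta=meta, label="read-gbq")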

jrbourbeau merged commit 6e6bc41 into main on Jul 18, 2024
13 checks passed
jrbourbeau deleted the drop-py38 branch on July 18, 2024 at 16:34