
Drop Python 3.8, add Python 3.11 and 3.12, and use dd.from_map #81

Merged
jrbourbeau merged 8 commits into main from drop-py38 on Jul 18, 2024

Conversation

jrbourbeau (Member)

Let's see if CI is happy here

xref #80

jrbourbeau changed the title from "Drop Python 3.8 and add Python 3.11" to "Drop Python 3.8, add Python 3.11 and 3.12, and use dd.from_map" on Jul 17, 2024
jrbourbeau (Member Author) left a comment

Okay, so I've revived this PR and also expanded its scope a bit. This PR:

  • Drops Python 3.8
  • Adds Python 3.11 and 3.12
  • Fixes pre-commit CI build
  • Gets tests passing
  • Switches from using DataFrameIOLayer to the more modern dd.from_map (see the sketch below)
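
For context, here is a minimal, hedged sketch of the dd.from_map pattern this PR moves to. The read_stream helper and stream_names list are hypothetical stand-ins for the BigQuery read-session plumbing, not the actual dask-bigquery code:

import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-ins: each "stream" becomes one pandas DataFrame,
# i.e. one output partition of the resulting Dask DataFrame.
stream_names = ["stream-0", "stream-1"]

def read_stream(stream_name):
    # In dask-bigquery this would pull rows from a BigQuery Storage read
    # stream; here it just builds a toy frame.
    return pd.DataFrame({"name": [stream_name], "number": [1]})

# Empty frame describing the output schema, so Dask doesn't have to guess.
meta = pd.DataFrame({"name": pd.Series(dtype="object"),
                     "number": pd.Series(dtype="int64")})

# dd.from_map maps read_stream over the inputs, one partition per element.
ddf = dd.from_map(read_stream, stream_names, meta=meta)
print(ddf.compute())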

Comment on lines +112 to +130
@pytest.fixture(scope="module")
def project_id():
    project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID")
    if not project_id:
        _, project_id = google.auth.default()

    yield project_id


@pytest.fixture
def google_creds():
    env_creds_file = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if env_creds_file:
        credentials = json.load(open(env_creds_file))
    elif os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"):
        credentials = json.loads(os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"))
    else:
        credentials, _ = google.auth.default()

    yield credentials

jrbourbeau (Member Author)

These changes keep the current testing setup behavior (i.e. they still support the DASK_BIGQUERY_GCP_CREDENTIALS, DASK_BIGQUERY_PROJECT_ID, and GOOGLE_APPLICATION_CREDENTIALS environment variables), but now fall back to Google's default auth if those aren't set (a better experience IMO).

This made it more straightforward for me to run things locally, while also not rocking the boat too much with our current CI setup.

Comment on lines +38 to +40
    df = pd.DataFrame(records)
    df["timestamp"] = df["timestamp"].astype("datetime64[us, UTC]")
    yield df

jrbourbeau (Member Author)

I've added a us cast here. Previously this had ns time resolution. From what I can tell, BigQuery only stores timestamps at up to us resolution (see the docs https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp_type and this Stack Overflow answer https://stackoverflow.com/a/44307611). Without this cast, assert_eq starts raising due to timestamp resolution mismatches.

I'll admit I'm a bit stumped here. Clearly this used to work in the past somehow.
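
(A toy illustration of the mismatch, assuming pandas >= 2.0, which supports non-nanosecond resolutions; this is not code from the PR:)

import pandas as pd

# Default datetime columns are nanosecond resolution...
expected = pd.DataFrame({"timestamp": pd.to_datetime(["2024-01-01"], utc=True)})
print(expected["timestamp"].dtype)  # datetime64[ns, UTC]

# ...while the data read back from BigQuery is microsecond resolution,
# so strict dtype comparisons raise without an explicit cast.
expected["timestamp"] = expected["timestamp"].astype("datetime64[us, UTC]")
print(expected["timestamp"].dtype)  # datetime64[us, UTC]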

jrbourbeau (Member Author)

@tswast maybe you have a sense for any recent changes, or if I'm just wrong here

Contributor

I started hitting similar issues at some point when I bumped pandas versions. Could you be running a different version here than in the past?


We noticed some weirdness around Pandas 2.0 where we started getting microsecond precision back.

BigQuery itself hasn't changed AFAIK. We should always respond with us precision in the Arrow we return. I think it's just what pandas does with that data that's changed.

Contributor

Yeah, pandas started supporting non-nanosecond resolutions with 2.0.
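
(A quick way to see this behavior, assuming a reasonably recent pyarrow (>= 13) alongside pandas >= 2.0:)

import pyarrow as pa

# BigQuery returns Arrow data with microsecond-precision timestamps.
table = pa.table({"timestamp": pa.array([0], type=pa.timestamp("us", tz="UTC"))})

# With pandas >= 2.0 and recent pyarrow, the microsecond unit is
# preserved instead of being coerced to nanoseconds on conversion.
df = table.to_pandas()
print(df["timestamp"].dtype)  # datetime64[us, UTC]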

jrbourbeau (Member Author)

Hmm okay, thanks all for clarifying. I'm inclined to just go with the small test change here. We can always handle things in a follow-up PR as needed.



def test_read_gbq(df, table, client):
    project_id, dataset_id, table_id = table
    ddf = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)

-    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
+    assert list(df.columns) == list(ddf.columns)

jrbourbeau (Member Author)

Just generalizing a bit to make things more resilient to changing column names in the test DataFrame.

@@ -181,7 +179,7 @@ def write_existing_dataset(google_creds):
     [
         ("name", pa.string()),
         ("number", pa.uint8()),
-        ("timestamp", pa.timestamp("ns")),
+        ("timestamp", pa.timestamp("us")),

jrbourbeau (Member Author)

Corresponding change given the us resolution.

Comment on lines +16 to +17
  - pip:
      - git+https://github.com/dask/dask

jrbourbeau (Member Author)

This is because we need dask/dask#11233 for some tests to pass. A little unfortunate. Maybe we can change the tests to avoid the timezone issue.

Contributor

We can also remove it again after the release tomorrow, so it shouldn't be an issue.

-        output_name,
-        meta.columns,
-        [stream.name for stream in session.streams],
+    return dd.from_map(

Contributor

Could you pass the label into from_map to make the task prefix more descriptive?

jrbourbeau (Member Author)

Done
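
(For reference, a sketch of what that change might look like; label is a real dd.from_map keyword, and read_stream/stream_names/meta reuse the hypothetical names from the sketch earlier in this thread:)

# label sets a descriptive task-name prefix in the Dask graph,
# e.g. "read-gbq-..." instead of a generic "from_map-..." prefix.
ddf = dd.from_map(read_stream, stream_names, meta=meta, label="read-gbq")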

jrbourbeau merged commit 6e6bc41 into main on Jul 18, 2024
13 checks passed
jrbourbeau deleted the drop-py38 branch on July 18, 2024 at 16:34