
fix: sort objects in dicts/sets by normalize_token rather than str #10918

Closed
lgray wants to merge 3 commits

Conversation

lgray (Contributor) commented Feb 11, 2024

This ensures that __dask_tokenize__ is the only entrypoint that other modules need to care about when interacting with dask.base.tokenize (see the sketch below).

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

Closes dask-contrib/dask-awkward#468
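For context, a minimal sketch of the protocol this refers to. __dask_tokenize__ is dask's hook for customizing tokenization; the Custom class and its payload are made up for illustration:

from dask.base import tokenize

class Custom:
    """Hypothetical user type opting into dask tokenization."""

    def __init__(self, payload):
        self.payload = payload

    def __dask_tokenize__(self):
        # dask hashes whatever this returns, so tokenization never needs
        # to call str() or pickle on the object itself.
        return ("Custom", self.payload)

assert tokenize(Custom(1)) == tokenize(Custom(1))
assert tokenize(Custom(1)) != tokenize(Custom(2))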

GPUtester (Collaborator):

Can one of the admins verify this patch?

Admins can comment "ok to test" to allow this one PR to run, or "add to allowlist" to allow all future PRs from the same author to run.

lgray (Contributor Author) commented Feb 11, 2024

cf. #10898, #10905

@martindurant @crusaderky

lgray (Contributor Author) commented Feb 11, 2024

Here's a very minimal reproducer involving Awkward Array:

import awkward as ak

from dask.base import normalize_token

a = ak.Array([1, 2, 3, 4, 5])

# A typetracer-backed array; `report` records which buffers get touched.
layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
b = ak.Array(layout)

print(report.data_touched, report.shape_touched)

d = {"what": b}

# Sorting by normalize_token leaves the report untouched...
normalized = sorted(d.items(), key=lambda x: str(normalize_token(x)))

print(report.data_touched, report.shape_touched)

# ...while sorting by str() touches the array.
normalized = sorted(d.items(), key=str)

print(report.data_touched, report.shape_touched)

resulting in:

(coffea-dev-py311) lgray@visitor-122024646 coffea % python form_madness_repr_minimal.py
[] []
[] []
['node0'] ['node0', 'node0']

This means that str() is intrusive for these objects while normalize_token itself is not, and tokenization can instead be handled through __dask_tokenize__. This PR makes __dask_tokenize__ the sole entrypoint into the tokenize machinery.

This should aid the deterministic-keys effort in the long run.

lgray (Contributor Author) commented Feb 11, 2024

[screenshot of test results]

This appears to cause SubgraphCallables that are exactly equal to now also tokenize equally.
I'm not sure whether that's desired at this phase of development.

lgray (Contributor Author) commented Feb 11, 2024

Darn, this methodology has the unfortunate outcome that tokens are sensitive to the history of the graph: starting from a different graph and optimizing it does not produce the same token as constructing the optimized graph by hand.

This is why the SubgraphCallable tests are failing.
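A hedged sketch of what that history-sensitivity looks like (not code from this PR; it assumes SubgraphCallable tokenization sees the inner graph, e.g. by pickling it):

from dask.base import tokenize
from dask.optimization import SubgraphCallable

def inc(x):
    return x + 1

# An optimized subgraph constructed directly by hand...
by_hand = SubgraphCallable({"out": (inc, "x")}, "out", ["x"], name="f")

# ...versus an equivalent one produced by optimizing a larger graph,
# which still carries a dead key left over from its history.
from_opt = SubgraphCallable(
    {"dead-key": (inc, "x"), "out": (inc, "x")}, "out", ["x"], name="f"
)

assert by_hand(1) == from_opt(1) == 2           # same behavior...
assert tokenize(by_hand) != tokenize(from_opt)  # ...different tokens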

github-actions bot commented Feb 11, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files  ±0       15 suites  ±0     3h 11m 48s ⏱️ −6m 0s
 12 989 tests  +1    12 059 ✅ ±0       930 💤 +1     0 ❌ ±0
160 522 runs  +15   144 030 ✅ −4    16 492 💤 +19    0 ❌ ±0

Results for commit f1e196f. Comparison against base commit 7726d31.

♻️ This comment has been updated with latest results.

lgray (Contributor Author) commented Feb 11, 2024

Hmm, not fixed by including #10883 (in fact, this issue is made worse by that PR).

lgray (Contributor Author) commented Feb 11, 2024

OK, this now seems to be in reasonable shape: it doesn't break any existing rules and adds a test that demonstrates the problem.

After getting it all in shape I tried merging #10883 again, and that PR still causes the new test to fail.

agoose77 (Contributor):

@lgray I'd be tempted to remove the test dependency on awkward and instead make a mock type that raises an error in __str__ (see the sketch below).
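Something like this sketch (NoStr is a made-up name, not code from this PR):

from dask.base import tokenize

class NoStr:
    """Mock type whose __str__ must never run during tokenization."""

    def __str__(self):
        raise AssertionError("tokenize should not call str() on this object")

    def __dask_tokenize__(self):
        return "NoStr"

def test_tokenize_set_does_not_call_str():
    # With this PR's sorting by token this passes; with the old
    # key=str sorting, tokenizing the set would raise.
    tokenize({NoStr(), NoStr()})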

@@ -73,6 +73,7 @@ complete = [
 ]
 test = [
     "pandas[test]",
+    "awkward",
Collaborator review comment:

-1 from me. I'd rather not add test dependencies to our already heavy test suite.
Also this doesn't actually run the added test in CI.

lgray (Contributor Author) commented Feb 12, 2024

Superseded by #10919

lgray closed this Feb 12, 2024
@@ -1075,7 +1075,7 @@ def normalize_set(s):
     # Note: in some Python version / OS combinations, set order changes every
     # time you recreate the set (even within the same interpreter).
     # In most other cases, set ordering is consistent within the same interpreter.
-    return "set", _normalize_seq_func(sorted(s, key=str))
+    return "set", _normalize_seq_func(sorted(s, key=tokenize))
Collaborator review comment:

This is quite expensive, as it will tokenize everything twice: once as the sort key, and again when the sorted sequence itself is normalized. The benefit compared to str only shows up (1) when str is expensive and (2) when different objects have the same str output.
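If the double tokenization matters, one could imagine tokenizing each element exactly once and sorting by the precomputed token; a rough sketch (normalize_set_once is a made-up name, not dask code):

from dask.base import tokenize

def normalize_set_once(s):
    # Tokenize each element once and sort the tokens themselves; the
    # sorted token list is already a deterministic representation of the set.
    return "set", sorted(tokenize(x) for x in s)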

Successfully merging this pull request may close these issues:

  • massive touching performance regression when using dask 2024.2.0