
fix: sort objects in dicts/sets by normalize_token rather than str #10918

Closed
lgray wants to merge 3 commits

Conversation

lgray (Contributor) commented Feb 11, 2024

This ensures that __dask_tokenize__ is the only entrypoint that other modules need to care about when interacting with dask.base.tokenize (see the sketch below).

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

Closes dask-contrib/dask-awkward#468
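For context, a minimal sketch of the protocol this refers to. __dask_tokenize__ is dask's hook for customizing tokenization; the Custom class and its payload are made up for illustration:

from dask.base import tokenize

class Custom:
    """Hypothetical user type opting into dask tokenization."""

    def __init__(self, payload):
        self.payload = payload

    def __dask_tokenize__(self):
        # dask hashes whatever this returns, so tokenization never needs
        # to call str() or pickle on the object itself.
        return ("Custom", self.payload)

assert tokenize(Custom(1)) == tokenize(Custom(1))
assert tokenize(Custom(1)) != tokenize(Custom(2))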

GPUtester (Collaborator):

Can one of the admins verify this patch?

Admins can comment "ok to test" to allow this one PR to run, or "add to allowlist" to allow all future PRs from the same author to run.

lgray (Contributor Author) commented Feb 11, 2024

cf. #10898, #10905

@martindurant @crusaderky

lgray (Contributor Author) commented Feb 11, 2024

Here's a very minimal reproducer involving Awkward Array:

import awkward as ak

from dask.base import normalize_token

a = ak.Array([1, 2, 3, 4, 5])

# A typetracer-backed array; `report` records which buffers get touched.
layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
b = ak.Array(layout)

print(report.data_touched, report.shape_touched)

d = {"what": b}

# Sorting by normalize_token leaves the report untouched...
normalized = sorted(d.items(), key=lambda x: str(normalize_token(x)))

print(report.data_touched, report.shape_touched)

# ...while sorting by str() touches the array.
normalized = sorted(d.items(), key=str)

print(report.data_touched, report.shape_touched)

resulting in:

(coffea-dev-py311) lgray@visitor-122024646 coffea % python form_madness_repr_minimal.py
[] []
[] []
['node0'] ['node0', 'node0']

This means that str() is intrusive for these objects while normalize_token itself is not, and tokenization can instead be handled through __dask_tokenize__. This PR makes __dask_tokenize__ the sole entrypoint into the tokenize machinery.

This should aid the deterministic-keys effort in the long run.

lgray (Contributor Author) commented Feb 11, 2024

[screenshot of test results]

This appears to cause SubgraphCallables that are exactly equal to now also tokenize equally.
I'm not sure whether that's desired at this phase of development.

lgray (Contributor Author) commented Feb 11, 2024

Darn, this methodology has the unfortunate outcome that tokens are sensitive to the history of the graph: starting from a different graph and optimizing it does not produce the same token as constructing the optimized graph by hand.

This is why the SubgraphCallable tests are failing.
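A hedged sketch of what that history-sensitivity looks like (not code from this PR; it assumes SubgraphCallable tokenization sees the inner graph, e.g. by pickling it):

from dask.base import tokenize
from dask.optimization import SubgraphCallable

def inc(x):
    return x + 1

# An optimized subgraph constructed directly by hand...
by_hand = SubgraphCallable({"out": (inc, "x")}, "out", ["x"], name="f")

# ...versus an equivalent one produced by optimizing a larger graph,
# which still carries a dead key left over from its history.
from_opt = SubgraphCallable(
    {"dead-key": (inc, "x"), "out": (inc, "x")}, "out", ["x"], name="f"
)

assert by_hand(1) == from_opt(1) == 2           # same behavior...
assert tokenize(by_hand) != tokenize(from_opt)  # ...different tokens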

github-actions bot commented Feb 11, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files  ±0       15 suites  ±0     3h 11m 48s ⏱️ −6m 0s
 12 989 tests  +1    12 059 ✅ ±0       930 💤 +1     0 ❌ ±0
160 522 runs  +15   144 030 ✅ −4    16 492 💤 +19    0 ❌ ±0

Results for commit f1e196f. Comparison against base commit 7726d31.

♻️ This comment has been updated with latest results.

lgray (Contributor Author) commented Feb 11, 2024

Hmm, not fixed by including #10883 (in fact, this issue is made worse by that PR).

lgray (Contributor Author) commented Feb 11, 2024

OK, this now seems to be in reasonable shape: it doesn't break any existing rules and adds a test that demonstrates the problem.

After getting it all in shape I tried merging #10883 again, and that PR still causes the new test to fail.

agoose77 (Contributor):

@lgray I'd be tempted to remove the test dependency on awkward and instead make a mock type that raises an error in __str__ (see the sketch below).
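Something like this sketch (NoStr is a made-up name, not code from this PR):

from dask.base import tokenize

class NoStr:
    """Mock type whose __str__ must never run during tokenization."""

    def __str__(self):
        raise AssertionError("tokenize should not call str() on this object")

    def __dask_tokenize__(self):
        return "NoStr"

def test_tokenize_set_does_not_call_str():
    # With this PR's sorting by token this passes; with the old
    # key=str sorting, tokenizing the set would raise.
    tokenize({NoStr(), NoStr()})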

@@ -73,6 +73,7 @@ complete = [
 ]
 test = [
     "pandas[test]",
+    "awkward",
Collaborator review comment:

-1 from me. I'd rather not add test dependencies to our already heavy test suite.
Also this doesn't actually run the added test in CI.

lgray (Contributor Author) commented Feb 12, 2024

Superseded by #10919

lgray closed this Feb 12, 2024
@@ -1075,7 +1075,7 @@ def normalize_set(s):
     # Note: in some Python version / OS combinations, set order changes every
     # time you recreate the set (even within the same interpreter).
     # In most other cases, set ordering is consistent within the same interpreter.
-    return "set", _normalize_seq_func(sorted(s, key=str))
+    return "set", _normalize_seq_func(sorted(s, key=tokenize))
Collaborator review comment:

This is quite expensive, as it will tokenize everything twice: once as the sort key, and again when the sorted sequence itself is normalized. The benefit compared to str only shows up (1) when str is expensive and (2) when different objects have the same str output.
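If the double tokenization matters, one could imagine tokenizing each element exactly once and sorting by the precomputed token; a rough sketch (normalize_set_once is a made-up name, not dask code):

from dask.base import tokenize

def normalize_set_once(s):
    # Tokenize each element once and sort the tokens themselves; the
    # sorted token list is already a deterministic representation of the set.
    return "set", sorted(tokenize(x) for x in s)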

Successfully merging this pull request may close these issues:

  • massive touching performance regression when using dask 2024.2.0