Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes for dask-awkward one-pass optimize #1225

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

martindurant
Copy link

This makes the basic uproot tests in dask-contrib/dask-awkward#491 pass

cc @lgray

@@ -954,6 +954,7 @@ def __call__(self, form: Form) -> tuple[Form, TrivialFormMappingInfo]:
class UprootReadMixin:
base_form: Form
expected_form: Form
behavior = {}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how behaviours should be handled, so this probably needs updating

@@ -889,7 +888,9 @@ def keys_for_buffer_keys(self, buffer_keys: frozenset[str]) -> frozenset[str]:
keys: set[str] = set()
for buffer_key in buffer_keys:
# Identify form key
form_key, attribute = buffer_key.rsplit("-", maxsplit=1)
form_key, attribute = buffer_key.replace("@.", "<root>.").rsplit(
"-", maxsplit=1
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this fail for any of the styles of buffer name? We could use [0] or *attribute to guard against this (we don't actually use the attribute).

In fact, this whole loop could be a single comprehension.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear what characters are not allowed in a TBranch name, which gets passed over to the form_keys. It would be hard to use a : in a TBranch name, since that gets interpreted by ROOT (in some constructors, not all). On the other hand, . is very common and can't be a delimiter.

I tried to see what would break it:

root [0] TTree* t = new TTree("name", "title")
(TTree *) 0x6360d13c1720
root [1] (new TBranch(t, "something", nullptr, "leaf"))->GetName()
(const char *) "something"
root [2] (new TBranch(t, "some.thing", nullptr, "leaf"))->GetName()
(const char *) "some.thing"
root [3] (new TBranch(t, "some:thing", nullptr, "leaf"))->GetName()
(const char *) "some:thing"
root [4] (new TBranch(t, "some/thing", nullptr, "leaf"))->GetName()
(const char *) "some/thing"
root [5] (new TBranch(t, "some\"thing", nullptr, "leaf"))->GetName()
(const char *) "some"thing"
root [6] (new TBranch(t, "some`thing", nullptr, "leaf"))->GetName()
(const char *) "some`thing"
root [7] (new TBranch(t, "some!thing", nullptr, "leaf"))->GetName()
(const char *) "some!thing"
root [8] (new TBranch(t, "some@thing", nullptr, "leaf"))->GetName()
(const char *) "some@thing"
root [9] (new TBranch(t, "some#thing", nullptr, "leaf"))->GetName()
(const char *) "some#thing"
root [10] (new TBranch(t, "some+thing", nullptr, "leaf"))->GetName()
(const char *) "some+thing"
root [11] (new TBranch(t, "some(thing", nullptr, "leaf"))->GetName()
(const char *) "some(thing"
root [12] (new TBranch(t, "some{thing", nullptr, "leaf"))->GetName()
(const char *) "some{thing"
root [13] (new TBranch(t, "some[thing", nullptr, "leaf"))->GetName()
(const char *) "some[thing"
root [14] (new TBranch(t, "some\\thing", nullptr, "leaf"))->GetName()
(const char *) "some\thing"
root [15] (new TBranch(t, "some|thing", nullptr, "leaf"))->GetName()
(const char *) "some|thing"
root [16] (new TBranch(t, "some;thing", nullptr, "leaf"))->GetName()
(const char *) "some;thing"
root [17] (new TBranch(t, "some\0thing", nullptr, "leaf"))->GetName()
(const char *) "some"

Here's something we can count on: a ROOT TBranch name is not going to have a null byte (Python "\x00") in the middle of the string, whereas this is legal and would be properly carried around in Python. It seems pretty drastic, though.


def project(self: T, *, report: TypeTracerReport, state: dict) -> T:
keys = self.necessary_columns(report=report, state=state)
keys = [_buf_to_col(c).replace(".", "_") for c in columns]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgray : this is the simplistic way to unmap the column names. This line for the simple case, the if-block for nanoAOD, adding in implicit columns like nJet. Then, this doesn't actually make use of the schema and keys_for_buffer_keys

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Reminder to @lgray to take a look at this.)

@@ -889,7 +888,9 @@ def keys_for_buffer_keys(self, buffer_keys: frozenset[str]) -> frozenset[str]:
keys: set[str] = set()
for buffer_key in buffer_keys:
# Identify form key
form_key, attribute = buffer_key.rsplit("-", maxsplit=1)
form_key, attribute = buffer_key.replace("@.", "<root>.").rsplit(
"-", maxsplit=1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear what characters are not allowed in a TBranch name, which gets passed over to the form_keys. It would be hard to use a : in a TBranch name, since that gets interpreted by ROOT (in some constructors, not all). On the other hand, . is very common and can't be a delimiter.

I tried to see what would break it:

root [0] TTree* t = new TTree("name", "title")
(TTree *) 0x6360d13c1720
root [1] (new TBranch(t, "something", nullptr, "leaf"))->GetName()
(const char *) "something"
root [2] (new TBranch(t, "some.thing", nullptr, "leaf"))->GetName()
(const char *) "some.thing"
root [3] (new TBranch(t, "some:thing", nullptr, "leaf"))->GetName()
(const char *) "some:thing"
root [4] (new TBranch(t, "some/thing", nullptr, "leaf"))->GetName()
(const char *) "some/thing"
root [5] (new TBranch(t, "some\"thing", nullptr, "leaf"))->GetName()
(const char *) "some"thing"
root [6] (new TBranch(t, "some`thing", nullptr, "leaf"))->GetName()
(const char *) "some`thing"
root [7] (new TBranch(t, "some!thing", nullptr, "leaf"))->GetName()
(const char *) "some!thing"
root [8] (new TBranch(t, "some@thing", nullptr, "leaf"))->GetName()
(const char *) "some@thing"
root [9] (new TBranch(t, "some#thing", nullptr, "leaf"))->GetName()
(const char *) "some#thing"
root [10] (new TBranch(t, "some+thing", nullptr, "leaf"))->GetName()
(const char *) "some+thing"
root [11] (new TBranch(t, "some(thing", nullptr, "leaf"))->GetName()
(const char *) "some(thing"
root [12] (new TBranch(t, "some{thing", nullptr, "leaf"))->GetName()
(const char *) "some{thing"
root [13] (new TBranch(t, "some[thing", nullptr, "leaf"))->GetName()
(const char *) "some[thing"
root [14] (new TBranch(t, "some\\thing", nullptr, "leaf"))->GetName()
(const char *) "some\thing"
root [15] (new TBranch(t, "some|thing", nullptr, "leaf"))->GetName()
(const char *) "some|thing"
root [16] (new TBranch(t, "some;thing", nullptr, "leaf"))->GetName()
(const char *) "some;thing"
root [17] (new TBranch(t, "some\0thing", nullptr, "leaf"))->GetName()
(const char *) "some"

Here's something we can count on: a ROOT TBranch name is not going to have a null byte (Python "\x00") in the middle of the string, whereas this is legal and would be properly carried around in Python. It seems pretty drastic, though.


def project(self: T, *, report: TypeTracerReport, state: dict) -> T:
keys = self.necessary_columns(report=report, state=state)
keys = [_buf_to_col(c).replace(".", "_") for c in columns]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Reminder to @lgray to take a look at this.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants