Implement inner/outer/left/right joins #262

Open
lgray opened this issue May 13, 2023 · 3 comments

Comments

@lgray
Collaborator

lgray commented May 13, 2023

HEP analyses will need to join small columns of privately produced data with a larger central dataset.
This is not yet possible in dask_awkward but is reasonably well trodden in dask-array and dask-dataframe.

The goal would be to take two datasets with a shared key definition (like run number, luminosity block, and event number) and merge them into a single unified dataset that a dask_awkward analysis can handle cleanly. That means accounting for missing keys, file skew, etc., i.e. the usual general distributed table join operations à la SQL.
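To make the intended semantics concrete, here is a minimal pure-Python sketch of the four join types on a composite key. It is only an illustration under stated assumptions: plain lists of dicts stand in for record arrays, the field names (`run`, `luminosityBlock`, `event`, `met`, `score`) and the function `join_records` are hypothetical, and nothing here is an existing dask_awkward API.

```python
# Hypothetical key fields; a real analysis would choose its own.
KEY = ("run", "luminosityBlock", "event")


def join_records(left, right, how="inner", key=KEY):
    """Hash join of two lists of dict records on a composite key.

    Rows without a partner get None for the other side's fields.
    Simplification: if both sides share a non-key field, the right value wins.
    """
    if how not in ("inner", "left", "right", "outer"):
        raise ValueError(f"unsupported join type: {how!r}")

    def key_of(rec):
        return tuple(rec[f] for f in key)

    left_fields = {f for rec in left for f in rec} - set(key)
    right_fields = {f for rec in right for f in rec} - set(key)

    # Index the right side by key so each left row is a dictionary lookup.
    index = {}
    for rec in right:
        index.setdefault(key_of(rec), []).append(rec)

    out, matched = [], set()
    for lrec in left:
        k = key_of(lrec)
        if k in index:
            matched.add(k)
            out.extend({**lrec, **rrec} for rrec in index[k])
        elif how in ("left", "outer"):
            # Key missing on the right: keep the row, fill right fields with None.
            out.append({**lrec, **dict.fromkeys(right_fields)})

    if how in ("right", "outer"):
        # Keys that only exist on the right side.
        for k, rrecs in index.items():
            if k not in matched:
                out.extend({**dict.fromkeys(left_fields), **rrec} for rrec in rrecs)
    return out


# Made-up example data: a "central" set and a smaller "private" set.
central = [
    {"run": 1, "luminosityBlock": 10, "event": 100, "met": 35.2},
    {"run": 1, "luminosityBlock": 10, "event": 101, "met": 12.7},
    {"run": 2, "luminosityBlock": 5, "event": 7, "met": 80.1},
]
private = [
    {"run": 1, "luminosityBlock": 10, "event": 101, "score": 0.91},
    {"run": 3, "luminosityBlock": 1, "event": 1, "score": 0.12},
]

inner = join_records(central, private, how="inner")
left = join_records(central, private, how="left")
right = join_records(central, private, how="right")
outer = join_records(central, private, how="outer")
```

With this data only event 101 appears on both sides, so the inner join has one row, the left join keeps all three central rows, the right join keeps both private rows, and the outer join has four rows.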

Related to / likely depends on #250

@lgray
Collaborator Author

lgray commented May 13, 2023

@iasonkrom

@martindurant
Collaborator

Joins on a single partition of a concrete in-memory array are trivial for us to implement.

> is reasonably well trodden in dask-array and dask-dataframe.

Only the latter? A dataframe join depends on a well-understood concept of an index, which doesn't really apply to our arrays. Is such a join always "row-wise", meaning by top-level array item rather than at some deeper nesting level in the schema?

@lgray
Collaborator Author

lgray commented May 15, 2023

It applies to records, at least in HEP.
Indeed, it's row-wise: we would specify some key (like run number, luminosity block, event number) and then join two record arrays on it.

I think if we could get the row-wise implementation first, the more arbitrary keys could come later; row-wise should be by far the most common and useful case. The distributed join would be the most important thing to address, since the in-memory part is easy, as you say.
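One common way to approach the distributed part is a hash-partitioned shuffle: re-bucket both datasets by a hash of the key so that matching rows land in the same partition, then run the easy in-memory join on each pair of partitions. The sketch below is a rough pure-Python illustration of that idea, not a proposal for the actual dask_awkward implementation; lists of dicts stand in for partitions of record arrays, and all function and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical key fields.
KEY = ("run", "luminosityBlock", "event")


def key_of(rec):
    return tuple(rec[f] for f in KEY)


def shuffle(partitions, npartitions):
    """Re-bucket records so that equal keys land in the same output partition."""
    buckets = [[] for _ in range(npartitions)]
    for part in partitions:
        for rec in part:
            buckets[hash(key_of(rec)) % npartitions].append(rec)
    return buckets


def inner_join_partition(left, right):
    """The 'easy' in-memory step: hash join within one partition pair."""
    index = defaultdict(list)
    for rec in right:
        index[key_of(rec)].append(rec)
    return [{**l, **r} for l in left for r in index.get(key_of(l), [])]


def distributed_inner_join(left_parts, right_parts, npartitions):
    left_buckets = shuffle(left_parts, npartitions)
    right_buckets = shuffle(right_parts, npartitions)
    # Each bucket pair is independent, so a scheduler could run these in parallel.
    return [inner_join_partition(l, r) for l, r in zip(left_buckets, right_buckets)]


# Made-up input: keys are spread over input partitions ("files") in no particular order.
left_parts = [
    [
        {"run": 1, "luminosityBlock": 1, "event": 1, "met": 10.0},
        {"run": 1, "luminosityBlock": 1, "event": 2, "met": 20.0},
    ],
    [{"run": 1, "luminosityBlock": 2, "event": 3, "met": 30.0}],
]
right_parts = [
    [{"run": 1, "luminosityBlock": 2, "event": 3, "score": 0.3}],
    [
        {"run": 1, "luminosityBlock": 1, "event": 1, "score": 0.1},
        {"run": 9, "luminosityBlock": 9, "event": 9, "score": 0.9},
    ],
]

joined = distributed_inner_join(left_parts, right_parts, npartitions=3)
flat = sorted((rec for part in joined for rec in part), key=key_of)
```

The shuffle is where the cost and the skew problems live: every record may move, and a badly distributed key can overload one bucket. The per-partition step stays the simple in-memory join.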
