

groot/bench-opendata: fix task-7 #5

Merged: 1 commit merged into go-hep:master from issue-4 on Apr 17, 2020

Conversation
Conversation

@sbinet (Member) commented on Apr 16, 2020

  • consider all leptons as a whole
  • correct isolation criteria

Fixes #4.

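The two bullets can be sketched in Go; a hypothetical illustration (the `Lepton` type, `countIsolated`, and the 0.10 cut are inventions for this sketch, not the benchmark's actual types or criteria) of merging all lepton flavours into one collection before applying a single isolation cut:

```go
package main

import "fmt"

// Lepton is a hypothetical, simplified lepton record (electron or muon).
type Lepton struct {
	Flavour string  // "e" or "mu"
	Pt      float64 // transverse momentum (GeV)
	Iso     float64 // isolation variable (e.g. summed pT of nearby tracks)
}

// countIsolated applies a relative-isolation cut (Iso/Pt < max) to a
// single merged slice of leptons, regardless of flavour.
func countIsolated(leptons []Lepton, max float64) int {
	n := 0
	for _, l := range leptons {
		if l.Iso/l.Pt < max {
			n++
		}
	}
	return n
}

func main() {
	// One event's electrons and muons, merged before the cut is applied,
	// rather than each flavour collection being treated on its own.
	elecs := []Lepton{{"e", 25, 1.0}}
	muons := []Lepton{{"mu", 40, 2.0}, {"mu", 15, 6.0}}
	leptons := append(append([]Lepton{}, elecs...), muons...)

	fmt.Println(countIsolated(leptons, 0.10)) // 2 leptons pass the cut
}
```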
@sbinet mentioned this pull request on Apr 16, 2020
@ingomueller-net (Contributor) commented:
Thanks for the fast reply and fix! The code looks good to me (but I learned Go just for the purpose of running this benchmark, so don't trust me). I did confirm that the resulting histogram is exactly the same as three other implementations: (1) mine in Go, (2) the Coffea implementation, (3) one I am writing in SQL.

@sbinet (Member, Author) commented on Apr 17, 2020

cool!

(out of curiosity, how was learning Go? would you recommend it to other HEP/astro/... scientists?)

@sbinet merged commit 34f4405 into go-hep:master on Apr 17, 2020
@sbinet deleted the issue-4 branch on Apr 17, 2020 at 10:43
@ingomueller-net (Contributor) commented:

I found it at least fairly readable; most of my "learning" consisted of guessing what your code does. It also seems pretty concise. I don't know about performance, which is probably one aspect scientists care about. Also, as a researcher in database systems, I think that SQL might be a good (better?) choice ;)

@sbinet (Member, Author) commented on Apr 17, 2020

thanks.

> Also, as a researcher in database systems, I think that SQL might be a good (better?) choice ;)

that last statement may be a bit biased :P
at least it isn't the same bias as mine.
even if I did dabble in those waters:

(so once you have rewritten all these examples into a set of fancy SQL queries, you could do the same with groot/rsql :P)

@ingomueller-net (Contributor) commented:

Yes, I suspect so :) Thanks a lot for the pointers, I'll take a look! May I ask why you abandoned that direction? Why is SQL not well suited?

@sbinet (Member, Author) commented on Apr 17, 2020

SQL is fine but a bit verbose for "Joe the physicist".
@jpivarski has been working on creating an SQL-like language more (better?) tailored to the kind of work HEP people do:

@jpivarski commented:

PartiQL was an experiment with the idea of an SQL model (collection of tables with different lengths, related to one another through foreign keys) applied to each event of a physics dataset. When an SQL model is applied to whole datasets (i.e. a "table of events") then it's possible to express HEP queries, but extremely cumbersome because every query is a "join on eventId" as we're never interested in relating quantities from one event to another event, and the handling of combinatorics of particles within an event gets intricate. This question I asked 4 years ago convinced me that the SQL model is inappropriate for HEP datasets.
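The join-on-eventId point can be sketched in Go (the `Muon` type and data are hypothetical, and the SQL in the comment is illustrative, not a tested query): in a whole-dataset table model, pairing particles within an event is a self-join on `eventId`, while per-event iteration expresses the same combinatorics as two nested loops.

```go
package main

import "fmt"

// Muon is a hypothetical flat record, as it would appear in a
// whole-dataset "table of muons" keyed by eventId.
type Muon struct {
	EventID int
	Charge  int
	Pt      float64
}

// pairsPerEvent counts opposite-charge muon pairs within each event.
// On the flat table this would be an SQL self-join, roughly:
//
//	SELECT ... FROM muons a JOIN muons b
//	  ON a.eventId = b.eventId AND a.charge != b.charge ...
//
// whereas grouping by event reduces it to nested loops per event.
func pairsPerEvent(muons []Muon) map[int]int {
	byEvent := map[int][]Muon{}
	for _, m := range muons {
		byEvent[m.EventID] = append(byEvent[m.EventID], m)
	}
	counts := map[int]int{}
	for id, ms := range byEvent {
		for i := 0; i < len(ms); i++ {
			for j := i + 1; j < len(ms); j++ {
				if ms[i].Charge*ms[j].Charge < 0 {
					counts[id]++
				}
			}
		}
	}
	return counts
}

func main() {
	muons := []Muon{
		{0, +1, 30}, {0, -1, 25}, // event 0: one opposite-charge pair
		{1, +1, 40}, {1, +1, 20}, // event 1: same charges, no pair
	}
	fmt.Println(pairsPerEvent(muons)) // map[0:1]
}
```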

However, applying an SQL model to each event of a HEP dataset does make some sense: each particle collection (e.g. jets, muons, ...) has a different number of rows and possibly different columns for attributes, so each particle collection in one event is an SQL table. We have connections between these tables using foreign keys, usually row number for a particular sorting of the tables—this is a surrogate key.
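A minimal sketch of this event-as-database picture, assuming hypothetical `Jet`/`Muon`/`Event` types (this is not func_adl or PartiQL code): each particle collection is a tiny table, and a row number acts as the surrogate key linking one table to another.

```go
package main

import "fmt"

// Jet and Muon are rows of two small per-event "tables"; the fields
// are hypothetical.
type Jet struct {
	Pt      float64
	MuonRow int // surrogate key: row number in the event's muon table, -1 if none
}

type Muon struct {
	Pt float64
}

// Event is one "database": a collection of small tables.
type Event struct {
	Jets  []Jet
	Muons []Muon
}

// matchedMuonPt resolves the surrogate-key "join": for each jet with a
// matched muon, look up that muon's pT by row number.
func matchedMuonPt(ev Event) []float64 {
	out := make([]float64, 0, len(ev.Jets))
	for _, j := range ev.Jets {
		if j.MuonRow >= 0 && j.MuonRow < len(ev.Muons) {
			out = append(out, ev.Muons[j.MuonRow].Pt)
		}
	}
	return out
}

func main() {
	ev := Event{
		Jets:  []Jet{{Pt: 50, MuonRow: 1}, {Pt: 35, MuonRow: -1}},
		Muons: []Muon{{Pt: 10}, {Pt: 22}},
	}
	fmt.Println(matchedMuonPt(ev)) // [22]
}
```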

This event-as-database was Gordon Watts's idea, which he is developing as func_adl. PartiQL was my experiment with the idea: it was originally going to be SQL syntax, but real SQL was still too cumbersome (we always want to join on these surrogate keys, never any natural keys), so only a few keywords like JOIN are derived from SQL. It's also being more seriously developed by Lindsey Gray as AwkwardQL.

Finally, if we do go the event-as-database route, it would be an entirely different database implementation from what's out there. Most databases are optimized for big tables; this would have to be optimized for many little tables, as the number of particles per event is 0‒2 for some particle types. Instead of a table with a billion rows, we have a billion tables with maybe 1 row. Time complexity for each operation (i.e. O(n) vs O(log(n)) where n is the number of elements in a table; the number that participate in a join) becomes much less important than setup and tear-down times when iterating from one event (collection of tables) to the next. It's an unusual problem that would require unusual solutions.

Surely I'm digressing from the focus of this PR, though...

This pull request closes: Possible bugs in Task 7 (#4)