groot/bench-opendata: fix task-7 #5
- consider all leptons as a whole
- correct isolation criteria

Fixes go-hep#4.
Thanks for the fast reply and fix! The code looks good to me (but I learned Go just for the purpose of running this benchmark, so don't trust me). I did confirm that the resulting histogram is exactly the same as three other implementations: (1) mine in Go, (2) the Coffea implementation, (3) one I am writing in SQL. |
cool! (out of curiosity, how was learning Go? would you recommend it for other HEP/astro/... scientists?) |
I found it at least fairly readable; most of my "learning" consisted of guessing what your code does. It also seems pretty concise. I don't know about performance, which is probably one aspect scientists care about. Also, as a researcher in database systems, I think that SQL might be a good (better?) choice ;) |
thanks.
that last statement may be a bit biased :P
(so once you have rewritten all these examples into a set of fancy SQL queries, you could do the same with |
Yes, I suspect so :) Thanks a lot for the pointers, I'll take a look! May I ask why you abandoned that direction? Why is SQL not suited? |
SQL is fine but a bit verbose for "Joe the physicist". |
PartiQL was an experiment with the idea of an SQL model (collection of tables with different lengths, related to one another through foreign keys) applied to each event of a physics dataset.

When an SQL model is applied to whole datasets (i.e. a "table of events") then it's possible to express HEP queries, but extremely cumbersome because every query is a "join on eventId" as we're never interested in relating quantities from one event to another event, and the handling of combinatorics of particles within an event gets intricate. This question I asked 4 years ago convinced me that the SQL model is inappropriate for HEP datasets.

However, applying an SQL model to each event of a HEP dataset does make some sense: each particle collection (e.g. jets, muons, ...) has a different number of rows and possibly different columns for attributes, so each particle collection in one event is an SQL table. We have connections between these tables using foreign keys, usually row number for a particular sorting of the tables; this is a surrogate key. This event-as-database was Gordon Watts's idea, which he is developing as func_adl. PartiQL was my experiment with the idea: it was originally going to be SQL syntax, but real SQL was still too cumbersome (we always want to join on these surrogate keys, never any natural keys), so only a few keywords like

Finally, if we do go the event-as-database route, it would be an entirely different database implementation from what's out there. Most databases are optimized for big tables; this would have to be optimized for many little tables, as the number of particles per event is 0‒2 for some particle types. Instead of a table with a billion rows, we have a billion tables with maybe 1 row. Time complexity for each operation (i.e. O(n) vs O(log(n)) where n is the number of elements in a table; the number that participate in a join) becomes much less important than setup and tear-down times when iterating from one event (collection of tables) to the next. It's an unusual problem that would require unusual solutions.

Surely I'm digressing from the focus of this PR, though... |