Add benchmarks #14

Add benchmarks comparing PyBIDS, ancpBIDS, and bids2table for indexing and querying large-ish datasets.

Conversation
Codecov Report: Patch and project coverage have no change.

@@ Coverage Diff @@
##             main      #14   +/-   ##
=======================================
  Coverage   90.70%   90.70%
=======================================
  Files          10       10
  Lines         441      441
=======================================
  Hits          400      400
  Misses         41       41

☔ View full report in Codecov by Sentry.
Hey @adelavega, would love to get your thoughts on these benchmarks. Do you think the comparisons are fair? What do you think about the results (bids2table vs. pybids: roughly 4x faster indexing, 150x faster with 64 parallel workers, 90x smaller on disk, 20x faster queries)?
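For context, a minimal sketch of how a single indexing trial could be timed; the `bids2table` call signature here is an assumption about the public API, not the actual benchmark script:

```python
# Hypothetical single-trial timing sketch; not the actual benchmark code.
import time

from bids import BIDSLayout          # pybids
from bids2table import bids2table    # assumed public entry point

root = "/path/to/bids/dataset"       # placeholder path

start = time.perf_counter()
layout = BIDSLayout(root)            # pybids walks the tree and builds a SQLite index
pybids_sec = time.perf_counter() - start

start = time.perf_counter()
tab = bids2table(root, workers=8)    # assumed signature; parallel Parquet-backed indexing
b2t_sec = time.perf_counter() - start

print(f"pybids: {pybids_sec:.1f}s  bids2table: {b2t_sec:.1f}s")
```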
Overall looks good! bids2table certainly has nice advantages.

I tried it out on this dataset (https://openneuro.org/datasets/ds002837/versions/2.0.0), which shows similar differences:

- pybids
- pybids w/
- ancpbids
- bids2table, 1 core
- bids2table, 8 cores

Interesting that ancpbids is faster w/ one core, but I'm guessing it's because it doesn't read the JSON sidecars. Given that you've set this benchmark up, I would try it on several public datasets to get a better estimate of the performance diffs.
Other feedback on
Regarding the query benchmarks: interestingly, using a more efficient SQLAlchemy query (not currently implemented), I got a roughly 10x faster result in pybids. bids2table is still much faster!
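As a rough illustration of what a more efficient SQLAlchemy query could look like, here is a hedged sketch that pushes the filtering into a single set-based SQL statement; the `files`/`tags` table names are hypothetical stand-ins, only loosely modeled on pybids' internal schema:

```python
# Hypothetical schema: one set-based join instead of per-file ORM lookups.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///layout_index.sqlite")  # assumed index location

# Find all .nii.gz files for subject "01", task "rest" in one round trip.
stmt = text("""
    SELECT f.path
    FROM files AS f
    JOIN tags AS t_sub  ON t_sub.file_id  = f.id
    JOIN tags AS t_task ON t_task.file_id = f.id
    WHERE t_sub.entity  = 'subject' AND t_sub.value  = '01'
      AND t_task.entity = 'task'    AND t_task.value = 'rest'
      AND f.path LIKE '%.nii.gz'
""")

with engine.connect() as conn:
    paths = [row.path for row in conn.execute(stmt)]
```

Letting the database do the joins and filters avoids materializing ORM objects for every candidate file, which is usually where the order-of-magnitude difference comes from.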
Thanks so much for the feedback!
Totally agree. Filling out the benchmark with a few more datasets makes a lot of sense. Ideally with a range of sizes and on different machines. One or both of these factors could be part of the reason why ancpbids is faster than single-thread bids2table in your example.
Totally. This is also feedback I've gotten from others in my group. I think the pandas API is pretty flexible, but also pretty complicated and not all that well known. We've been discussing implementing a higher-level pybids-like API on top. Perhaps following the proposed redesign. This would also open the door for a possible merger down the road, if there was interest.
Ah, because of the inheritance? I'm considering just flattening out the fields in the sidecar column into their own columns, a la
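A minimal pandas sketch of that kind of flattening; the `json__` prefix and the `pd.json_normalize` approach are illustrative assumptions, not the implemented behavior:

```python
import pandas as pd

# Toy index with a dict-valued sidecar column.
df = pd.DataFrame({
    "path": ["sub-01/func/sub-01_task-rest_bold.nii.gz"],
    "sidecar": [{"RepetitionTime": 2.0, "EchoTime": 0.03}],
})

# Expand each sidecar dict into its own columns, with a prefix so the
# metadata group stays identifiable after flattening.
meta = pd.json_normalize(df["sidecar"].tolist()).add_prefix("json__")
flat = pd.concat([df.drop(columns="sidecar"), meta], axis=1)
# Columns: path, json__RepetitionTime, json__EchoTime
```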
Ya, we were pretty surprised at how bad pybids was here. We chalked it up as an outlier; I'll try to dig into it more. A side note on query performance: although pandas is more than good enough here, there are now even more optimized dataframe libraries (e.g. polars, duckdb), all of which interface well with Arrow/Parquet. So there should be room for even better performance.
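To make the alternative concrete, a tiny sketch of querying a Parquet index in place with duckdb (the file and column names are hypothetical):

```python
import duckdb

# Query the Parquet file directly; convert to pandas only at the end.
result = duckdb.query("""
    SELECT path
    FROM 'index.parquet'
    WHERE subject = '01' AND task = 'rest'
""").df()
```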
+1 on this. Re: other libraries, I would argue querying is quite fast as it is, so I would err on the side of familiarity (pandas might be complex, but at least it's familiar). Although we could consider another db for the redesigned API project.
To expand a bit: I think we should focus on optimizing indexing time more than querying time. As an example, PyBIDS has some unacceptably slow queries, but looking at the worst one, it's 1.2 s; if we used SQLAlchemy more efficiently, it would be an order of magnitude faster, i.e. about 0.12 s. That tells me that any of these solutions will be performant enough, provided the translation between the high-level API and the low-level querying language is done properly (which is the main problem in PyBIDS). Obviously bids2table is orders of magnitude faster, which is cool and useful, but it's just to say that above a certain floor of performance, we should use other heuristics to guide us. Where PyBIDS really struggled is indexing time, and that's where we got most complaints. So I see that as bids2table's biggest contribution. Let's keep this in mind when building a high-level API, because sometimes the most difficult thing is mapping that easy-to-use query language in a way that performs.
The multi-index columns have been clumsy for some folks. Now return a dataframe with flat columns and a string prefix indicating the column group. It is still possible to convert to a hierarchical index, select a particular group, or drop the group level in post-processing. This usage is shown in the example.

Other changes:

- Rename sidecar -> json, following the suggestion in #14.
- By default, search for inherited metadata until a `dataset_description.json` is found, following the suggestion in #15.
- Fix `incremental` mode by specifying the correct path and modified time column names.
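For illustration, a short pandas sketch of the round trip between flat prefixed columns and a hierarchical index; the `group__field` separator and column names are assumed conventions, not necessarily those used in the example:

```python
import pandas as pd

# Toy table with flat, group-prefixed columns (separator is assumed).
df = pd.DataFrame({
    "entities__subject": ["01"],
    "entities__task": ["rest"],
    "json__RepetitionTime": [2.0],
})

# Convert back to a hierarchical (group, field) column index...
df.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("__", 1)) for col in df.columns]
)

entities = df["entities"]       # ...select a particular group...
flat = df.droplevel(0, axis=1)  # ...or drop the group level entirely.
```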
Merging even though a few improvements to the benchmarks are still needed; these will be addressed in future PRs.