
Add benchmarks #14

Merged
clane9 merged 7 commits into main from feat/benchmark on Aug 9, 2023
Conversation

@clane9 (Collaborator) commented Aug 1, 2023

Add benchmarks comparing PyBIDS, ancpBIDS, and bids2table for indexing and querying large-ish datasets.

codecov bot commented Aug 1, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (886d184) 90.70% compared to head (666c854) 90.70%.

❗ Current head 666c854 differs from pull request most recent head 1c0ac57. Consider uploading reports for the commit 1c0ac57 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #14   +/-   ##
=======================================
  Coverage   90.70%   90.70%           
=======================================
  Files          10       10           
  Lines         441      441           
=======================================
  Hits          400      400           
  Misses         41       41           


@clane9 (Collaborator, Author) commented Aug 1, 2023

Hey @adelavega, would love to get your thoughts on these benchmarks. Do you think the comparisons are fair? What do you think about the results (bids2table vs pybids: roughly 4x faster indexing, 150x faster with 64 workers in parallel, 90x less size on disk, 20x faster queries)?

@adelavega commented Aug 1, 2023

Overall looks good!

bids2table certainly has nice advantages:

  • if you're familiar with the pandas API, you can do arbitrary indexing, and it is more performant
  • faster indexing than PyBIDS
  • lower memory footprint
  • can parallelize (which makes the differences bigger)

I tried it out on this dataset (https://openneuro.org/datasets/ds002837/versions/2.0.0), which shows similar differences:

pybids

{'version': '0.15.6.post0.dev97',
 'elapsed': 10.908372353000232,
 'size_mb': 46.956}

pybids w/ index_metadata=False

{'version': '0.15.6.post0.dev97',
 'elapsed': 6.4482560209999065,
 'size_mb': 1.376}

ancpbids

{'version': '0.2.2', 'elapsed': 1.459769660999882, 'size_mb': nan}

bids2table 1 core

{'version': '0.1.dev34+g886d184',
 'elapsed': 2.9782187279997743,
 'size_mb': 12.576}

bids2table 8 cores

{'version': '0.1.dev34+g886d184',
 'elapsed': 0.8461213739992672,
 'size_mb': 12.712}
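
For reference, a minimal sketch of the kind of timing harness that could produce result dicts like these. The `index_dataset` callable is a hypothetical stand-in for whichever library's indexing entry point is being benchmarked; this is not the actual benchmark script from the PR.

import time

def benchmark(index_dataset, dataset_path, version):
    # Time a single indexing pass with a monotonic clock.
    start = time.perf_counter()
    index_dataset(dataset_path)
    elapsed = time.perf_counter() - start
    # size_mb would be measured from the serialized index on disk;
    # left as nan here since that step is library-specific.
    return {"version": version, "elapsed": elapsed, "size_mb": float("nan")}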

Interesting that ancpbids is faster w/ one core, but I'm guessing it's because it doesn't read the JSON sidecars.
I'm also guessing that pybids is slowed down by a combination of SqlAlchemy overhead + general code inefficiencies which I want to dig into.

Given that you've set this benchmark up, I would try it on several public datasets to get a better estimate on the performance diffs.

@adelavega

Other feedback on bids2table:

  • the pandas API is not as familiar as you might expect, especially using the multiindex. I would beef up the docs to show some very common use cases (e.g. how do I get the TR for a particular subject; see the sketch after this list)
  • I wouldn't call the metadata "sidecar" once indexed, since it's actually a combination of various sidecar files.
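
As an illustration of the kind of recipe the docs could show, a hedged sketch, assuming df is the indexed dataframe. The ("entities", ...) and ("sidecar", ...) column names are assumptions about the multi-index layout, not confirmed API.

# Hypothetical sketch: look up the TR for one subject's BOLD scans.
mask = (df[("entities", "sub")] == "01") & (df[("entities", "suffix")] == "bold")
tr = df.loc[mask, ("sidecar", "RepetitionTime")].iloc[0]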

@adelavega

Regarding the query benchmarks:

  • Can you try querying for all subject ids for only a specific task? i.e. a subset operation and then a unique operation?
    Also, I'm surprised pybids is that bad, but maybe your benchmark dataset is highlighting some inefficiencies.
    On a smaller dataset (n=86) the difference is more respectable (although still quite big):
In [54]: %timeit layout.get_subjects()
29.3 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit b2t_df["sub"].unique().tolist()
113 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
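
For concreteness, the subset-then-unique query suggested above might look like this on the same flat dataframe (the task value "rest" is just an example):

# Subset to one task, then take unique subject ids. Assumes flat
# "task" and "sub" columns as in the b2t_df used above.
subs = b2t_df.loc[b2t_df["task"] == "rest", "sub"].unique().tolist()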

Interestingly, using a more efficient SQLAlchemy query I got this result in pybids (not currently implemented in master though):

In [68]: %timeit layout.session.query(Tag._value).filter_by(entity_name='subject').distinct().all()
2.93 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bids2table is still much faster!

Also: ancpbids does not support indexing on metadata, since it only loads metadata when needed rather than as part of the index.

@clane9 (Collaborator, Author) commented Aug 2, 2023

Thanks so much for the feedback!

Given that you've set this benchmark up, I would try it on several public datasets to get a better estimate on the performance diffs.

Totally agree. Filling out the benchmark with a few more datasets makes a lot of sense. Ideally with a range of sizes and on different machines. One or both of these factors could be part of the reason why ancpbids is faster than single-thread bids2table in your example.

pandas API is not as familar as you might expect

Totally. This is also feedback I've gotten from others in my group. I think the pandas API is pretty flexible, but also pretty complicated and not all that well known. We've been discussing implementing a higher-level pybids-like API on top. Perhaps following the proposed redesign. This would also open the door for a possible merger down the road, if there was interest.

I wouldn't call the metadata "sidecar" once indexed, since it's actually a combination of various sidecar files.

Ah, because of the inheritance? I'm considering just flattening out the fields in the sidecar column into their own columns, a la pd.json_normalize, and putting them in a general "metadata" column group.
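
Roughly, a sketch of that flattening, assuming the sidecar column currently holds one JSON dict per row (the "metadata__" prefix scheme is just an illustration):

import pandas as pd

# Expand nested sidecar dicts into flat columns, then group them
# under a "metadata" prefix. The column layout here is an assumption.
metadata = pd.json_normalize(df["sidecar"].tolist()).add_prefix("metadata__")
df = pd.concat([df.drop(columns="sidecar"), metadata], axis=1)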

Can you try querying for all subject ids for only a specific task? i.e. a subset operation and then a unique operaiton?
Also, I'm surprised pybids is that bad

Ya we were pretty surprised at how bad pybids was here. We chalked it up as an outlier. I'll try to dig into it more.

A side note on query performance: although pandas is more than good enough here, there are now even more optimized dataframe libraries (e.g. polars, duckdb), all of which interface well with Arrow/Parquet. So there should be room for even better performance.
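
For example, if the index is stored as Parquet, a duckdb query could read it directly. A minimal sketch; the file name "index.parquet" and the column names are hypothetical.

import duckdb

# Query the Parquet index without loading it into pandas first.
subs = duckdb.sql(
    "SELECT DISTINCT sub FROM 'index.parquet' WHERE task = 'rest'"
).fetchall()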

@adelavega

Ah, because of the inheritance? I'm considering just flattening out the fields in the sidecar column into their own columns, a la pd.json_normalize, and putting them in a general "metadata" column group.

+1 on this

Re: other libraries, I would argue querying is quite fast as it is, so I would err on the side of familiarity (pandas might be complex, but at least it's familiar).

Although we could consider another db for the redesigned API project.

@adelavega

To expand a bit: I think we should focus on optimizing indexing time more than querying time.

As an example, PyBIDS has some unacceptably slow queries, but the worst one is 1.2 s; if we used SQLAlchemy more efficiently, it would be an order of magnitude faster, i.e. 0.12 s.

That tells me that any of these solutions will be performant enough, if the translation between the high level API and the low-level querying language is done properly (which is the main problem in PyBIDS).

Obviously bids2table is orders of magnitude faster, which is cool and useful, but it's just to say that once performance clears a certain floor, we should use other heuristics to guide us.

Where PyBIDS really struggled is the indexing time, and that's where we got most complaints. So I see that as bids2table's biggest contribution.

Let's keep this in mind when building a high-level API, because sometimes the most difficult thing is mapping that easy-to-use query language onto the backend in a way that performs.

clane9 added a commit that referenced this pull request Aug 4, 2023
The multi-index columns have been clumsy for some folks. Now return a dataframe with flat columns and a string prefix indicating the column group.

It is still possible to convert to a hierarchical index, select a particular group, or drop the group level in post-processing. This usage is shown in the example.

Other changes:
- Rename sidecar -> json, following the suggestion in (#14).
- By default, search for inherited metadata until a `dataset_description.json` is found, following the suggestion in (#15).
- Fix `incremental` mode by specifying correct path and modified time column names.
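
A sketch of the post-processing conversion back to a hierarchical index mentioned above. The "__" separator is an assumption about the prefix scheme, not the confirmed convention.

import pandas as pd

# Split "group__column" names back into a two-level MultiIndex.
# Assumes every column carries a group prefix separated by "__".
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split("__", 1)) for c in df.columns]
)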
@clane9 (Collaborator, Author) commented Aug 9, 2023

Merging even though there are still a few improvements to the benchmarks needed:

  • More datasets compared
  • Debug slow PyBIDS performance on the get subjects query
  • Refactor to make the full benchmark suite easy to run for any dataset

These will be addressed in future PRs.

@clane9 clane9 merged commit 0659536 into main Aug 9, 2023
@clane9 clane9 deleted the feat/benchmark branch August 9, 2023 07:40