
Add benchmarks #14

Merged
clane9 merged 7 commits into main from feat/benchmark on Aug 9, 2023
Conversation

@clane9 (Collaborator) commented Aug 1, 2023

Add benchmarks comparing PyBIDS, ancpBIDS, and bids2table for indexing and querying large-ish datasets.

codecov bot commented Aug 1, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (886d184) 90.70% compared to head (666c854) 90.70%.

❗ Current head 666c854 differs from pull request most recent head 1c0ac57. Consider uploading reports for the commit 1c0ac57 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #14   +/-   ##
=======================================
  Coverage   90.70%   90.70%           
=======================================
  Files          10       10           
  Lines         441      441           
=======================================
  Hits          400      400           
  Misses         41       41           


@clane9 (Collaborator, Author) commented Aug 1, 2023

Hey @adelavega, would love to get your thoughts on these benchmarks. Do you think the comparisons are fair? What do you think about the results (bids2table vs pybids: roughly 4x faster indexing, 150x faster with 64 workers in parallel, 90x less size on disk, 20x faster queries)?

@adelavega commented Aug 1, 2023

Overall looks good!

bids2table certainly has nice advantages:

  • if you're familiar with the pandas API, you can do arbitrary indexing, and it is more performant
  • faster indexing than PyBIDS
  • lower memory footprint
  • can parallelize (which makes the differences bigger)

I tried it out on this dataset (https://openneuro.org/datasets/ds002837/versions/2.0.0), which shows similar differences:

pybids

{'version': '0.15.6.post0.dev97',
 'elapsed': 10.908372353000232,
 'size_mb': 46.956}

pybids w/ index_metadata=False

{'version': '0.15.6.post0.dev97',
 'elapsed': 6.4482560209999065,
 'size_mb': 1.376}

ancpbids

{'version': '0.2.2', 'elapsed': 1.459769660999882, 'size_mb': nan}

bids2table 1 core

{'version': '0.1.dev34+g886d184',
 'elapsed': 2.9782187279997743,
 'size_mb': 12.576}

bids2table 8 cores

{'version': '0.1.dev34+g886d184',
 'elapsed': 0.8461213739992672,
 'size_mb': 12.712}
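
For reference, a minimal sketch of the kind of timing harness that could produce result dicts like these. The `index_dataset` callable is a hypothetical stand-in for whichever library's indexing entry point is being benchmarked; this is not the actual benchmark script from the PR.

import time

def benchmark(index_dataset, dataset_path, version):
    # Time a single indexing pass with a monotonic clock.
    start = time.perf_counter()
    index_dataset(dataset_path)
    elapsed = time.perf_counter() - start
    # size_mb would be measured from the serialized index on disk;
    # left as nan here since that step is library-specific.
    return {"version": version, "elapsed": elapsed, "size_mb": float("nan")}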

Interesting that ancpbids is faster w/ one core, but I'm guessing it's because it doesn't read the JSON sidecars.
I'm also guessing that pybids is slowed down by a combination of SqlAlchemy overhead + general code inefficiencies which I want to dig into.

Given that you've set this benchmark up, I would try it on several public datasets to get a better estimate on the performance diffs.

@adelavega

Other feedback on bids2table:

  • the pandas API is not as familiar as you might expect, especially using the multiindex. I would beef up the docs to show some very common use cases (e.g. how do I get the TR for a particular subject; see the sketch after this list)
  • I wouldn't call the metadata "sidecar" once indexed, since it's actually a combination of various sidecar files.
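
As an illustration of the kind of recipe the docs could show, a hedged sketch, assuming df is the indexed dataframe. The ("entities", ...) and ("sidecar", ...) column names are assumptions about the multi-index layout, not confirmed API.

# Hypothetical sketch: look up the TR for one subject's BOLD scans.
mask = (df[("entities", "sub")] == "01") & (df[("entities", "suffix")] == "bold")
tr = df.loc[mask, ("sidecar", "RepetitionTime")].iloc[0]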

@adelavega

Regarding the query benchmarks:

  • Can you try querying for all subject ids for only a specific task? i.e. a subset operation and then a unique operation?
    Also, I'm surprised pybids is that bad, but maybe your benchmark dataset is highlighting some inefficiencies.
    On a smaller dataset (n=86) the difference is more respectable (although still quite big):
In [54]: %timeit layout.get_subjects()
29.3 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit b2t_df["sub"].unique().tolist()
113 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
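
For concreteness, the subset-then-unique query suggested above might look like this on the same flat dataframe (the task value "rest" is just an example):

# Subset to one task, then take unique subject ids. Assumes flat
# "task" and "sub" columns as in the b2t_df used above.
subs = b2t_df.loc[b2t_df["task"] == "rest", "sub"].unique().tolist()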

Interestingly, using a more efficient SQLAlchemy query I got this result in pybids (not currently implemented in master though):

In [68]: %timeit layout.session.query(Tag._value).filter_by(entity_name='subject').distinct().all()
2.93 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bids2table is still much faster!

Also: ancpbids does not support indexing on metadata, since it only loads metadata when needed rather than as part of the index.

@clane9 (Collaborator, Author) commented Aug 2, 2023

Thanks so much for the feedback!

Given that you've set this benchmark up, I would try it on several public datasets to get a better estimate on the performance diffs.

Totally agree. Filling out the benchmark with a few more datasets makes a lot of sense. Ideally with a range of sizes and on different machines. One or both of these factors could be part of the reason why ancpbids is faster than single-thread bids2table in your example.

pandas API is not as familar as you might expect

Totally. This is also feedback I've gotten from others in my group. I think the pandas API is pretty flexible, but also pretty complicated and not all that well known. We've been discussing implementing a higher-level pybids-like API on top. Perhaps following the proposed redesign. This would also open the door for a possible merger down the road, if there was interest.

I wouldn't call the metadata "sidecar" once indexed, since it's actually a combination of various sidecar files.

Ah, because of the inheritance? I'm considering just flattening out the fields in the sidecar column into their own columns, a la pd.json_normalize, and putting them in a general "metadata" column group.
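
Roughly, a sketch of that flattening, assuming the sidecar column currently holds one JSON dict per row (the "metadata__" prefix scheme is just an illustration):

import pandas as pd

# Expand nested sidecar dicts into flat columns, then group them
# under a "metadata" prefix. The column layout here is an assumption.
metadata = pd.json_normalize(df["sidecar"].tolist()).add_prefix("metadata__")
df = pd.concat([df.drop(columns="sidecar"), metadata], axis=1)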

Can you try querying for all subject ids for only a specific task? i.e. a subset operation and then a unique operaiton?
Also, I'm surprised pybids is that bad

Ya we were pretty surprised at how bad pybids was here. We chalked it up as an outlier. I'll try to dig into it more.

A side note on query performance: although pandas is more than good enough here, there are now even more optimized dataframe libraries (e.g. polars, duckdb), all of which interface well with Arrow/Parquet. So there should be room for even better performance.
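
For example, if the index is stored as Parquet, a duckdb query could read it directly. A minimal sketch; the file name "index.parquet" and the column names are hypothetical.

import duckdb

# Query the Parquet index without loading it into pandas first.
subs = duckdb.sql(
    "SELECT DISTINCT sub FROM 'index.parquet' WHERE task = 'rest'"
).fetchall()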

@adelavega

Ah, because of the inheritance? I'm considering just flattening out the fields in the sidecar column into their own columns, a la pd.json_normalize, and putting them in a general "metadata" column group.

+1 on this

Re: other libraries, I would argue querying is quite fast as it is, so I would err on the side of familiarity (pandas might be complex, but at least it's familiar).

Although we could consider another db for the redesigned API project.

@adelavega

To expand a bit: I think we should focus on optimizing indexing time more than querying time.

As an example, PyBIDS has some unacceptably slow queries, but the worst one is 1.2 s; if we used SQLAlchemy more efficiently, it would be an order of magnitude faster, i.e. 0.12 s.

That tells me that any of these solutions will be performant enough, if the translation between the high level API and the low-level querying language is done properly (which is the main problem in PyBIDS).

Obviously bids2table is orders of magnitude faster, which is cool and useful, but it's just to say that once performance clears a certain floor, we should use other heuristics to guide us.

Where PyBIDS really struggled is the indexing time, and that's where we got most complaints. So I see that as bids2table's biggest contribution.

Let's keep this in mind when building a high-level API, because sometimes the most difficult thing is mapping that easy-to-use query language onto the backend in a way that performs.

clane9 added a commit that referenced this pull request Aug 4, 2023
The multi-index columns have been clumsy for some folks. Now return a dataframe with flat columns and a string prefix indicating the column group.

It is still possible to convert to a hierarchical index, select a particular group, or drop the group level in post-processing. This usage is shown in the example.

Other changes:
- Rename sidecar -> json, following the suggestion in (#14).
- By default, search for inherited metadata until a `dataset_description.json` is found, following the suggestion in (#15).
- Fix `incremental` mode by specifying correct path and modified time column names.
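
A sketch of the post-processing conversion back to a hierarchical index mentioned above. The "__" separator is an assumption about the prefix scheme, not the confirmed convention.

import pandas as pd

# Split "group__column" names back into a two-level MultiIndex.
# Assumes every column carries a group prefix separated by "__".
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split("__", 1)) for c in df.columns]
)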
@clane9 (Collaborator, Author) commented Aug 9, 2023

Merging even though there are still a few improvements to the benchmarks needed:

  • More datasets compared
  • Debug slow PyBIDS performance on the get subjects query
  • Refactor to make the full benchmark suite easy to run for any dataset

These will be addressed in future PRs.

@clane9 clane9 merged commit 0659536 into main Aug 9, 2023
@clane9 clane9 deleted the feat/benchmark branch August 9, 2023 07:40