Add benchmarks comparing [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and bids2table for indexing and querying large-ish datasets.
Showing 14 changed files with 1,109 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Indexing benchmark | ||
|
||
In this benchmark we compare the indexing performance for [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and [bids2table](https://github.com/cmi-dair/bids2table) using the raw data from the [NKI Rockland Sample](https://fcon_1000.projects.nitrc.org/indi/pro/nki.html) (1334 subjects, 1.5TB). | ||
|
||
## Setup environment | ||
|
||
First install each library in a separate environment using the [setup_environments.sh](setup_environments.sh) script. The main libraries are pinned to the following versions. | ||
|
||
- PyBIDS: 0.16.0 | ||
- ancpBIDS: 0.2.2 | ||
- bids2table: 0.1.dev29+gb7b1658 | ||
|
||
## Run benchmark | ||
|
||
The [benchmark_indexing.py](benchmark_indexing.py) script runs the benchmark. We ran the benchmark on the [PSC bridges2](https://www.psc.edu/resources/bridges-2/) HPC system. The SLURM job scripts for each library are included in this directory. | ||
|
||
## Results | ||
|
||
We evaluated indexing performance in terms of run time and index size on disk. We measured single-threaded performance for each method, as well as the parallel speedup for bids2table. We report the median over 10 runs. | ||
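
As a rough illustration of how such medians could be computed, the sketch below groups result files by method and worker count. It assumes one JSON file per trial with the `method`, `workers`, `elapsed`, and `size_mb` keys written by `benchmark_indexing.py`; the `summarize` helper itself is hypothetical and not part of this commit.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import median


def summarize(results_dir):
    """Return {(method, workers): (median elapsed seconds, median size in MB)}."""
    groups = defaultdict(list)
    for path in sorted(Path(results_dir).glob("*.json")):
        record = json.loads(path.read_text())
        groups[(record["method"], record["workers"])].append(record)

    summary = {}
    for key, records in groups.items():
        summary[key] = (
            median(r["elapsed"] for r in records),
            # size_mb is NaN for methods that keep the index in memory
            median(r["size_mb"] for r in records),
        )
    return summary
```

Calling `summarize("results")` would then give one row per method and worker count, provided the job scripts wrote their output under `results/`.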
|
||
In terms of single-threaded run time, bids2table is ~4x faster than PyBIDS and ~15% faster than ancpBIDS. In addition, the parallel speedup for bids2table is nearly linear. With 64 workers available, bids2table is ~150x faster than PyBIDS and ~35x faster than ancpBIDS. | ||
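
For readers who want to check how close a speedup is to linear, a minimal sketch of the two usual quantities follows. The function names are hypothetical and the timings in the usage note are made-up placeholders, not measurements from this benchmark.

```python
def speedup(baseline_seconds, seconds):
    """How many times faster `seconds` is than `baseline_seconds`."""
    return baseline_seconds / seconds


def parallel_efficiency(serial_seconds, parallel_seconds, workers):
    """Fraction of ideal linear speedup achieved with `workers` workers (1.0 is perfectly linear)."""
    return speedup(serial_seconds, parallel_seconds) / workers
```

For example, with placeholder timings of 640 s on one worker and 12.5 s on 64 workers, `parallel_efficiency(640.0, 12.5, 64)` gives the fraction of the ideal 64x speedup that was reached.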
|
||
In terms of index size, bids2table requires ~90x less disk space than PyBIDS, thanks to the efficient compression used by the [Parquet file format](https://parquet.apache.org/). The index size of ancpBIDS could not be compared, since it does not appear to support persisting the index to disk. | ||
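
The benchmark script measures index size by shelling out to `du -sk`. Where `du` is unavailable, a portable alternative might look like the sketch below. `dir_size_mb` is a hypothetical helper, and it sums apparent file sizes, so its result can differ somewhat from the block-based usage that `du` reports.

```python
from pathlib import Path


def dir_size_mb(path):
    """Sum the apparent sizes of all regular files under `path`, in MB (10**6 bytes)."""
    total_bytes = sum(
        p.stat().st_size for p in Path(path).rglob("*") if p.is_file()
    )
    return total_bytes / 1e6
```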
|
||
|
||
![Indexing performance](figures/indexing_performance.png) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,173 @@ | ||
""" | ||
A script to benchmark different methods for indexing BIDS datasets. | ||
""" | ||
|
||
import argparse | ||
import json | ||
import socket | ||
import subprocess | ||
import sys | ||
import tempfile | ||
import time | ||
from pathlib import Path | ||
|
||
try: | ||
    import bids | ||
|
||
    has_pybids = True | ||
except ModuleNotFoundError: | ||
    has_pybids = False | ||
|
||
try: | ||
    import ancpbids | ||
|
||
    has_ancpbids = True | ||
except ModuleNotFoundError: | ||
    has_ancpbids = False | ||
|
||
try: | ||
    import bids2table as b2t | ||
|
||
    has_bids2table = True | ||
except ModuleNotFoundError: | ||
    has_bids2table = False | ||
|
||
METHODS = ("pybids", "ancpbids", "bids2table") | ||
|
||
|
||
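def du(path): | ||
    """Return the disk usage of `path` in (approximate) MB, as reported by `du -sk`.""" | ||
    size_kb = subprocess.check_output(["du", "-sk", path]) | ||
    size_kb = int(size_kb.split()[0].decode("utf-8")) | ||
    # du -sk counts 1024-byte blocks, so dividing by 1000 is only approximately MB | ||
    size_mb = size_kb / 1000 | ||
    return size_mb | ||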
|
||
|
||
def benchmark_pybids(root: Path, workers: int): | ||
    assert has_pybids, "pybids not installed" | ||
    assert workers == 1, "pybids doesn't use multiple workers" | ||
    tic = time.monotonic() | ||
|
||
    with tempfile.TemporaryDirectory() as tmpdir: | ||
        indexer = bids.BIDSLayoutIndexer( | ||
            validate=False, | ||
            index_metadata=True, | ||
        ) | ||
        bids.BIDSLayout( | ||
            root=root, | ||
            validate=False, | ||
            absolute_paths=True, | ||
            derivatives=True, | ||
            database_path=Path(tmpdir) / "index.db", | ||
            indexer=indexer, | ||
        ) | ||
        size_mb = du(tmpdir) | ||
|
||
    elapsed = time.monotonic() - tic | ||
|
||
    result = { | ||
        "version": bids.__version__, | ||
        "elapsed": elapsed, | ||
        "size_mb": size_mb, | ||
    } | ||
    return result | ||
|
||
|
||
def benchmark_ancpbids(root: Path, workers: int): | ||
    assert has_ancpbids, "ancpbids not installed" | ||
    assert workers == 1, "ancpbids doesn't use multiple workers" | ||
    tic = time.monotonic() | ||
|
||
    ancpbids.load_dataset(root) | ||
    # ancpbids is purely in-memory | ||
    size_mb = float("nan") | ||
|
||
    elapsed = time.monotonic() - tic | ||
|
||
    result = { | ||
        "version": ancpbids.__version__, | ||
        "elapsed": elapsed, | ||
        "size_mb": size_mb, | ||
    } | ||
    return result | ||
|
||
|
||
def benchmark_bids2table(root: Path, workers: int): | ||
    assert has_bids2table, "bids2table not installed" | ||
    tic = time.monotonic() | ||
|
||
    with tempfile.TemporaryDirectory() as tmpdir: | ||
        b2t.bids2table( | ||
            root, | ||
            persistent=True, | ||
            output=Path(tmpdir) / "index.b2t", | ||
            workers=workers, | ||
            return_df=False, | ||
        ) | ||
        size_mb = du(tmpdir) | ||
|
||
    elapsed = time.monotonic() - tic | ||
    result = { | ||
        "version": b2t.__version__, | ||
        "elapsed": elapsed, | ||
        "size_mb": size_mb, | ||
    } | ||
    return result | ||
|
||
|
||
def main(): | ||
    parser = argparse.ArgumentParser() | ||
|
||
    parser.add_argument( | ||
        "method", | ||
        metavar="METHOD", | ||
        type=str, | ||
        choices=METHODS, | ||
        help="Indexing method", | ||
    ) | ||
    parser.add_argument( | ||
        "root", | ||
        metavar="ROOT", | ||
        type=Path, | ||
        help="Path to BIDS dataset", | ||
    ) | ||
    parser.add_argument( | ||
        "out", | ||
        metavar="OUTPUT", | ||
        type=Path, | ||
        help="Path to JSON output", | ||
    ) | ||
    parser.add_argument( | ||
        "--workers", | ||
        "-w", | ||
        metavar="COUNT", | ||
        type=int, | ||
        help="Number of worker processes", | ||
        default=1, | ||
    ) | ||
|
||
    args = parser.parse_args() | ||
|
||
    # Single if/elif chain: with a second bare `if`, "pybids" would fall through to the `else` and raise | ||
    if args.method == "pybids": | ||
        result = benchmark_pybids(args.root, args.workers) | ||
    elif args.method == "ancpbids": | ||
        result = benchmark_ancpbids(args.root, args.workers) | ||
    elif args.method == "bids2table": | ||
        result = benchmark_bids2table(args.root, args.workers) | ||
    else: | ||
        raise NotImplementedError(f"Indexing method {args.method} not implemented") | ||
|
||
    result = { | ||
        "bids_dir": str(args.root.absolute()), | ||
        "method": args.method, | ||
        "workers": args.workers, | ||
        "py_version": sys.version, | ||
        "hostname": socket.gethostname(), | ||
        **result, | ||
    } | ||
|
||
    with open(args.out, "w") as f: | ||
        print(json.dumps(result), file=f) | ||
|
||
|
||
if __name__ == "__main__": | ||
    main() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
#!/bin/bash | ||
|
||
#SBATCH --job-name=index_benchmark | ||
#SBATCH --partition=RM-shared | ||
#SBATCH --ntasks=4 | ||
#SBATCH --mem=8000 | ||
#SBATCH --time=04:00:00 | ||
#SBATCH --array=2-10 | ||
#SBATCH --account=med220004p | ||
|
||
# Activate conda | ||
eval "$(conda shell.bash hook)" | ||
|
||
method=ancpbids | ||
dataset="NKI-RS" | ||
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset" | ||
workers=1 | ||
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID) | ||
|
||
out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json" | ||
|
||
# Activate environment | ||
source envs/${method}/bin/activate | ||
which python | ||
|
||
echo python benchmark_indexing.py -w $workers $method $bids_dir $out | ||
python benchmark_indexing.py -w $workers $method $bids_dir $out |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
#!/bin/bash | ||
|
||
#SBATCH --job-name=index_benchmark | ||
#SBATCH --partition=RM-shared | ||
#SBATCH --time=04:00:00 | ||
#SBATCH --array=1 | ||
#SBATCH --account=med220004p | ||
|
||
# Activate conda | ||
eval "$(conda shell.bash hook)" | ||
|
||
method=bids2table | ||
dataset="NKI-RS" | ||
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset" | ||
workers=${SLURM_NTASKS} | ||
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID) | ||
|
||
out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json" | ||
|
||
# Activate environment | ||
source envs/${method}/bin/activate | ||
which python | ||
|
||
echo python benchmark_indexing.py -w $workers $method $bids_dir $out | ||
python benchmark_indexing.py -w $workers $method $bids_dir $out |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
#!/bin/bash | ||
|
||
#SBATCH --job-name=index_benchmark | ||
#SBATCH --partition=RM-shared | ||
#SBATCH --ntasks=4 | ||
#SBATCH --mem=8000 | ||
#SBATCH --time=04:00:00 | ||
#SBATCH --array=2-10 | ||
#SBATCH --account=med220004p | ||
|
||
# Activate conda | ||
eval "$(conda shell.bash hook)" | ||
|
||
method=pybids | ||
dataset="NKI-RS" | ||
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset" | ||
workers=1 | ||
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID) | ||
|
||
out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json" | ||
|
||
# Activate environment | ||
source envs/${method}/bin/activate | ||
which python | ||
|
||
echo python benchmark_indexing.py -w $workers $method $bids_dir $out | ||
python benchmark_indexing.py -w $workers $method $bids_dir $out |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
#!/bin/bash | ||
|
||
rm -r envs 2>/dev/null | ||
mkdir envs | ||
|
||
# Official pybids | ||
python3.10 -m venv envs/pybids && \ | ||
source envs/pybids/bin/activate && \ | ||
pip install -U pip && \ | ||
pip install pybids==0.16.0 && \ | ||
deactivate | ||
|
||
# ANCP-bids | ||
python3.10 -m venv envs/ancpbids && \ | ||
source envs/ancpbids/bin/activate && \ | ||
pip install -U pip && \ | ||
pip install ancpbids==0.2.2 && \ | ||
deactivate | ||
|
||
# bids2table | ||
python3.10 -m venv envs/bids2table && \ | ||
source envs/bids2table/bin/activate && \ | ||
pip install -U pip && \ | ||
pip install -U git+https://github.com/cmi-dair/elbow.git@3117427 && \ | ||
pip install -U git+https://github.com/cmi-dair/bids2table.git@b7b1658 && \ | ||
deactivate |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# Query benchmark | ||
|
||
In this benchmark we compare the query performance for [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and [bids2table](https://github.com/cmi-dair/bids2table). The queries are modeled after the [PyBIDS `BIDSLayout` tutorial](https://bids-standard.github.io/pybids/examples/pybids_tutorial.html#querying-the-bidslayout). | ||
|
||
For this benchmark, we use raw data from the [Chinese Color Nest Project](http://deepneuro.bnu.edu.cn/?p=163) (195 subjects, 2 resting state sessions per subject). | ||
|
||
## Setup environment | ||
|
||
We create a single environment that includes all three main libraries using the [setup_environments.sh](setup_environments.sh) script. The main libraries are pinned to the following versions. | ||
|
||
- PyBIDS: 0.16.0 | ||
- ancpBIDS: 0.2.2 | ||
- bids2table: 0.1.dev29+gb7b1658 | ||
|
||
## Run benchmark | ||
|
||
The [query_benchmark.ipynb](query_benchmark.ipynb) notebook runs the benchmark. We ran the benchmark on the [PSC bridges2](https://www.psc.edu/resources/bridges-2/) HPC system. | ||
|
||
## Results | ||
|
||
We compared the performance of the different indices on four queries: | ||
|
||
- Get subjects: Get a list of all unique subjects | ||
- Get BOLD: Get a list of all BOLD NIfTI image files | ||
- Query Metadata: Find scans with a specific value for a sidecar metadata field | ||
- Get morning scans: Find scans that were acquired before 10 AM | ||
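
To make the four queries concrete, here is a minimal pure-Python sketch over a toy in-memory index. The record layout and field names (`sub`, `suffix`, `ext`, `RepetitionTime`, `acq_time`) are illustrative assumptions, not the schema of any of the three libraries, and the notebook itself uses each library's own query API.

```python
from datetime import datetime

# Hypothetical flattened index: one record per file (all values are made up).
records = [
    {"sub": "01", "suffix": "bold", "ext": ".nii.gz",
     "RepetitionTime": 2.5, "acq_time": datetime(2020, 1, 1, 9, 30)},
    {"sub": "01", "suffix": "T1w", "ext": ".nii.gz",
     "RepetitionTime": 2.3, "acq_time": datetime(2020, 1, 1, 11, 0)},
    {"sub": "02", "suffix": "bold", "ext": ".nii.gz",
     "RepetitionTime": 2.0, "acq_time": datetime(2020, 1, 2, 14, 15)},
]


def get_subjects(records):
    """Get subjects: sorted list of unique subject labels."""
    return sorted({r["sub"] for r in records})


def get_bold(records):
    """Get BOLD: all BOLD NIfTI image files."""
    return [r for r in records
            if r["suffix"] == "bold" and r["ext"] in (".nii", ".nii.gz")]


def query_metadata(records, field, value):
    """Query metadata: scans whose sidecar field equals `value`."""
    return [r for r in records if r.get(field) == value]


def get_morning_scans(records, before_hour=10):
    """Get morning scans: scans acquired before `before_hour` o'clock."""
    return [r for r in records
            if r.get("acq_time") is not None and r["acq_time"].hour < before_hour]
```

Each query is a single pass over a flat table of records, which is the kind of access pattern a tabular index is well suited to.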
|
||
Below is a summary table of the query run times in milliseconds. On every query, bids2table is more than 20x faster than both PyBIDS and ancpBIDS (for the queries where ancpBIDS results are available). | ||
|
||
| Index | Get subjects (ms) | Get BOLD (ms) | Query metadata (ms) | Get morning scans (ms) | | ||
| -- | -- | -- | -- | -- | | ||
| PyBIDS | 1350 | 12.3 | 6.53 | 34.3 | | ||
| ancpBIDS | 30.6 | 19.2 | -- | -- | | ||
| bids2table | **0.046** | **0.346** | **0.312** | **0.352** | | ||
|
||
Note that ancpBIDS is missing values for the two queries that require accessing the sidecar metadata. It's possible that ancpBIDS supports these queries, but [looking at the documentation](https://ancpbids.readthedocs.io/en/latest/advancedQueries.html), it wasn't obvious to me how. |
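
The speedup claim can be checked directly from the table. The sketch below recomputes the per-query ratios from the reported run times; the values are copied from the table, the `speedups` helper is hypothetical, and the two missing ancpBIDS entries are simply omitted.

```python
# Run times in milliseconds, copied from the summary table.
timings_ms = {
    "PyBIDS": {"Get subjects": 1350, "Get BOLD": 12.3,
               "Query metadata": 6.53, "Get morning scans": 34.3},
    "ancpBIDS": {"Get subjects": 30.6, "Get BOLD": 19.2},
    "bids2table": {"Get subjects": 0.046, "Get BOLD": 0.346,
                   "Query metadata": 0.312, "Get morning scans": 0.352},
}


def speedups(timings, fast="bids2table"):
    """Per-query ratio of every other index's run time to the `fast` index's run time."""
    return {
        other: {query: t / timings[fast][query] for query, t in queries.items()}
        for other, queries in timings.items()
        if other != fast
    }


ratios = speedups(timings_ms)
# The smallest ratio is about 20.9 (PyBIDS, query metadata).
smallest = min(r for per_query in ratios.values() for r in per_query.values())
```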