Add benchmarks (#14)
Add benchmarks comparing [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and bids2table for indexing and querying large-ish datasets.
clane9 authored Aug 9, 2023
1 parent b4fa239 commit 0659536
Showing 14 changed files with 1,109 additions and 2 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -34,3 +34,9 @@ htmlcov

# Example generated tables
example/tables

# Benchmark data
benchmark/**/envs
benchmark/**/results
benchmark/**/indexes
slurm-*
31 changes: 29 additions & 2 deletions README.md
@@ -1,8 +1,8 @@
# bids2table

bids2table is a lightweight tool to index large-scale BIDS neuroimaging datasets and derivatives. It is similar to [PyBIDS](https://github.com/bids-standard/pybids), but focused narrowly on just efficient index building.
bids2table is a library for efficiently indexing and querying large-scale BIDS neuroimaging datasets and derivatives. It aims to improve upon the efficiency of [PyBIDS](https://github.com/bids-standard/pybids) by leveraging modern data science tools.

bids2table represents a BIDS dataset index as a simple table with columns for BIDS entities and file metadata. The index is stored in [Parquet](https://parquet.apache.org/) format, a binary tabular file format optimized for efficient storage and retrieval. bids2table is built with [elbow](https://github.com/cmi-dair/elbow).
bids2table represents a BIDS dataset index as a single table with columns for BIDS entities and file metadata. The index is constructed using [Arrow](https://arrow.apache.org/) and stored in [Parquet](https://parquet.apache.org/) format, a binary tabular file format optimized for efficient storage and retrieval.

## Installation

@@ -28,3 +28,30 @@ df = bids2table("/path/to/dataset", persistent=True, workers=8)
```

See [here](example/example.ipynb) for a more complete example.
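Because the index is just a table, ordinary DataFrame operations stand in for BIDS queries. A minimal sketch with a toy stand-in for the index (the column names here are illustrative, not bids2table's exact schema):

```python
import pandas as pd

# Toy stand-in for the index: one row per file, with columns for BIDS
# entities. Real column names and dtypes may differ.
index = pd.DataFrame(
    {
        "sub": ["01", "01", "02", "02"],
        "ses": ["1", "2", "1", "2"],
        "suffix": ["bold", "bold", "T1w", "bold"],
        "file_path": [
            "sub-01/ses-1/func/sub-01_ses-1_bold.nii.gz",
            "sub-01/ses-2/func/sub-01_ses-2_bold.nii.gz",
            "sub-02/ses-1/anat/sub-02_ses-1_T1w.nii.gz",
            "sub-02/ses-2/func/sub-02_ses-2_bold.nii.gz",
        ],
    }
)

# Plain pandas filtering takes the place of a BIDS query.
bold = index[index["suffix"] == "bold"]        # 3 rows
subjects = sorted(index["sub"].unique())       # ['01', '02']
```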

## Performance

bids2table significantly outperforms both [PyBIDS](https://github.com/bids-standard/pybids) and [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids) in terms of indexing run time, index size on disk, and query run time.

### Indexing performance

Indexing run time and index size on disk for the [NKI Rockland Sample](https://fcon_1000.projects.nitrc.org/indi/pro/nki.html) dataset. See the [indexing benchmark](benchmark/indexing) for more details.

| Index | Num workers | Run time (s) | Index size (MB) |
| -- | -- | -- | -- |
| PyBIDS | 1 | 1618 | 448 |
| ancpBIDS | 1 | 465 | -- |
| bids2table | 1 | 402 | 4.02 |
| bids2table | 8 | 53.2 | **3.84** |
| bids2table | 64 | **10.7** | 4.82 |


### Query performance

Query run times for the [Chinese Color Nest Project](http://deepneuro.bnu.edu.cn/?p=163) dataset. See the [query benchmark](benchmark/query) for more details.

| Index | Get subjects (ms) | Get BOLD (ms) | Query metadata (ms) | Get morning scans (ms) |
| -- | -- | -- | -- | -- |
| PyBIDS | 1350 | 12.3 | 6.53 | 34.3 |
| ancpBIDS | 30.6 | 19.2 | -- | -- |
| bids2table | **0.046** | **0.346** | **0.312** | **0.352** |
26 changes: 26 additions & 0 deletions benchmark/indexing/README.md
@@ -0,0 +1,26 @@
# Indexing benchmark

In this benchmark we compare the indexing performance for [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and [bids2table](https://github.com/cmi-dair/bids2table) using the raw data from the [NKI Rockland Sample](https://fcon_1000.projects.nitrc.org/indi/pro/nki.html) (1334 subjects, 1.5TB).

## Setup environment

First install each library in a separate environment using the [setup_environments.sh](setup_environments.sh) script. The main libraries are pinned to the following versions.

- PyBIDS: 0.16.0
- ancpBIDS: 0.2.2
- bids2table: 0.1.dev29+gb7b1658

## Run benchmark

The [benchmark_indexing.py](benchmark_indexing.py) script runs the benchmark. We ran the benchmark on the [PSC bridges2](https://www.psc.edu/resources/bridges-2/) HPC system. The SLURM job scripts for each library are included in this directory.

## Results

We evaluated indexing performance in terms of run time and index size on disk, measuring single-threaded performance for each method as well as the parallel speedup for bids2table. We report the median over 10 runs.
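The median-over-runs protocol is simple to reproduce. A stdlib-only sketch of the timing harness (not the exact benchmark code, which lives in [benchmark_indexing.py](benchmark_indexing.py)):

```python
import statistics
import time


def median_runtime(fn, n_runs=10):
    """Run fn n_runs times and return the median wall-clock time in seconds."""
    times = []
    for _ in range(n_runs):
        tic = time.monotonic()
        fn()
        times.append(time.monotonic() - tic)
    return statistics.median(times)


# Example with a trivial stand-in workload.
elapsed = median_runtime(lambda: sum(range(10_000)))
```

The median is preferred over the mean here since it is robust to occasional slow runs, e.g. from filesystem cache misses on a shared HPC system.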

In terms of single-threaded run time, bids2table is ~4x faster than PyBIDS and ~15% faster than ancpBIDS. In addition, the parallel speedup for bids2table is nearly linear. With 64 workers available, bids2table is ~150x faster than PyBIDS and ~35x faster than ancpBIDS.

In terms of index size, bids2table requires ~90x less disk space than PyBIDS, thanks to the efficient compression used by the [Parquet file format](https://parquet.apache.org/). The index size of ancpBIDS could not be compared, since it does not appear to support persisting the index to disk.


![Indexing performance](figures/indexing_performance.png)
244 changes: 244 additions & 0 deletions benchmark/indexing/analysis.ipynb


173 changes: 173 additions & 0 deletions benchmark/indexing/benchmark_indexing.py
@@ -0,0 +1,173 @@
"""
A script to benchmark different methods for indexing BIDS datasets.
"""

import argparse
import json
import socket
import subprocess
import sys
import tempfile
import time
from pathlib import Path

try:
import bids

has_pybids = True
except ModuleNotFoundError:
has_pybids = False

try:
import ancpbids

has_ancpbids = True
except ModuleNotFoundError:
has_ancpbids = False

try:
import bids2table as b2t

has_bids2table = True
except ModuleNotFoundError:
has_bids2table = False

METHODS = ("pybids", "ancpbids", "bids2table")


def du(path):
    """Total size of path in MB, parsed from `du -sk` output."""
    out = subprocess.check_output(["du", "-sk", path])
    size_kb = int(out.split()[0])
    size_mb = size_kb / 1000
    return size_mb


def benchmark_pybids(root: Path, workers: int):
assert has_pybids, "pybids not installed"
assert workers == 1, "pybids doesn't use multiple workers"
tic = time.monotonic()

with tempfile.TemporaryDirectory() as tmpdir:
indexer = bids.BIDSLayoutIndexer(
validate=False,
index_metadata=True,
)
bids.BIDSLayout(
root=root,
validate=False,
absolute_paths=True,
derivatives=True,
database_path=Path(tmpdir) / "index.db",
indexer=indexer,
)
size_mb = du(tmpdir)

elapsed = time.monotonic() - tic

result = {
"version": bids.__version__,
"elapsed": elapsed,
"size_mb": size_mb,
}
return result


def benchmark_ancpbids(root: Path, workers: int):
assert has_ancpbids, "ancpbids not installed"
assert workers == 1, "ancpbids doesn't use multiple workers"
tic = time.monotonic()

ancpbids.load_dataset(root)
# ancpbids is purely in-memory
size_mb = float("nan")

elapsed = time.monotonic() - tic

result = {
"version": ancpbids.__version__,
"elapsed": elapsed,
"size_mb": size_mb,
}
return result


def benchmark_bids2table(root: Path, workers: int):
assert has_bids2table, "bids2table not installed"
tic = time.monotonic()

with tempfile.TemporaryDirectory() as tmpdir:
b2t.bids2table(
root,
persistent=True,
output=Path(tmpdir) / "index.b2t",
workers=workers,
return_df=False,
)
size_mb = du(tmpdir)

elapsed = time.monotonic() - tic
result = {
"version": b2t.__version__,
"elapsed": elapsed,
"size_mb": size_mb,
}
return result


def main():
parser = argparse.ArgumentParser()

parser.add_argument(
"method",
metavar="METHOD",
type=str,
choices=METHODS,
help="Indexing method",
)
parser.add_argument(
"root",
metavar="ROOT",
type=Path,
help="Path to BIDS dataset",
)
parser.add_argument(
"out",
metavar="OUTPUT",
type=Path,
help="Path to JSON output",
)
parser.add_argument(
"--workers",
"-w",
metavar="COUNT",
type=int,
help="Number of worker processes",
default=1,
)

args = parser.parse_args()

    if args.method == "pybids":
        result = benchmark_pybids(args.root, args.workers)
    elif args.method == "ancpbids":
result = benchmark_ancpbids(args.root, args.workers)
elif args.method == "bids2table":
result = benchmark_bids2table(args.root, args.workers)
else:
raise NotImplementedError(f"Indexing method {args.method} not implemented")

result = {
"bids_dir": str(args.root.absolute()),
"method": args.method,
"workers": args.workers,
"py_version": sys.version,
"hostname": socket.gethostname(),
**result,
}

with open(args.out, "w") as f:
print(json.dumps(result), file=f)


if __name__ == "__main__":
main()
27 changes: 27 additions & 0 deletions benchmark/indexing/nki_ancpbids_batch_array.sh
@@ -0,0 +1,27 @@
#!/bin/bash

#SBATCH --job-name=index_benchmark
#SBATCH --partition=RM-shared
#SBATCH --ntasks=4
#SBATCH --mem=8000
#SBATCH --time=04:00:00
#SBATCH --array=2-10
#SBATCH --account=med220004p

# Activate conda
eval "$(conda shell.bash hook)"

method=ancpbids
dataset="NKI-RS"
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset"
workers=1
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID)

out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json"

# Activate environment
source envs/${method}/bin/activate
which python

echo python benchmark_indexing.py -w $workers $method $bids_dir $out
python benchmark_indexing.py -w $workers $method $bids_dir $out
25 changes: 25 additions & 0 deletions benchmark/indexing/nki_bids2table_batch_array.sh
@@ -0,0 +1,25 @@
#!/bin/bash

#SBATCH --job-name=index_benchmark
#SBATCH --partition=RM-shared
#SBATCH --time=04:00:00
#SBATCH --array=1
#SBATCH --account=med220004p

# Activate conda
eval "$(conda shell.bash hook)"

method=bids2table
dataset="NKI-RS"
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset"
workers=${SLURM_NTASKS}
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID)

out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json"

# Activate environment
source envs/${method}/bin/activate
which python

echo python benchmark_indexing.py -w $workers $method $bids_dir $out
python benchmark_indexing.py -w $workers $method $bids_dir $out
27 changes: 27 additions & 0 deletions benchmark/indexing/nki_pybids_batch_array.sh
@@ -0,0 +1,27 @@
#!/bin/bash

#SBATCH --job-name=index_benchmark
#SBATCH --partition=RM-shared
#SBATCH --ntasks=4
#SBATCH --mem=8000
#SBATCH --time=04:00:00
#SBATCH --array=2-10
#SBATCH --account=med220004p

# Activate conda
eval "$(conda shell.bash hook)"

method=pybids
dataset="NKI-RS"
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset"
workers=1
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID)

out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json"

# Activate environment
source envs/${method}/bin/activate
which python

echo python benchmark_indexing.py -w $workers $method $bids_dir $out
python benchmark_indexing.py -w $workers $method $bids_dir $out
26 changes: 26 additions & 0 deletions benchmark/indexing/setup_environments.sh
@@ -0,0 +1,26 @@
#!/bin/bash

rm -r envs 2>/dev/null
mkdir envs

# Official pybids
python3.10 -m venv envs/pybids && \
source envs/pybids/bin/activate && \
pip install -U pip && \
pip install pybids==0.16.0 && \
deactivate

# ANCP-bids
python3.10 -m venv envs/ancpbids && \
source envs/ancpbids/bin/activate && \
pip install -U pip && \
pip install ancpbids==0.2.2 && \
deactivate

# bids2table
python3.10 -m venv envs/bids2table && \
source envs/bids2table/bin/activate && \
pip install -U pip && \
pip install -U git+https://github.com/cmi-dair/elbow.git@3117427 && \
pip install -U git+https://github.com/cmi-dair/bids2table.git@b7b1658 && \
deactivate
36 changes: 36 additions & 0 deletions benchmark/query/README.md
@@ -0,0 +1,36 @@
# Query benchmark

In this benchmark we compare the query performance for [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and [bids2table](https://github.com/cmi-dair/bids2table). The queries are modeled after the [PyBIDS `BIDSLayout` tutorial](https://bids-standard.github.io/pybids/examples/pybids_tutorial.html#querying-the-bidslayout).

For this benchmark, we use raw data from the [Chinese Color Nest Project](http://deepneuro.bnu.edu.cn/?p=163) (195 subjects, 2 resting state sessions per subject).

## Setup environment

We create a single environment that includes all three main libraries using the [setup_environments.sh](setup_environments.sh) script. The main libraries are pinned to the following versions.

- PyBIDS: 0.16.0
- ancpBIDS: 0.2.2
- bids2table: 0.1.dev29+gb7b1658

## Run benchmark

The [query_benchmark.ipynb](query_benchmark.ipynb) notebook runs the benchmark. We ran the benchmark on the [PSC bridges2](https://www.psc.edu/resources/bridges-2/) HPC system.

## Results

We compared the performance of the different indices on four queries:

- Get subjects: Get a list of all unique subjects
- Get BOLD: Get a list of all BOLD Nifti image files
- Query metadata: Find scans with a specific value for a sidecar metadata field
- Get morning scans: Find scans that were acquired before 10 AM
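Against a tabular index, all four queries reduce to ordinary DataFrame filters. A toy sketch (column names are illustrative, not bids2table's exact schema):

```python
import pandas as pd

# Toy index with one row per file; sidecar metadata fields appear as
# ordinary columns. Real column names in bids2table may differ.
index = pd.DataFrame(
    {
        "sub": ["01", "01", "02"],
        "suffix": ["bold", "T1w", "bold"],
        "RepetitionTime": [2.5, None, 2.0],
        "AcquisitionTime": ["09:15:00", "11:30:00", "08:45:00"],
    }
)

subjects = sorted(index["sub"].unique())                # get subjects
bold = index[index["suffix"] == "bold"]                 # get BOLD files
tr_match = index[index["RepetitionTime"] == 2.0]        # query metadata
# Zero-padded HH:MM:SS strings compare correctly lexicographically.
morning = index[index["AcquisitionTime"] < "10:00:00"]  # scans before 10 AM
```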

Below is a summary table of the query run times in milliseconds. We find that bids2table is >20x faster than PyBIDS and ancpBIDS.

| Index | Get subjects (ms) | Get BOLD (ms) | Query metadata (ms) | Get morning scans (ms) |
| -- | -- | -- | -- | -- |
| PyBIDS | 1350 | 12.3 | 6.53 | 34.3 |
| ancpBIDS | 30.6 | 19.2 | -- | -- |
| bids2table | **0.046** | **0.346** | **0.312** | **0.352** |

Note that ancpBIDS is missing values for the two queries that require accessing the sidecar metadata. It's possible that ancpBIDS supports these queries, but [looking at the documentation](https://ancpbids.readthedocs.io/en/latest/advancedQueries.html), it wasn't obvious to me how.