Add benchmarks (#14)
Add benchmarks comparing [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and bids2table for indexing and querying large-ish datasets.
clane9 authored Aug 9, 2023
1 parent b4fa239 commit 0659536
Showing 14 changed files with 1,109 additions and 2 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -34,3 +34,9 @@ htmlcov

# Example generated tables
example/tables

# Benchmark data
benchmark/**/envs
benchmark/**/results
benchmark/**/indexes
slurm-*
31 changes: 29 additions & 2 deletions README.md
@@ -1,8 +1,8 @@
# bids2table

bids2table is a lightweight tool to index large-scale BIDS neuroimaging datasets and derivatives. It is similar to [PyBIDS](https://github.com/bids-standard/pybids), but focused narrowly on just efficient index building.
bids2table is a library for efficiently indexing and querying large-scale BIDS neuroimaging datasets and derivatives. It aims to improve upon the efficiency of [PyBIDS](https://github.com/bids-standard/pybids) by leveraging modern data science tools.

bids2table represents a BIDS dataset index as a simple table with columns for BIDS entities and file metadata. The index is stored in [Parquet](https://parquet.apache.org/) format, a binary tabular file format optimized for efficient storage and retrieval. bids2table is built with [elbow](https://github.com/cmi-dair/elbow).
bids2table represents a BIDS dataset index as a single table with columns for BIDS entities and file metadata. The index is constructed using [Arrow](https://arrow.apache.org/) and stored in [Parquet](https://parquet.apache.org/) format, a binary tabular file format optimized for efficient storage and retrieval.

## Installation

@@ -28,3 +28,30 @@ df = bids2table("/path/to/dataset", persistent=True, workers=8)
```

See [here](example/example.ipynb) for a more complete example.
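Because the index is just a table, ordinary DataFrame operations stand in for BIDS queries. A minimal sketch with a toy stand-in for the index (the column names here are illustrative, not bids2table's exact schema):

```python
import pandas as pd

# Toy stand-in for the index: one row per file, with columns for BIDS
# entities. Real column names and dtypes may differ.
index = pd.DataFrame(
    {
        "sub": ["01", "01", "02", "02"],
        "ses": ["1", "2", "1", "2"],
        "suffix": ["bold", "bold", "T1w", "bold"],
        "file_path": [
            "sub-01/ses-1/func/sub-01_ses-1_bold.nii.gz",
            "sub-01/ses-2/func/sub-01_ses-2_bold.nii.gz",
            "sub-02/ses-1/anat/sub-02_ses-1_T1w.nii.gz",
            "sub-02/ses-2/func/sub-02_ses-2_bold.nii.gz",
        ],
    }
)

# Plain pandas filtering takes the place of a BIDS query.
bold = index[index["suffix"] == "bold"]        # 3 rows
subjects = sorted(index["sub"].unique())       # ['01', '02']
```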

## Performance

bids2table significantly outperforms both [PyBIDS](https://github.com/bids-standard/pybids) and [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids) in terms of indexing run time, index size on disk, and query run time.

### Indexing performance

Indexing run time and index size on disk for the [NKI Rockland Sample](https://fcon_1000.projects.nitrc.org/indi/pro/nki.html) dataset. See the [indexing benchmark](benchmark/indexing) for more details.

| Index | Num workers | Run time (s) | Index size (MB) |
| -- | -- | -- | -- |
| PyBIDS | 1 | 1618 | 448 |
| ancpBIDS | 1 | 465 | -- |
| bids2table | 1 | 402 | 4.02 |
| bids2table | 8 | 53.2 | **3.84** |
| bids2table | 64 | **10.7** | 4.82 |


### Query performance

Query run times for the [Chinese Color Nest Project](http://deepneuro.bnu.edu.cn/?p=163) dataset. See the [query benchmark](benchmark/query) for more details.

| Index | Get subjects (ms) | Get BOLD (ms) | Query metadata (ms) | Get morning scans (ms) |
| -- | -- | -- | -- | -- |
| PyBIDS | 1350 | 12.3 | 6.53 | 34.3 |
| ancpBIDS | 30.6 | 19.2 | -- | -- |
| bids2table | **0.046** | **0.346** | **0.312** | **0.352** |
26 changes: 26 additions & 0 deletions benchmark/indexing/README.md
@@ -0,0 +1,26 @@
# Indexing benchmark

In this benchmark we compare the indexing performance for [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and [bids2table](https://github.com/cmi-dair/bids2table) using the raw data from the [NKI Rockland Sample](https://fcon_1000.projects.nitrc.org/indi/pro/nki.html) (1334 subjects, 1.5TB).

## Setup environment

First install each library in a separate environment using the [setup_environments.sh](setup_environments.sh) script. The main libraries are pinned to the following versions.

- PyBIDS: 0.16.0
- ancpBIDS: 0.2.2
- bids2table: 0.1.dev29+gb7b1658

## Run benchmark

The [benchmark_indexing.py](benchmark_indexing.py) script runs the benchmark. We ran the benchmark on the [PSC bridges2](https://www.psc.edu/resources/bridges-2/) HPC system. The SLURM job scripts for each library are included in this directory.

## Results

We evaluated indexing performance in terms of run time and index size on disk, measuring single-threaded performance for each method as well as the parallel speedup for bids2table. We report the median over 10 runs.
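The median-over-runs protocol is simple to reproduce. A stdlib-only sketch of the timing harness (not the exact benchmark code, which lives in [benchmark_indexing.py](benchmark_indexing.py)):

```python
import statistics
import time


def median_runtime(fn, n_runs=10):
    """Run fn n_runs times and return the median wall-clock time in seconds."""
    times = []
    for _ in range(n_runs):
        tic = time.monotonic()
        fn()
        times.append(time.monotonic() - tic)
    return statistics.median(times)


# Example with a trivial stand-in workload.
elapsed = median_runtime(lambda: sum(range(10_000)))
```

The median is preferred over the mean here since it is robust to occasional slow runs, e.g. from filesystem cache misses on a shared HPC system.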

In terms of single-threaded run time, bids2table is ~4x faster than PyBIDS and ~15% faster than ancpBIDS. In addition, the parallel speedup for bids2table is nearly linear. With 64 workers available, bids2table is ~150x faster than PyBIDS and ~35x faster than ancpBIDS.

In terms of index size, bids2table requires ~90x less disk space than PyBIDS, thanks to the efficient compression used by the [Parquet file format](https://parquet.apache.org/). The index size of ancpBIDS could not be compared, since it does not appear to support persisting the index to disk.


![Indexing performance](figures/indexing_performance.png)
244 changes: 244 additions & 0 deletions benchmark/indexing/analysis.ipynb


173 changes: 173 additions & 0 deletions benchmark/indexing/benchmark_indexing.py
@@ -0,0 +1,173 @@
"""
A script to benchmark different methods for indexing BIDS datasets.
"""

import argparse
import json
import socket
import subprocess
import sys
import tempfile
import time
from pathlib import Path

try:
import bids

has_pybids = True
except ModuleNotFoundError:
has_pybids = False

try:
import ancpbids

has_ancpbids = True
except ModuleNotFoundError:
has_ancpbids = False

try:
import bids2table as b2t

has_bids2table = True
except ModuleNotFoundError:
has_bids2table = False

METHODS = ("pybids", "ancpbids", "bids2table")


def du(path):
    """Total size of path in MB, parsed from `du -sk` output."""
    out = subprocess.check_output(["du", "-sk", path])
    size_kb = int(out.split()[0])
    size_mb = size_kb / 1000
    return size_mb


def benchmark_pybids(root: Path, workers: int):
assert has_pybids, "pybids not installed"
assert workers == 1, "pybids doesn't use multiple workers"
tic = time.monotonic()

with tempfile.TemporaryDirectory() as tmpdir:
indexer = bids.BIDSLayoutIndexer(
validate=False,
index_metadata=True,
)
bids.BIDSLayout(
root=root,
validate=False,
absolute_paths=True,
derivatives=True,
database_path=Path(tmpdir) / "index.db",
indexer=indexer,
)
size_mb = du(tmpdir)

elapsed = time.monotonic() - tic

result = {
"version": bids.__version__,
"elapsed": elapsed,
"size_mb": size_mb,
}
return result


def benchmark_ancpbids(root: Path, workers: int):
assert has_ancpbids, "ancpbids not installed"
assert workers == 1, "ancpbids doesn't use multiple workers"
tic = time.monotonic()

ancpbids.load_dataset(root)
# ancpbids is purely in-memory
size_mb = float("nan")

elapsed = time.monotonic() - tic

result = {
"version": ancpbids.__version__,
"elapsed": elapsed,
"size_mb": size_mb,
}
return result


def benchmark_bids2table(root: Path, workers: int):
assert has_bids2table, "bids2table not installed"
tic = time.monotonic()

with tempfile.TemporaryDirectory() as tmpdir:
b2t.bids2table(
root,
persistent=True,
output=Path(tmpdir) / "index.b2t",
workers=workers,
return_df=False,
)
size_mb = du(tmpdir)

elapsed = time.monotonic() - tic
result = {
"version": b2t.__version__,
"elapsed": elapsed,
"size_mb": size_mb,
}
return result


def main():
parser = argparse.ArgumentParser()

parser.add_argument(
"method",
metavar="METHOD",
type=str,
choices=METHODS,
help="Indexing method",
)
parser.add_argument(
"root",
metavar="ROOT",
type=Path,
help="Path to BIDS dataset",
)
parser.add_argument(
"out",
metavar="OUTPUT",
type=Path,
help="Path to JSON output",
)
parser.add_argument(
"--workers",
"-w",
metavar="COUNT",
type=int,
help="Number of worker processes",
default=1,
)

args = parser.parse_args()

    if args.method == "pybids":
        result = benchmark_pybids(args.root, args.workers)
    elif args.method == "ancpbids":
result = benchmark_ancpbids(args.root, args.workers)
elif args.method == "bids2table":
result = benchmark_bids2table(args.root, args.workers)
else:
raise NotImplementedError(f"Indexing method {args.method} not implemented")

result = {
"bids_dir": str(args.root.absolute()),
"method": args.method,
"workers": args.workers,
"py_version": sys.version,
"hostname": socket.gethostname(),
**result,
}

with open(args.out, "w") as f:
print(json.dumps(result), file=f)


if __name__ == "__main__":
main()
27 changes: 27 additions & 0 deletions benchmark/indexing/nki_ancpbids_batch_array.sh
@@ -0,0 +1,27 @@
#!/bin/bash

#SBATCH --job-name=index_benchmark
#SBATCH --partition=RM-shared
#SBATCH --ntasks=4
#SBATCH --mem=8000
#SBATCH --time=04:00:00
#SBATCH --array=2-10
#SBATCH --account=med220004p

# Activate conda
eval "$(conda shell.bash hook)"

method=ancpbids
dataset="NKI-RS"
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset"
workers=1
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID)

out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json"

# Activate environment
source envs/${method}/bin/activate
which python

echo python benchmark_indexing.py -w $workers $method $bids_dir $out
python benchmark_indexing.py -w $workers $method $bids_dir $out
25 changes: 25 additions & 0 deletions benchmark/indexing/nki_bids2table_batch_array.sh
@@ -0,0 +1,25 @@
#!/bin/bash

#SBATCH --job-name=index_benchmark
#SBATCH --partition=RM-shared
#SBATCH --time=04:00:00
#SBATCH --array=1
#SBATCH --account=med220004p

# Activate conda
eval "$(conda shell.bash hook)"

method=bids2table
dataset="NKI-RS"
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset"
workers=${SLURM_NTASKS}
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID)

out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json"

# Activate environment
source envs/${method}/bin/activate
which python

echo python benchmark_indexing.py -w $workers $method $bids_dir $out
python benchmark_indexing.py -w $workers $method $bids_dir $out
27 changes: 27 additions & 0 deletions benchmark/indexing/nki_pybids_batch_array.sh
@@ -0,0 +1,27 @@
#!/bin/bash

#SBATCH --job-name=index_benchmark
#SBATCH --partition=RM-shared
#SBATCH --ntasks=4
#SBATCH --mem=8000
#SBATCH --time=04:00:00
#SBATCH --array=2-10
#SBATCH --account=med220004p

# Activate conda
eval "$(conda shell.bash hook)"

method=pybids
dataset="NKI-RS"
bids_dir="/ocean/projects/med220004p/shared/data_raw/NKI-RS/dataset"
workers=1
trial_id=$(printf '%03d' $SLURM_ARRAY_TASK_ID)

out="results/ds-${dataset}_method-${method}_workers-${workers}_trial-${trial_id}.json"

# Activate environment
source envs/${method}/bin/activate
which python

echo python benchmark_indexing.py -w $workers $method $bids_dir $out
python benchmark_indexing.py -w $workers $method $bids_dir $out
26 changes: 26 additions & 0 deletions benchmark/indexing/setup_environments.sh
@@ -0,0 +1,26 @@
#!/bin/bash

rm -r envs 2>/dev/null
mkdir envs

# Official pybids
python3.10 -m venv envs/pybids && \
source envs/pybids/bin/activate && \
pip install -U pip && \
pip install pybids==0.16.0 && \
deactivate

# ANCP-bids
python3.10 -m venv envs/ancpbids && \
source envs/ancpbids/bin/activate && \
pip install -U pip && \
pip install ancpbids==0.2.2 && \
deactivate

# bids2table
python3.10 -m venv envs/bids2table && \
source envs/bids2table/bin/activate && \
pip install -U pip && \
pip install -U git+https://github.com/cmi-dair/elbow.git@3117427 && \
pip install -U git+https://github.com/cmi-dair/bids2table.git@b7b1658 && \
deactivate
36 changes: 36 additions & 0 deletions benchmark/query/README.md
@@ -0,0 +1,36 @@
# Query benchmark

In this benchmark we compare the query performance for [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and [bids2table](https://github.com/cmi-dair/bids2table). The queries are modeled after the [PyBIDS `BIDSLayout` tutorial](https://bids-standard.github.io/pybids/examples/pybids_tutorial.html#querying-the-bidslayout).

For this benchmark, we use raw data from the [Chinese Color Nest Project](http://deepneuro.bnu.edu.cn/?p=163) (195 subjects, 2 resting state sessions per subject).

## Setup environment

We create a single environment that includes all three main libraries using the [setup_environments.sh](setup_environments.sh) script. The main libraries are pinned to the following versions.

- PyBIDS: 0.16.0
- ancpBIDS: 0.2.2
- bids2table: 0.1.dev29+gb7b1658

## Run benchmark

The [query_benchmark.ipynb](query_benchmark.ipynb) notebook runs the benchmark. We ran the benchmark on the [PSC bridges2](https://www.psc.edu/resources/bridges-2/) HPC system.

## Results

We compared the performance of the different indices on four queries:

- Get subjects: Get a list of all unique subjects
- Get BOLD: Get a list of all BOLD Nifti image files
- Query metadata: Find scans with a specific value for a sidecar metadata field
- Get morning scans: Find scans that were acquired before 10 AM
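Against a tabular index, all four queries reduce to ordinary DataFrame filters. A toy sketch (column names are illustrative, not bids2table's exact schema):

```python
import pandas as pd

# Toy index with one row per file; sidecar metadata fields appear as
# ordinary columns. Real column names in bids2table may differ.
index = pd.DataFrame(
    {
        "sub": ["01", "01", "02"],
        "suffix": ["bold", "T1w", "bold"],
        "RepetitionTime": [2.5, None, 2.0],
        "AcquisitionTime": ["09:15:00", "11:30:00", "08:45:00"],
    }
)

subjects = sorted(index["sub"].unique())                # get subjects
bold = index[index["suffix"] == "bold"]                 # get BOLD files
tr_match = index[index["RepetitionTime"] == 2.0]        # query metadata
# Zero-padded HH:MM:SS strings compare correctly lexicographically.
morning = index[index["AcquisitionTime"] < "10:00:00"]  # scans before 10 AM
```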

Below is a summary table of the query run times in milliseconds. We find that bids2table is >20x faster than PyBIDS and ancpBIDS.

| Index | Get subjects (ms) | Get BOLD (ms) | Query metadata (ms) | Get morning scans (ms) |
| -- | -- | -- | -- | -- |
| PyBIDS | 1350 | 12.3 | 6.53 | 34.3 |
| ancpBIDS | 30.6 | 19.2 | -- | -- |
| bids2table | **0.046** | **0.346** | **0.312** | **0.352** |

Note that ancpBIDS is missing values for the two queries that require accessing the sidecar metadata. It's possible that ancpBIDS supports these queries, but [looking at the documentation](https://ancpbids.readthedocs.io/en/latest/advancedQueries.html), it wasn't obvious to me how.