standard benchmark proposal (#150)
Co-authored-by: Paweł Czyż <[email protected]>
3 people authored Mar 5, 2024
1 parent bb2b3b7 commit 4372045
Showing 42 changed files with 679 additions and 19 deletions.
37 changes: 30 additions & 7 deletions README.md
@@ -52,6 +52,36 @@ ksg = bmi.estimators.KSGEnsembleFirstEstimator(neighborhoods=(5,))
print(f"Estimate by KSG: {ksg.estimate(X, Y):.2f}")
```

## Evaluating a new estimator

The above code snippet is convenient for estimating mutual information on a given data set or for developing a new mutual information estimator.
For extensive benchmarking, however, it is more convenient to use one of the benchmark suites available in the `workflows/benchmark/` subdirectory.

For example, you can install [Snakemake](https://snakemake.readthedocs.io/en/stable/) and run a small benchmark suite on several estimators using:

```bash
$ snakemake -c4 -s workflows/benchmark/demo/run.smk
```

In about a minute it should generate minibenchmark results in the `generated/benchmark/demo` directory. Note that the configuration file, `workflows/benchmark/demo/config.py`, explicitly defines the estimators and tasks used, as well as the number of samples.

Hence, it is easy to benchmark a custom estimator by importing it and including it in the configuration dictionary.
More information is available [here](https://cbg-ethz.github.io/bmi/benchmarking-new-estimator/), where we cover evaluating both new Python and non-Python estimators.

Similarly, it is easy to change the number of samples or adjust the tasks included in the benchmark.
We define several benchmark suites that share a common structure.

## List of implemented estimators

(Your estimator can be here too! Please reach out to us if you would like to contribute.)

- The neighborhood-based KSG estimator proposed in [Estimating Mutual Information](https://arxiv.org/abs/cond-mat/0305641) by Kraskov et al. (2003).
- Donsker-Varadhan and MINE estimators proposed in [MINE: Mutual Information Neural Estimation](https://arxiv.org/abs/1801.04062) by Belghazi et al. (2018).
- InfoNCE estimator proposed in [Representation Learning with Contrastive Predictive Coding](https://arxiv.org/abs/1807.03748) by Oord et al. (2018).
- NWJ estimator proposed in [Estimating divergence functionals and the likelihood ratio by convex risk minimization](https://arxiv.org/abs/0809.0853) by Nguyen et al. (2008).
- Estimator based on canonical correlation analysis described in [Feature discovery under contextual supervision using mutual information](https://ieeexplore.ieee.org/document/227286) by Kay (1992) and in [Some data analyses using mutual information](https://www.jstor.org/stable/43601047) by Brillinger (2004).


## Citing

If you find this code useful in your research, consider citing [our manuscript](https://papers.nips.cc/paper_files/paper/2023/hash/36b80eae70ff629d667f210e13497edf-Abstract-Conference.html):
@@ -70,10 +100,3 @@ If you find this code useful in your research, consider citing [our manuscript](
}
```

## List of estimators

- The neighborhood-based KSG estimator proposed in [Estimating Mutual Information](https://arxiv.org/abs/cond-mat/0305641) by Kraskov et al. (2003).
- Donsker-Varadhan and MINE estimators proposed in [MINE: Mutual Information Neural Estimation](https://arxiv.org/abs/1801.04062) by Belghazi et al. (2018).
- InfoNCE estimator proposed in [Representation Learning with Contrastive Predictive Coding](https://arxiv.org/abs/1807.03748) by Oord et al. (2018).
- NWJ estimator proposed in [Estimating divergence functionals and the likelihood ratio by convex risk minimization](https://arxiv.org/abs/0809.0853) by Nguyen et al. (2008).
- Estimator based on canonical correlation analysis described in [Feature discovery under contextual supervision using mutual information](https://ieeexplore.ieee.org/document/227286) by Kay (1992) and in [Some data analyses using mutual information](https://www.jstor.org/stable/43601047) by Brillinger (2004).
153 changes: 153 additions & 0 deletions docs/benchmarking-new-estimator.md
@@ -0,0 +1,153 @@
# Benchmarking a new estimator

Standard package utilities allow one to estimate mutual information on a given sample and can be convenient for the development of a new mutual information estimator.
However, once a new estimator is ready, how can it be benchmarked? Running it manually on many distributions can take a lot of effort.

Conveniently, [Snakemake](https://snakemake.readthedocs.io/en/stable/) is a workflow orchestrator which can generate benchmark tasks, save samples to disk, and run many estimators.
Once Snakemake has been installed, we recommend running:

```bash
$ snakemake -c4 -s workflows/benchmark/demo/run.smk
```

In about a minute it should generate minibenchmark results in the `generated/benchmark/demo` directory. Note that the configuration file, `workflows/benchmark/demo/config.py`, explicitly defines the estimators and tasks used, as well as the number of samples.

Hence, it is possible to benchmark a *custom Python estimator* by simply importing it and including it in the configuration dictionary. Similarly, it is easy to change the number of samples or the set of tasks used.
The `config.py` file is plain Python!
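
To make this concrete, a hypothetical fragment of such a configuration could look as follows. The variable names (`ESTIMATORS`, `N_SAMPLES`, `SEEDS`) and the custom import are illustrative only; consult `workflows/benchmark/demo/config.py` for the names the demo suite actually uses.

```python
import bmi

from my_package import MyNewEstimator  # hypothetical import of your own estimator

# Estimators are ordinary Python objects, keyed by the identifier used in result files.
ESTIMATORS = {
    "KSG-5": bmi.estimators.KSGEnsembleFirstEstimator(neighborhoods=(5,)),
    "MyNewEstimator": MyNewEstimator(),
}

# Sample sizes and random seeds controlling how many data sets are generated per task.
N_SAMPLES = [1_000, 5_000]
SEEDS = [0, 1, 2]
```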

We define several benchmark suites that share a common structure.

## Adding a new Python estimator

Every added estimator has to implement the [`IMutualInformationPointEstimator`](https://github.com/cbg-ethz/bmi/blob/main/src/bmi/interface.py) interface.

Let's take a look at the simplest possible estimator (one that generally shouldn't be used in practice), which estimates mutual information as $\hat I(X; Y) = -0.5 \log\left( 1-\mathrm{Corr}(X_1, Y_1)^2 \right)$.

```python
import numpy as np

from bmi.interface import BaseModel, IMutualInformationPointEstimator

class EmptyParams(BaseModel):
    """This estimator has no hyperparameters."""
    pass

class TooSimpleEstimator(IMutualInformationPointEstimator):
    def __init__(self) -> None:
        """All estimator hyperparameters should be set at this stage.
        This estimator does not have any hyperparameters, though.
        """
        pass

    def estimate(self, x, y) -> float:
        """Estimates mutual information.

        Args:
            x: samples from the X variable, shape (n_points, dim_x)
            y: samples from the Y variable, shape (n_points, dim_y)
        """
        x, y = np.asarray(x), np.asarray(y)
        x1 = x[:, 0]
        y1 = y[:, 0]

        rho = np.corrcoef(x1, y1)[0, 1]
        return -0.5 * np.log(1 - rho**2)

    def parameters(self) -> BaseModel:
        """Returns the hyperparameters of the estimator."""
        return EmptyParams()
```

If your estimator is a function, you can also wrap it into a class using the [`FunctionalEstimator`](https://github.com/cbg-ethz/bmi/blob/main/src/bmi/estimators/function_wrapper.py) wrapper.
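
As a minimal sketch, assuming the wrapper can be constructed directly from a callable with signature `(x, y) -> float` (check `function_wrapper.py` for the exact constructor), this could look like:

```python
import numpy as np

from bmi.estimators.function_wrapper import FunctionalEstimator


def correlation_mi(x, y) -> float:
    """The correlation-based estimate from above, written as a plain function."""
    rho = np.corrcoef(np.asarray(x)[:, 0], np.asarray(y)[:, 0])[0, 1]
    return -0.5 * np.log(1 - rho**2)


# Hypothetical usage: the exact argument name may differ in the actual wrapper.
estimator = FunctionalEstimator(correlation_mi)
```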

Once such a class is available, it can simply be included in the configuration dictionary.

## Adding a non-Python estimator

How can one use the benchmark to evaluate an estimator implemented in another programming language?
The benchmark employs Snakemake workflows in which each estimator has to implement the `IMutualInformationPointEstimator` interface.

However, it is possible to create a thin wrapper which, whenever called, executes the following steps:

- Save the samples to a temporary location.
- Run an external script (e.g., in Julia, R, C++, or any other language) which prints the mutual information estimate to standard output.
- Convert the standard output to `float` and return it as the estimate.

The external script can be implemented in any language. Generally, it takes three required arguments:

1. The path to the CSV file with samples.
2. The number of columns representing the $X$ variable, `dim_x`.
3. The number of columns representing the $Y$ variable, `dim_y`.

It can also accept further arguments controlling the hyperparameters.
Example scripts running several existing estimators implemented in Julia and R are available [here](https://github.com/cbg-ethz/bmi/tree/main/external).
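
As an illustration of this calling convention (written in Python here purely for brevity, and assuming the first `dim_x` columns of the CSV hold the $X$ sample and the next `dim_y` columns hold the $Y$ sample), a minimal external script could look as follows:

```python
#!/usr/bin/env python
"""A sketch of an external estimator script following the convention above."""
import argparse

import numpy as np
import pandas as pd


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("sample_path", help="Path to the CSV file with samples.")
    parser.add_argument("dim_x", type=int, help="Number of columns representing X.")
    parser.add_argument("dim_y", type=int, help="Number of columns representing Y.")
    parser.add_argument("--neighbors", type=int, default=10, help="Example hyperparameter (unused in this sketch).")
    args = parser.parse_args()

    sample = pd.read_csv(args.sample_path)
    x = sample.iloc[:, : args.dim_x].to_numpy()
    y = sample.iloc[:, args.dim_x : args.dim_x + args.dim_y].to_numpy()

    # Placeholder estimate (correlation-based), standing in for a real estimator.
    rho = np.corrcoef(x[:, 0], y[:, 0])[0, 1]
    print(-0.5 * np.log(1 - rho**2))


if __name__ == "__main__":
    main()
```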

Once a script is ready and can be run manually on CSV samples, it can be wrapped into a Python class implementing the `IMutualInformationPointEstimator` interface.
Routines such as saving the samples to a CSV file and parsing the standard output are shared by all such wrappers, so we implemented a convenience class, [`ExternalEstimator`](https://github.com/cbg-ethz/bmi/blob/main/src/bmi/estimators/external/external_estimator.py).

For example, let's take a look at how a wrapper around the KSG estimator implemented in R can be written.

```python

from pathlib import Path

from bmi.estimators.external.external_estimator import ExternalEstimator
from bmi.interface import BaseModel, Pathlike

R_PATH = "path_to_the_R_script"


class KSGParams(BaseModel):
    """The hyperparameters of a given estimator.

    Attrs:
        neighbors (int): number of neighbors to be used in the KSG algorithm
    """
    neighbors: int


class RKSGEstimator(ExternalEstimator):
    """The KSG estimator implemented in the `rmi` package in R."""

    def __init__(self, neighbors: int = 10) -> None:
        """
        Args:
            neighbors: number of neighbors (k) to be used
        """
        self._params = KSGParams(
            neighbors=neighbors,
        )

    def parameters(self) -> KSGParams:
        """Returns the hyperparameters of the estimator."""
        return self._params

    def _build_command(self, path: Pathlike, dim_x: int, dim_y: int) -> list[str]:
        estimator_r_path_abs = str(Path(R_PATH).absolute())
        sample_path_abs = str(Path(path).absolute())
        return [
            "Rscript",
            estimator_r_path_abs,
            sample_path_abs,
            str(dim_x),
            str(dim_y),
            "--method",
            "KSG1",
            "--neighbors",
            str(self._params.neighbors),
        ]


estimator = RKSGEstimator(neighbors=5)
# This is an ordinary estimator now and can be run with `estimator.estimate(X, Y)`.
```

More Python wrappers around external scripts are implemented [here](https://github.com/cbg-ethz/bmi/tree/main/src/bmi/estimators/external).

## FAQ

### How to add an estimator implemented in another programming language?

It's possible to quickly implement a thin Python wrapper around an estimator implemented in another language.
See [this section](#adding-a-non-python-estimator) for details.

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -47,6 +47,7 @@ plugins:
nav:
- Home: index.md
- Estimators: estimators.md
- Benchmark: benchmarking-new-estimator.md
- Fine distributions: fine-distributions.md
- Contributing: contributing.md
- API: api/index.md
12 changes: 0 additions & 12 deletions workflows/Beyond_Normal/README.md

This file was deleted.

7 changes: 7 additions & 0 deletions workflows/README.md
@@ -0,0 +1,7 @@
# Workflows

This directory contains [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflows. These workflows serve two different purposes:

- `benchmark`: workflows for the development and benchmarking of new estimators. They are designed to be lean, modular, and easy to use.
- `projects`: workflows we used to perform experiments in our manuscripts. They are included for reproducibility purposes, but they are less convenient for the development of new estimators.

86 changes: 86 additions & 0 deletions workflows/benchmark/_common_benchmark_rules.smk
@@ -0,0 +1,86 @@
# Main workflow for running the benchmark.

import resource
import yaml
import pandas as pd

from bmi.benchmark import run_estimator
import _utils as utils


TASKS_DICT = {
    task.id: task for task in TASKS
}


rule benchmark_tables:
    input: 'results.csv'
    output: 'benchmark.html', 'benchmark_converged_only.html', 'benchmark_convergance.html'
    run:
        results = utils.read_results(str(input))
        table = utils.create_benchmark_table(results, converged_only=False)
        table.to_html('benchmark.html')

        table = utils.create_benchmark_table(results, converged_only=True)
        table.to_html('benchmark_converged_only.html')

        table = utils.create_convergance_table(results)
        table.to_html('benchmark_convergance.html')


# Gather all results into one CSV file.
rule results:
    output: 'results.csv'
    input:
        expand(
            'results/{estimator_id}/{task_id}/{n_samples}-{seed}.yaml',
            estimator_id=ESTIMATORS_DICT,
            task_id=TASKS_DICT,
            n_samples=N_SAMPLES,
            seed=SEEDS,
        )
    run:
        results = []
        for result_path in input:
            with open(result_path) as f:
                result = yaml.load(f, yaml.SafeLoader)
            task = TASKS_DICT[result['task_id']]
            result['mi_true'] = task.mutual_information
            result['task_name'] = task.name
            result['task_params'] = task.params
            results.append(result)
        pd.DataFrame(results).to_csv(str(output), index=False)


# Sample a task given a seed and the number of samples.
rule sample_task:
    output: 'tasks/{task_id}/{n_samples}-{seed}.csv'
    run:
        task_id = wildcards.task_id
        n_samples = int(wildcards.n_samples)
        seed = int(wildcards.seed)
        task = TASKS_DICT[task_id]
        task.save_sample(str(output), n_samples, seed)


# Apply an estimator to a sample.
rule apply_estimator:
    output: 'results/{estimator_id}/{task_id}/{n_samples}-{seed}.yaml'
    input: 'tasks/{task_id}/{n_samples}-{seed}.csv'
    run:
        estimator_id = wildcards.estimator_id
        estimator = ESTIMATORS_DICT[estimator_id]
        task_id = wildcards.task_id
        seed = int(wildcards.seed)

        # Limit the memory available to the estimator; 1 << 33 bytes is about 8 GiB.
        resource.setrlimit(resource.RLIMIT_DATA, (1 << 33, 1 << 33))

        result = run_estimator(
            estimator=estimator,
            estimator_id=estimator_id,
            sample_path=str(input),
            task_id=task_id,
            seed=seed,
        )
        result.dump(str(output))
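
For orientation, the rules above reference `TASKS`, `ESTIMATORS_DICT`, `N_SAMPLES`, and `SEEDS` without defining them, so a suite-specific entry point is expected to provide these names before including this file. A hypothetical `run.smk` (a sketch only; the actual demo workflow may be organized differently, and the task-construction call is an assumption to adjust to the real API) could look like:

```python
import bmi

# Estimators keyed by the identifier used in result paths.
ESTIMATORS_DICT = {
    "KSG-5": bmi.estimators.KSGEnsembleFirstEstimator(neighborhoods=(5,)),
}

# Task objects must expose .id, .name, .params, .mutual_information,
# and .save_sample(path, n_samples, seed), as used by the shared rules above.
TASKS = list(bmi.benchmark.BENCHMARK_TASKS.values())  # hypothetical helper; adjust to the actual API

N_SAMPLES = [1_000]
SEEDS = [0, 1, 2]

include: "../_common_benchmark_rules.smk"

rule all:
    input: 'benchmark.html'
```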