deploy: 3227bb5

owkin · Nov 28, 2023 · 6a430a6 · 6a430a6
commit 6a430a6
Show file tree

Hide file tree

Showing 67 changed files with 21,913 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: 37c80a94381c99d217356b4da8e6e16f
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.nojekyll b/.nojekyll
diff --git a/_sources/api/algorithms.rst.txt b/_sources/api/algorithms.rst.txt
@@ -0,0 +1,6 @@
+fedeca.algorithms
+=========================
+
+.. currentmodule:: fedeca.algorithms
+
+.. autoclass:: fedeca.algorithms.TorchWebDiscoAlgo
diff --git a/_sources/api/competitors.rst.txt b/_sources/api/competitors.rst.txt
@@ -0,0 +1,8 @@
+fedeca.competitors
+=========================
+
+.. autoclass:: fedeca.PooledIPTW
+
+.. autoclass:: fedeca.MatchingAjudsted
+
+.. autoclass:: fedeca.NaiveComparison
diff --git a/_sources/api/iptw.rst.txt b/_sources/api/iptw.rst.txt
@@ -0,0 +1,4 @@
+fedeca.fedeca
+=========================
+
+.. autoclass:: fedeca.FedECA
diff --git a/_sources/api/metrics.rst.txt b/_sources/api/metrics.rst.txt
@@ -0,0 +1,4 @@
+fedeca.metrics
+=========================
+
+.. automodule:: fedeca.metrics.metrics
diff --git a/_sources/api/scripts.rst.txt b/_sources/api/scripts.rst.txt
@@ -0,0 +1,4 @@
+fedeca.scripts
+=========================
+
+.. autoclass:: fedeca.scripts.substra_assets.csv_opener.CSVOpener
diff --git a/_sources/api/strategies.rst.txt b/_sources/api/strategies.rst.txt
@@ -0,0 +1,8 @@
+fedeca.strategies
+=========================
+
+.. currentmodule:: fedeca.strategies.webdisco
+
+.. autoclass:: fedeca.strategies.WebDisco
+
+.. automodule:: fedeca.strategies.webdisco_utils
diff --git a/_sources/api/utils.rst.txt b/_sources/api/utils.rst.txt
@@ -0,0 +1,14 @@
+fedeca.utils
+=========================
+
+.. automodule:: fedeca.utils.data_utils
+
+.. automodule:: fedeca.utils.experiments_utils
+
+.. automodule:: fedeca.utils.moments_utils
+
+.. automodule:: fedeca.utils.substrafl_utils
+
+.. automodule:: fedeca.utils.tensor_utils
+
+.. automodule:: fedeca.utils.typing
diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt
@@ -0,0 +1,57 @@
+FedECA documentation
+======================
+
+This package allows to perform both simulations and deployments of federated
+external control arms (FedECA) analyses.
+
+Before using this code make sure to: 
+
+#. read and accept the terms of the license license.md that can be found at the root of the repository.
+#. read `substra's privacy strategy <https://docs.substra.org/en/stable/additional/privacy-strategy.html>`_
+#. read our associated technical article
+#. `activate secure rng in Opacus <https://opacus.ai/docs/faq#:~:text=What%20is%20the%20secure_rng,the%20security%20this%20brings.>`_ if you plan on using differential privacy.
+
+
+
+Citing this work
+----------------
+
+::
+
+   @article{duterrailfedeca2023,
+   title={FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings},
+   author={Ogier du Terrail, Jean and Klopfenstein, Quentin and Li Honghao and Mayer, Imke and Loiseau, Nicolas and Hallal Mohammad and Balazard, Félix and Andreux, Mathieu},
+   year={2023},
+   doi = {no.doi.yet},
+   journal={arXiv},
+   }
+
+License
+-------
+
+FedECA is released under a custom license that can be found under license.md at the root of the repository.
+
+.. toctree::
+   :maxdepth: 0
+   :caption: Installation
+
+   installation
+
+.. toctree::
+   :maxdepth: 0
+   :caption: Getting Started Instructions
+
+   quickstart
+
+.. toctree::
+   :hidden:
+   :maxdepth: 4
+   :caption: API
+
+   api/fedeca
+   api/competitors
+   api/algorithms
+   api/metrics
+   api/scripts
+   api/strategies
+   api/utils
diff --git a/_sources/installation.rst.txt b/_sources/installation.rst.txt
@@ -0,0 +1,22 @@
+
+Installation
+============
+
+To install the package, create an env with python ``3.9`` with conda
+
+.. code-block:: bash
+
+   conda create -n fedeca python=3.9
+   conda activate fedeca
+
+Within the environment, install the package by running:
+
+.. code-block::
+
+   git clone https://github.com/owkin/fedeca.git
+   pip install -e ".[all_extra]"
+
+If you plan developing, you should also install the pre-commit hooks
+
+```bash
+pre-commit install
diff --git a/_sources/quickstart.rst.txt b/_sources/quickstart.rst.txt
@@ -0,0 +1,178 @@
+
+Quickstart
+----------
+
+FedECA tries to mimic scikit-learn API as much as possible with the constraints
+of distributed learning.
+The first step in data science is always the data.
+We need to first use or generate some survival data in pandas.dataframe format.
+Note that fedeca should work on any data format, provided that the
+return type of the substra opener is indeed a pandas.dataframe but let's keep
+it simple in this quickstart.
+
+Here we will use fedeca utils which will generate some synthetic survival data
+following CoxPH assumptions:
+
+.. code-block:: python
+
+   import pandas as pd
+   from fedeca.utils.survival_utils import CoxData
+   # Let's generate 1000 data samples with 10 covariates
+   data = CoxData(seed=42, n_samples=1000, ndim=10)
+   df = data.generate_dataframe()
+
+   # We remove the true propensity score
+   df = df.drop(columns=["propensity_scores"], axis=1)
+
+Let's inspect the data that we have here.
+
+.. code-block:: python
+
+   print(df.info())
+   # <class 'pandas.core.frame.DataFrame'>
+   # RangeIndex: 1000 entries, 0 to 999
+   # Data columns (total 13 columns):
+   #  #   Column     Non-Null Count  Dtype
+   # ---  ------     --------------  -----
+   #  0   X_0        1000 non-null   float64
+   #  1   X_1        1000 non-null   float64
+   #  2   X_2        1000 non-null   float64
+   #  3   X_3        1000 non-null   float64
+   #  4   X_4        1000 non-null   float64
+   #  5   X_5        1000 non-null   float64
+   #  6   X_6        1000 non-null   float64
+   #  7   X_7        1000 non-null   float64
+   #  8   X_8        1000 non-null   float64
+   #  9   X_9        1000 non-null   float64
+   #  10  time       1000 non-null   float64
+   #  11  event      1000 non-null   uint8
+   #  12  treatment  1000 non-null   uint8
+   # dtypes: float64(11), uint8(2)
+   # memory usage: 88.0 KB
+   print(df.head())
+   #         X_0       X_1       X_2       X_3       X_4       X_5       X_6       X_7       X_8       X_9      time  event  treatment
+   # 0 -0.918373 -0.814340 -0.148994  0.482720 -1.130384 -1.254769 -0.462002  1.451622  1.199705  0.133197  2.573516      1          1
+   # 1  0.360051 -0.863619  0.198673  0.330630 -0.189184 -0.802424 -1.694990 -0.989009 -0.421245 -0.112665  0.519108      1          1
+   # 2  0.442502  0.024682  0.069500 -0.398015 -0.521236 -0.824907  0.373018  1.016843  0.765661  0.858817  0.652803      1          1
+   # 3 -0.783965 -1.116391 -1.482413 -2.039827 -1.639304 -0.500380 -0.298467 -1.801688 -0.743004 -0.724039  0.074925      1          1
+   # 4 -0.199620 -0.652347 -0.018776  0.004630 -0.122242 -0.413490 -0.450718 -0.761894 -1.323135 -0.234899  0.006951      1          1
+   print(df["treatment"].unique())
+   # array([1, 0], dtype=uint8)
+   df["treatment"].sum()
+   # 500
+
+So we have survival data with covariates and a binary treatment variable.
+Let's inspect it using proper survival plots using the great survival analysis
+package `lifelines <https://github.com/CamDavidsonPilon/lifelines>`_ that was a
+source of inspiration for fedeca:
+
+.. code-block:: python
+
+   from lifelines import KaplanMeierFitter as KMF
+   import matplotlib.pyplot as plt
+   treatments = [0, 1]
+   kms = [KMF().fit(durations=df.loc[df["treatment"] == t]["time"], event_observed=df.loc[df["treatment"] == t]["event"]) for t in treatments]
+
+   axs = [km.plot(label="treated" if t == 1 else "untreated") for km, t in zip(kms, treatments)]
+   axs[-1].set_ylabel("Survival Probability")
+   plt.xlim(0, 1500)
+   plt.savefig("treated_vs_untreated.pdf", bbox_inches="tight")
+
+Open ``treated_vs_untreated.pdf`` in your favorite pdf viewer and see for yourself.
+
+Pooled IPTW analysis
+--------------------
+
+The treatment seems to improve survival but it's hard to say for sure as it might
+simply be due to chance or sampling bias.
+Let's perform an IPTW analysis to be sure:
+
+.. code-block:: python
+
+   from fedeca.competitors import PooledIPTW
+   pooled_iptw = PooledIPTW(treated_col="treatment", event_col="event", duration_col="time")
+   # Targets is the propensity weights
+   pooled_iptw.fit(data=df, targets=None)
+   print(pooled_iptw.results_)
+   #                coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to         z         p  -log2(p)
+   # covariate
+   # treatment  0.041727    1.04261  0.070581       -0.096609        0.180064             0.907911             1.197294     0.0  0.591196  0.554389   0.85103
+
+When looking at the ``p-value=0.554389 > 0.05``\ , thus judging by what we observe we
+cannot say for sure that there is a treatment effect. We say the ATE is non significant.
+
+Distributed Analysis
+--------------------
+
+However in practice data is private and held by different institutions. Therefore
+in practice each client holds a subset of the rows of our dataframe.
+We will simulate this using a realistic scenario where a "pharma" node is developing
+a new drug and thus holds all treated and the rest of the data is split across
+3 other institutions where patients were treated with the old drug.
+We will use the split utils of FedECA.
+
+.. code-block:: python
+
+   from fedeca.utils.data_utils import split_dataframe_across_clients
+
+   clients, train_data_nodes, _, _, _ = split_dataframe_across_clients(
+       df,
+       n_clients=4,
+       split_method= "split_control_over_centers",
+       split_method_kwargs={"treatment_info": "treatment"},
+       data_path="./data",
+       backend_type="simu",
+   )
+
+Note that you can replace split_method by any callable with the signature
+``pd.DataFrame -> list[int]`` where the list of ints is the split of the indices
+of the df across the different institutions.
+To convince you that the split was effective you can inspect the folder "./data".
+You will find different subfolders ``center0`` to ``center3`` each with different
+parts of the data.
+To unpack a bit what is going on in more depth, we have created a dict of client
+'clients',
+which is a dict with 4 keys containing substra API handles towards the different
+institutions and their data.
+``train_data_nodes`` is a list of handles towards the datasets of the different institutions
+that were registered through the substra interface using the data in the different
+folders.
+You might have noticed that we did not talk about the ``backend_type`` argument. 
+This argument is used to choose on which network will experiments be run.
+"simu" means in-RAM. If you finish this tutorial do try other values such as:
+"docker" or "subprocess" but expect a significant slow-down as experiments
+get closer and closer to a real distributed system.
+
+Now let's try to see if we can reproduce the pooled anaysis in this much more
+complicated distributed setting:
+
+.. code-block:: python
+
+   from fedeca import FedECA
+   # We use the first client as the node, which launches order
+   ds_client = clients[list(clients.keys())[0]]
+   fed_iptw = FedECA(ndim=10, ds_client=ds_client, train_data_nodes=train_data_nodes, treated_col="treatment", duration_col="time", event_col="event", robust=True)
+   fed_iptw.run()
+   # Final partial log-likelihood:
+   # [-11499.19619422]
+   #        coef  se(coef)  coef lower 95%  coef upper 95%         z         p  exp(coef)  exp(coef) lower 95%  exp(coef) upper 95%
+   # 0  0.041718  0.070581       -0.096618        0.180054  0.591062  0.554479     1.0426             0.907902             1.197282
+
+In fact what we did above is both quite verbose. For simulation purposes we
+advise to use directly the scikit-learn inspired syntax:
+
+.. code-block:: python
+
+   from fedeca import FedECA
+
+   fed_iptw = FedECA(ndim=10, treated_col="treatment", event_col="event", duration_col="time")
+   fed_iptw.fit(df, n_clients=4, split_method="split_control_over_centers", split_method_kwargs={"treatment_info": "treatment"}, data_path="./data", robust=True, backend_type="simu")
+   #        coef  se(coef)  coef lower 95%  coef upper 95%         z         p  exp(coef)  exp(coef) lower 95%  exp(coef) upper 95%
+   # 0  0.041718  0.070581       -0.096618        0.180054  0.591062  0.554479     1.0426             0.907902             1.197282
+
+We find a similar p-value ! The distributed analysis is working as expected.
+We recommend to users that made it to here as a next step to use their own data
+and write custom split functions and to test this pipeline under various
+heterogeneity settings.
+Another interesting avenue is to try adding differential privacy to the training
+of the propensity model but that is outside the scope of this quickstart.