
Commit ce49a4e

Feat: Improve doc (#2)
1 parent b1145ac commit ce49a4e

2 files changed (+47, -7 lines)


AUTHORS.md

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ Adrien Berchet (@adrien-berchet)
# Contributors

* Alexis Arnaudon (@arnaudon)
+* Gianluca Ficarelli (@GianlucaFicarelli)

README.md

Lines changed: 46 additions & 7 deletions
@@ -4,7 +4,7 @@ Provides an embarrassingly parallel tool with sql backend.

## Introduction

-Provides an embarrassingly parallel tool with sql backend, inspired by [BluePyMM](https://github.com/BlueBrain/BluePyMM).
+Provides an embarrassingly parallel tool with sql backend, inspired by [BluePyMM](https://github.com/BlueBrain/BluePyMM) from @wvangeit.


## Installation
@@ -41,11 +41,34 @@ mapper = parallel_factory.get_mapper()
result = sorted(mapper(function, mapped_data, *function_args, **function_kwargs))
```

-
-### Working with Pandas and SQL backend
+### Working with Pandas

This library provides a specific function, :func:`bluepyparallel.evaluator.evaluate`, for working with large :class:`pandas.DataFrame` objects.
This function converts the DataFrame into a list of dicts (one for each row), then maps a given function to each element and finally gathers the results.
+
+Example:
+
+```python
+import pandas as pd
+
+from bluepyparallel.evaluator import evaluate
+
+input_df = pd.DataFrame(index=[1, 2], columns=['data'], data=[100, 200])
+
+def evaluation_function(row):
+    result_1, result_2 = compute_something(row['data'])
+    return {'new_column_1': result_1, 'new_column_2': result_2}
+
+# Map the given function to each row of the DataFrame
+result_df = evaluate(
+    input_df,  # This is the DataFrame to process
+    evaluation_function,  # This is the function that should be applied to each row of the DataFrame
+    parallel_factory="multiprocessing",  # This could also be a previously defined Factory
+    new_columns=[['new_column_1', 0], ['new_column_2', None]],  # This defines the new columns and their default values
+)
+assert result_df.columns.tolist() == ['data', 'new_column_1', 'new_column_2']
+```
+
+It is, in a way, a generalisation of the pandas `.apply` method.
+
+
+### Working with an SQL backend
+
As it aims at working with time-consuming functions, it also provides a checkpoint and resume mechanism using a SQL backend.
The SQL backend uses the [SQLAlchemy](https://docs.sqlalchemy.org) library, so it can work with a large variety of database types (like SQLite, PostgreSQL, MySQL, ...).
To activate this feature, just pass a [URL that can be processed by SQLAlchemy](https://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=url#database-urls) to the ``db_url`` parameter of :func:`bluepyparallel.evaluator.evaluate`.
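
As an illustrative sketch (the SQLite file name below is only an assumption; any SQLAlchemy-compatible URL works), activating the backend could look like:

```python
result_df = evaluate(
    input_df,
    evaluation_function,
    parallel_factory="multiprocessing",
    new_columns=[['new_column_1', 0], ['new_column_2', None]],
    db_url="sqlite:///evaluation.db",  # assumed file name; any SQLAlchemy URL is accepted
)
# Re-running the same call with the same db_url resumes from the already computed rows,
# so only the missing elements are evaluated again.
```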
@@ -69,10 +92,11 @@ If the crash was due to an external cause (therefore executing the code again sh
computation from the last computed element. Thus, only the missing elements are computed, which can save a lot of time.


-## Running using Dask
+## Running with distributed Dask MPI on HPC systems

-This is an example of a [sbatch](https://slurm.schedmd.com/sbatch.html) script that can be adapted to execute the script using multiple nodes and workers.
-In this example, the code called by the ``<command>`` should parallelized using BluePyParallel.
+This is an example of a [sbatch](https://slurm.schedmd.com/sbatch.html) script that can be
+adapted to execute the script using multiple nodes and workers with distributed Dask and MPI.
+In this example, the code called by ``run.py`` should be parallelized using BluePyParallel.

Dask variables are not strictly required, but highly recommended, and they can be fine-tuned.

@@ -96,9 +120,24 @@ export DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms # Time between starti
# Split tasks to avoid some dask errors (e.g. Event loop was unresponsive in Worker)
export PARALLEL_BATCH_SIZE=1000

-srun -v <command>
+srun -v python run.py
```

+To ensure that only the `evaluate` function is run in parallel with Dask, one has to initialise the parallel factory
+before anything else is done in the code. For example, ``run.py`` could look like:
+
+```python
+if __name__ == "__main__":
+    parallel_factory = init_parallel_factory('dask_dataframe')
+    df = pd.read_csv("input_data.csv")
+    df = some_preprocessing(df)
+    df = evaluate(df, function_to_evaluate, parallel_factory=parallel_factory)
+    df.to_csv("output_data.csv")
+```
+
+This is because everything before `init_parallel_factory` is run in parallel by every MPI rank, as MPI is not initialized yet at that point.
+
+.. note:: We recommend using `dask_dataframe` instead of `dask`, as it is in practice more stable for large computations.

## Funding & Acknowledgment
