
Minutes MAP Open source product committee #2 10 09 2020

Clément Mayer edited this page Sep 16, 2020 · 1 revision

Thank you to everybody for joining the session and sharing your thoughts and resources!

Participants

  • Amine Saboni (Octo Technology)
  • Céline Jacques (Apricity)
  • Chris-Alexandre Pena (Apricity)
  • Clément Mayer (Substra Foundation)
  • Eric Boniface (Substra Foundation)
  • Nathanaël Cretin (Substra Foundation)
  • Arthur Pignet (Substra Foundation)
  • Nicolas Landel (Substra Foundation)
  • Camille Marini (Owkin / Substra Foundation)
  • Mathieu Galtier (Owkin / Substra Foundation)
  • Romain Goussault (CHU Nantes / Substra Foundation)

Minutes

Open source product organization

The objectives of the MAP Committee are:

  1. Identify and collect features and requests for features as they arise.
  2. Ensure that these features are well expressed and understood by the community.
  3. Define which of these expressed features have higher priority, in the most desirable direction for the evolution of the framework. This enables assembling an indicative roadmap of future priorities.

The committee meets quarterly, and anyone is welcome to fuel the open repository with comments, thoughts, ideas, contribution proposals, etc. The sections below give a complementary status to date.

Review of the different ideas / features

Note: the issues (i.e. features ideas and requests) are assembled in a per advancement status view here (statuses are: Ideas & Requests, Some thoughts in progress, Work in progress, Dismissed ideas)

PETs complementarity / integrations

  • A first example of differential privacy combined with Substra has been developed by Fabien Gelus (former intern at Substra Foundation):
    • Exploration of tools for differential privacy by Fabien Gelus (before Opacus release by FB)
    • Use case of DP on Substra done by Fabien (with tensorflow-privacy)
  • This first example is a great step to demonstrate that it is possible to use DP libraries like tensorflow-privacy in ML configurations using the Substra Framework
  • [Eric] Another tool can be explored as a use case: Opacus by Facebook, released recently, seems to have the ambition to become a new standard
  • It would also be interesting to try a “real” use case with Substra and DP, and not only on MNIST → use case to be identified. For example, Google used Differential Privacy to monitor people’s mobility during the COVID-19 crisis.
  • [Amine] See this list of resources on DP.
  • [Mathieu] Could this privacy be measured from a budget point of view?
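To make Mathieu’s budget question concrete, here is a minimal sketch of the mechanism that libraries like tensorflow-privacy and Opacus automate: per-example gradient clipping plus calibrated Gaussian noise (DP-SGD). The function name and toy gradients are illustrative, not part of any library’s API; the `noise_multiplier` is precisely what those libraries’ accountants translate into a spent epsilon budget.

```python
import math
import random

def dp_gradient_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD aggregation step: clip each example's gradient to
    clip_norm, sum the clipped gradients, then add Gaussian noise
    scaled to the clipping norm. Hypothetical sketch of what
    tensorflow-privacy / Opacus do under the hood."""
    clipped_sum = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            clipped_sum[i] += g * scale
    sigma = noise_multiplier * clip_norm
    noisy = [s + random.gauss(0.0, sigma) for s in clipped_sum]
    return [v / len(per_example_grads) for v in noisy]

grads = [[0.3, -0.1], [2.0, 2.0], [0.0, 0.5]]  # toy per-example gradients
step = dp_gradient_step(grads)
```

Because Substra is agnostic to the training code, such a step could in principle live inside any algo script, which is what Fabien’s tensorflow-privacy example demonstrated.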

Contributivity / Multi-partners learning

  • Work is in progress in the contributivity workgroup (see dedicated repository). The package already offers 8 to 9 different calculation methods. No large benchmark has been done and no findings have been shared yet.
  • Comments by Eric on the dedicated issue:
    • It seems possible to design a large compute plan with all the train and test tasks needed to run certain contributivity measurement methods
    • Interesting next steps:
      • a quick review by Camille to check whether Eric’s first thoughts are coherent
      • include working on a “PoC” of a simple contributivity measurement approach on Substra in Arthur’s internship roadmap
      • identify another dataset in order to gather other data / thoughts
  • Want to join the discussion? Join the public Slack channel #workgroup-mpl-contributivity and participate in the discussions and the dedicated repositories.
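As a reference point for the workgroup’s calculation methods, a minimal exact Shapley-value computation can be sketched in a few lines. The `coalition_score` callable is a stand-in for training and evaluating a model on the union of the partners’ datasets, which on Substra would correspond to the large compute plan of train/test tasks Eric mentions; the exact version needs 2^n evaluations, hence the interest in cheaper approximations.

```python
from itertools import combinations
from math import factorial

def shapley_contributivity(partners, coalition_score):
    """Exact Shapley value of each partner, given a score for every
    coalition of partners. Illustrative sketch, not the workgroup's
    package: coalition_score(coalition) stands in for 'test metric of
    a model trained on this coalition's pooled data'."""
    n = len(partners)
    values = {p: 0.0 for p in partners}
    for p in partners:
        others = [q for q in partners if q != p]
        for size in range(n):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_p = coalition_score(set(coalition) | {p})
                without_p = coalition_score(set(coalition))
                values[p] += weight * (with_p - without_p)
    return values

# Toy score: each partner's data adds a fixed amount of test accuracy.
data_quality = {"A": 0.5, "B": 0.3, "C": 0.2}
score = lambda coalition: sum(data_quality[p] for p in coalition)
contributions = shapley_contributivity(["A", "B", "C"], score)
```

For such an additive score the Shapley values simply equal each partner’s own contribution; real datasets interact, which is exactly what the benchmark on another dataset would probe.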

Model lineage / End-to-end genealogy

  • The objective here is to create an identity card of a model that links it to datasets, algorithms, and initial models; it includes its “genealogy”!
  • See thoughts by Clément on the dedicated issue.
  • [Camille] There is a very early draft in the frontend already, a sort of simple identity card. It is not yet clean, readable, or user-friendly. It could be great to work on improving that!
  • The idea is to move forward in the different work groups on this ID card, and to see how to integrate it into the Substra Framework.
  • [Amine] About monitoring data quality, and how it impacts model performance, continuous learning setups (or regularly re-trained models).
    • [Camille] This is rather done outside the framework today but we could probably imagine integrations with external tools via queries.
  • Next step: the topic will be studied during the new season of DataForGood!
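A possible shape for such an identity card, as a sketch only: the field names below are illustrative and do not reflect Substra’s actual asset schema (in Substra, assets are addressed by keys/hashes). The fingerprint idea shows how the genealogy itself could become a verifiable provenance check.

```python
from dataclasses import dataclass, field
import hashlib
import json

@dataclass
class ModelCard:
    """Hypothetical 'identity card' linking a model to its lineage:
    the algo and datasets that produced it, and its parent models."""
    model_key: str
    algo_key: str
    dataset_keys: list
    parent_model_keys: list = field(default_factory=list)  # genealogy

    def fingerprint(self):
        """Stable hash of the full lineage, usable as a provenance check."""
        payload = json.dumps(self.__dict__, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

card = ModelCard(
    model_key="model-001",
    algo_key="algo-cnn-v2",
    dataset_keys=["hospital-a-train", "hospital-b-train"],
    parent_model_keys=["model-000"],  # the initial model it was trained from
)
print(card.fingerprint()[:12])
```

Walking the `parent_model_keys` recursively would yield the full end-to-end genealogy the committee discussed.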

Set up a testnet of Substra and documentation

  • [Nathanaël, Chris] We succeeded today in a local setup, following very clear documentation. A setup on a node with two organizations on a virtual machine has also been done and can be made available upon request.
  • [Mathieu] It would be super interesting to have a few examples easily executable on a publicly accessible instance (e.g. OVH). Budget might be a topic though
    • [Eric] Budget could be covered by OVH sponsorship
    • [Chris] In such a setup, couldn’t there be a problem of resource accumulation?
    • [Camille] It can be handled with some simple hacks to launch several learning tasks on Titanic for example (adding a random part to the algo script, etc.)
  • → The next step is to install a test instance on OVH in order to allow a Data Scientist to play with Substra without having to perform a deployment
  • [Amine] Possibility to package the VM on OVH or AWS marketplaces?
    • Nathanaël and Eric will look into that

Model Distillation

  • [Mathieu] Distillation means going from a base model to a “student” model, a kind of reduced and scrambled version. It provides privacy guarantees and is well described in the literature. It happens at each iteration as a post-processing step after obtaining gradients, and is implementable on Substra as additional operations in learning algos (transparent thanks to Substra being agnostic to learning algos).
  • [Eric] Are there existing libraries for that (like tensorflow-privacy and Opacus for DP)?
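For reference, the core of the classic (Hinton-style) distillation objective fits in a few lines: the student is trained against the teacher’s temperature-softened output distribution. This is a generic sketch, not Substra code; as Mathieu notes, it could run as an extra step inside any training algo since the framework does not constrain the algo’s internals.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; T > 1 'softens' the distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened outputs against the
    teacher's softened outputs. Illustrative sketch of the standard
    distillation objective, minimized by gradient descent in practice."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

loss = distillation_loss([4.0, 1.0, 0.2], [3.5, 1.2, 0.3])
```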

Composite Training

  • [Camille] All the developments made are open source.

Advanced permissions

  • [Chris, Céline] Are there ongoing developments on that aspect? What concepts or approaches of permissions are considered?
  • [Amine] An idea: we could imagine permissions (e.g. granted by dataset providers to data scientist) conditional to a privacy budget (differential privacy epsilon)
  • [Camille] There is no new development at this stage. Owkin will probably work on it in Q4.
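Amine’s idea of permissions conditional on a privacy budget can be sketched as follows. This is not an existing Substra concept, only an illustration of the proposal: the dataset provider grants a data scientist access up to a total differential-privacy epsilon, and tasks are refused once the budget would be exceeded.

```python
class BudgetedPermission:
    """Hypothetical permission object sketching Amine's proposal:
    access granted by a dataset provider to a data scientist is
    conditional on a differential-privacy budget (epsilon)."""

    def __init__(self, grantee, max_epsilon):
        self.grantee = grantee
        self.max_epsilon = max_epsilon
        self.spent = 0.0

    def authorize(self, user, epsilon_cost):
        """Allow a task only while the grantee stays within budget."""
        if user != self.grantee:
            return False
        if self.spent + epsilon_cost > self.max_epsilon:
            return False
        self.spent += epsilon_cost
        return True

perm = BudgetedPermission(grantee="alice", max_epsilon=1.0)
print(perm.authorize("alice", 0.4))  # True
print(perm.authorize("alice", 0.7))  # False: would exceed the budget
print(perm.authorize("bob", 0.1))    # False: not the grantee
```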

Data Preview

  • The objective would be to give access to a starting kit: a small percentage of the data (10%), highly anonymized, that would give a data scientist material to prepare their algorithms and models.
  • Reflections have started on the subject but have not yet been implemented.
  • This would mean having an “additional” dataset alongside the train and test data.
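A minimal sketch of what producing such a starter kit could look like. Everything here is illustrative, not a Substra API: `anonymize` stands in for whatever scrubbing the data provider applies (dropping identifiers, coarsening values, etc.).

```python
import random

def make_preview(dataset, fraction=0.10, anonymize=lambda r: r, seed=0):
    """Hypothetical 'starter kit' builder: a small anonymized sample
    of a dataset, produced alongside the train and test splits."""
    rng = random.Random(seed)  # deterministic so the preview is reproducible
    k = max(1, int(len(dataset) * fraction))
    return [anonymize(dict(row)) for row in rng.sample(dataset, k)]

# Toy records; real anonymization would of course go beyond dropping an id.
patients = [{"id": i, "age": 20 + i % 60, "diagnosis": "..."} for i in range(200)]
drop_id = lambda row: {k: v for k, v in row.items() if k != "id"}
preview = make_preview(patients, fraction=0.10, anonymize=drop_id)
print(len(preview))  # 20 rows, none with an "id" field
```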

Secure aggregation

  • An implementation has been done for MELLODDY Project but not open sourced at this stage.
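Since the MELLODDY implementation is not open sourced, here is a generic sketch of the standard additive-masking idea behind secure aggregation (not the MELLODDY code): each pair of partners agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum while individual updates stay hidden from the aggregator.

```python
import random

def masked_updates(updates, modulus=2**32):
    """Pairwise additive-mask sketch of secure aggregation.
    Illustrative only: real protocols add key agreement and
    dropout recovery on top of this cancellation trick."""
    n = len(updates)
    masked = [u % modulus for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = random.randrange(modulus)  # shared secret between i and j
            masked[i] = (masked[i] + mask) % modulus
            masked[j] = (masked[j] - mask) % modulus
    return masked

updates = [5, 11, 7]  # each partner's (integer-encoded) model update
masked = masked_updates(updates)
aggregate = sum(masked) % 2**32
print(aggregate)  # 23: the sum is recovered, individual updates are not
```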

Connect-to-partner

  • The feature is described on a dedicated issue.
  • [Eric] It’s enthralling and fun, but quite far away! We are still in an R&D, PoCs, first-usages phase. This feature idea is more for the long-term horizon of a steady growth of usage.

Pre-processing of Data

  • Today, data pre-processing is carried out outside Substra, between the different partners.
  • Example:
    • [Romain] Healthchain project: Going back and forth between hospitals. Very manual. No traceability.
  • An advantage of doing pre-processing through Substra is also that it gives the data scientist more freedom to explore and test different possibilities.
  • [Céline] What could be interesting too is to give general information about the database. For example, in a project with images, it could be useful to normalize the inputs but then the mean and the variance of the train dataset should be known.
  • [Nathanaël] The fake data functionality in the opener can be useful in some cases (to be further developed in the issue comments).
  • [Eric] In multi-partner ML projects the different partners always need to align on the data pre-processing pipeline. So in all cases there has to be a workgroup that meets and discusses the target format, normalization parameters, etc. It is difficult to imagine this happening differently. Knowing that, the benefit of integrating pre-processing capabilities into the Substra Framework is less obvious to me.
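Céline’s normalization point can be made concrete with a small sketch: all partners must normalize with the same train-set statistics, so those statistics have to be computed (or agreed on) somewhere. The functions below are illustrative; in practice the statistics would be per channel/feature and settled by the pre-processing workgroup Eric describes.

```python
import math

def train_statistics(values):
    """Mean and standard deviation of the training data, to be shared
    across partners so everyone normalizes inputs identically."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalize(value, mean, std):
    """Standard-score normalization using the agreed train statistics."""
    return (value - mean) / std

train_pixels = [0.0, 0.25, 0.5, 0.75, 1.0]  # toy image intensities
mean, std = train_statistics(train_pixels)
# Every partner applies the SAME train-set statistics, even to its own data.
normalized = [normalize(v, mean, std) for v in train_pixels]
```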

Other topics / Misc.

  • [Chris] What should be done to optimize the use of multiple GPUs (e.g. use a minikube addon?)
    • [Eric] I don’t remember projects or use cases where this has been done. The impact on duration / budget is complex (e.g. computing power increases, but so does the orchestration overhead). A resource to explore: https://github.com/horovod/horovod