
Minutes MAP Open source product committee #2 10 09 2020

Clément Mayer edited this page Sep 16, 2020 · 1 revision

Thank you to everybody for joining the session and sharing your thoughts and resources!

Participants

  • Amine Saboni (Octo Technology)
  • Céline Jacques (Apricity)
  • Chris-Alexandre Pena (Apricity)
  • Clément Mayer (Substra Foundation)
  • Eric Boniface (Substra Foundation)
  • Nathanaël Cretin (Substra Foundation)
  • Arthur Pignet (Substra Foundation)
  • Nicolas Landel (Substra Foundation)
  • Camille Marini (Owkin / Substra Foundation)
  • Mathieu Galtier (Owkin / Substra Foundation)
  • Romain Goussault (CHU Nantes / Substra Foundation)

Minutes

Open source product organization

The objectives of the MAP Committee are:

  1. Identify and collect features and requests for features as they arise.
  2. Ensure that these features are well expressed and understood by the community.
  3. Define which of these expressed features have higher priority, in the most desirable direction for the evolution of the framework. This enables assembling an indicative roadmap of future priorities.

The committee meets quarterly, and anyone is welcome to fuel the open repository with comments, thoughts, ideas, contribution proposals, etc. The sections below give a complementary status to date.

Review of the different ideas / features

Note: the issues (i.e. features ideas and requests) are assembled in a per advancement status view here (statuses are: Ideas & Requests, Some thoughts in progress, Work in progress, Dismissed ideas)

PETs complementarity / integrations

  • A first example of differential privacy combined with Substra has been developed by Fabien Gelus (former intern at Substra Foundation):
    • Exploration of tools for differential privacy by Fabien Gelus (before Opacus release by FB)
    • Use case of DP on Substra done by Fabien (with tensorflow-privacy)
  • This first example is a great step to demonstrate that it is possible to use DP libraries like tensorflow-privacy in ML configurations using the Substra Framework
  • [Eric] Another tool can be explored as a use case: Opacus by Facebook, released recently, seems to have the ambition to become a new standard
  • It would also be interesting to try a “real” use case with Substra and DP, and not only on MNIST → use case to be identified. For example, Google used Differential Privacy to monitor people’s mobility during the COVID-19 crisis.
  • [Amine] See this list of resources on DP.
  • [Mathieu] Could this privacy be measured from a budget point of view?
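To make Mathieu’s budget question concrete, here is a minimal sketch of the mechanism that libraries like tensorflow-privacy and Opacus automate: per-example gradient clipping plus calibrated Gaussian noise (DP-SGD). The function name and toy gradients are illustrative, not part of any library’s API; the `noise_multiplier` is precisely what those libraries’ accountants translate into a spent epsilon budget.

```python
import math
import random

def dp_gradient_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD aggregation step: clip each example's gradient to
    clip_norm, sum the clipped gradients, then add Gaussian noise
    scaled to the clipping norm. Hypothetical sketch of what
    tensorflow-privacy / Opacus do under the hood."""
    clipped_sum = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            clipped_sum[i] += g * scale
    sigma = noise_multiplier * clip_norm
    noisy = [s + random.gauss(0.0, sigma) for s in clipped_sum]
    return [v / len(per_example_grads) for v in noisy]

grads = [[0.3, -0.1], [2.0, 2.0], [0.0, 0.5]]  # toy per-example gradients
step = dp_gradient_step(grads)
```

Because Substra is agnostic to the training code, such a step could in principle live inside any algo script, which is what Fabien’s tensorflow-privacy example demonstrated.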

Contributivity / Multi-partners learning

  • Work is in progress in the contributivity workgroup (see dedicated repository). The package already offers 8 to 9 different calculation methods. No large benchmark has been done and no findings have been shared yet.
  • Comments by Eric on the dedicated issue:
    • It seems possible to design a large compute plan with all the train and test tasks needed to run certain contributivity measurement methods
    • Interesting next steps:
      • a quick review by Camille to check whether Eric’s first thoughts are coherent
      • include working on a “PoC” of a simple contributivity measurement approach on Substra in Arthur’s internship roadmap
      • identify another dataset in order to gather other data / thoughts
  • Want to join the discussion? Join the public Slack channel #workgroup-mpl-contributivity and participate in the discussions and the dedicated repositories.
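As a reference point for the workgroup’s calculation methods, a minimal exact Shapley-value computation can be sketched in a few lines. The `coalition_score` callable is a stand-in for training and evaluating a model on the union of the partners’ datasets, which on Substra would correspond to the large compute plan of train/test tasks Eric mentions; the exact version needs 2^n evaluations, hence the interest in cheaper approximations.

```python
from itertools import combinations
from math import factorial

def shapley_contributivity(partners, coalition_score):
    """Exact Shapley value of each partner, given a score for every
    coalition of partners. Illustrative sketch, not the workgroup's
    package: coalition_score(coalition) stands in for 'test metric of
    a model trained on this coalition's pooled data'."""
    n = len(partners)
    values = {p: 0.0 for p in partners}
    for p in partners:
        others = [q for q in partners if q != p]
        for size in range(n):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_p = coalition_score(set(coalition) | {p})
                without_p = coalition_score(set(coalition))
                values[p] += weight * (with_p - without_p)
    return values

# Toy score: each partner's data adds a fixed amount of test accuracy.
data_quality = {"A": 0.5, "B": 0.3, "C": 0.2}
score = lambda coalition: sum(data_quality[p] for p in coalition)
contributions = shapley_contributivity(["A", "B", "C"], score)
```

For such an additive score the Shapley values simply equal each partner’s own contribution; real datasets interact, which is exactly what the benchmark on another dataset would probe.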

Model lineage / End-to-end genealogy

  • The objective here is to create an identity card of a model that links it to datasets, algorithms, and initial models; it includes its “genealogy”!
  • See thoughts by Clément on the dedicated issue.
  • [Camille] There is a very early draft in the frontend already, a sort of simple identity card. It is not yet clean, readable, or user-friendly. It could be great to work on improving that!
  • The idea is to move forward in the different work groups on this ID card, and to see how to integrate it into the Substra Framework.
  • [Amine] About monitoring data quality, and how it impacts model performance, continuous learning setups (or regularly re-trained models).
    • [Camille] This is rather done outside the framework today but we could probably imagine integrations with external tools via queries.
  • Next step: the topic will be studied during the new season of DataForGood!
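A possible shape for such an identity card, as a sketch only: the field names below are illustrative and do not reflect Substra’s actual asset schema (in Substra, assets are addressed by keys/hashes). The fingerprint idea shows how the genealogy itself could become a verifiable provenance check.

```python
from dataclasses import dataclass, field
import hashlib
import json

@dataclass
class ModelCard:
    """Hypothetical 'identity card' linking a model to its lineage:
    the algo and datasets that produced it, and its parent models."""
    model_key: str
    algo_key: str
    dataset_keys: list
    parent_model_keys: list = field(default_factory=list)  # genealogy

    def fingerprint(self):
        """Stable hash of the full lineage, usable as a provenance check."""
        payload = json.dumps(self.__dict__, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

card = ModelCard(
    model_key="model-001",
    algo_key="algo-cnn-v2",
    dataset_keys=["hospital-a-train", "hospital-b-train"],
    parent_model_keys=["model-000"],  # the initial model it was trained from
)
print(card.fingerprint()[:12])
```

Walking the `parent_model_keys` recursively would yield the full end-to-end genealogy the committee discussed.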

Set up a testnet of Substra and documentation

  • [Nathanaël, Chris] We succeeded today in a local setup, following very clear documentation. A setup on a node with two organizations on a virtual machine has also been done and can be made available upon request.
  • [Mathieu] It would be super interesting to have a few examples easily executable on a publicly accessible instance (e.g. OVH). Budget might be a topic though
    • [Eric] Budget could be covered by OVH sponsorship
    • [Chris] In such a setup, couldn’t there be a problem of resource accumulation?
    • [Camille] It can be handled with some simple hacks to launch several learning tasks on Titanic for example (adding a random part to the algo script, etc.)
  • → The next step is to install a test instance on OVH in order to allow a Data Scientist to play with Substra without having to perform a deployment
  • [Amine] Possibility to package the VM on OVH or AWS marketplaces?
    • Nathanaël and Eric will look into that

Model Distillation

  • [Mathieu] Distillation means going from a base model to a “student” model, a kind of reduced and scrambled version. It provides privacy guarantees and is well described in the literature. It happens at each iteration as a post-processing step after obtaining gradients, and is implementable on Substra as additional operations in learning algos (transparent thanks to Substra being agnostic to learning algos).
  • [Eric] Are there existing libraries for that (like tensorflow-privacy and Opacus for DP)?
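For reference, the core of the classic (Hinton-style) distillation objective fits in a few lines: the student is trained against the teacher’s temperature-softened output distribution. This is a generic sketch, not Substra code; as Mathieu notes, it could run as an extra step inside any training algo since the framework does not constrain the algo’s internals.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; T > 1 'softens' the distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened outputs against the
    teacher's softened outputs. Illustrative sketch of the standard
    distillation objective, minimized by gradient descent in practice."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

loss = distillation_loss([4.0, 1.0, 0.2], [3.5, 1.2, 0.3])
```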

Composite Training

  • [Camille] All the developments made are open source.

Advanced permissions

  • [Chris, Céline] Are there ongoing developments on that aspect? What concepts or approaches of permissions are considered?
  • [Amine] An idea: we could imagine permissions (e.g. granted by dataset providers to data scientist) conditional to a privacy budget (differential privacy epsilon)
  • [Camille] There is no new development at this stage. Owkin will probably work on it in Q4.
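Amine’s idea of permissions conditional on a privacy budget can be sketched as follows. This is not an existing Substra concept, only an illustration of the proposal: the dataset provider grants a data scientist access up to a total differential-privacy epsilon, and tasks are refused once the budget would be exceeded.

```python
class BudgetedPermission:
    """Hypothetical permission object sketching Amine's proposal:
    access granted by a dataset provider to a data scientist is
    conditional on a differential-privacy budget (epsilon)."""

    def __init__(self, grantee, max_epsilon):
        self.grantee = grantee
        self.max_epsilon = max_epsilon
        self.spent = 0.0

    def authorize(self, user, epsilon_cost):
        """Allow a task only while the grantee stays within budget."""
        if user != self.grantee:
            return False
        if self.spent + epsilon_cost > self.max_epsilon:
            return False
        self.spent += epsilon_cost
        return True

perm = BudgetedPermission(grantee="alice", max_epsilon=1.0)
print(perm.authorize("alice", 0.4))  # True
print(perm.authorize("alice", 0.7))  # False: would exceed the budget
print(perm.authorize("bob", 0.1))    # False: not the grantee
```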

Data Preview

  • The objective would be to give access to a starting kit: a small percentage of the data (10%), highly anonymized, that would give a data scientist material to prepare their algorithms and models.
  • Reflections have started on the subject but have not yet been implemented.
  • This would mean having an “additional” dataset alongside the train and test data.
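A minimal sketch of what producing such a starter kit could look like. Everything here is illustrative, not a Substra API: `anonymize` stands in for whatever scrubbing the data provider applies (dropping identifiers, coarsening values, etc.).

```python
import random

def make_preview(dataset, fraction=0.10, anonymize=lambda r: r, seed=0):
    """Hypothetical 'starter kit' builder: a small anonymized sample
    of a dataset, produced alongside the train and test splits."""
    rng = random.Random(seed)  # deterministic so the preview is reproducible
    k = max(1, int(len(dataset) * fraction))
    return [anonymize(dict(row)) for row in rng.sample(dataset, k)]

# Toy records; real anonymization would of course go beyond dropping an id.
patients = [{"id": i, "age": 20 + i % 60, "diagnosis": "..."} for i in range(200)]
drop_id = lambda row: {k: v for k, v in row.items() if k != "id"}
preview = make_preview(patients, fraction=0.10, anonymize=drop_id)
print(len(preview))  # 20 rows, none with an "id" field
```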

Secure aggregation

  • An implementation has been done for MELLODDY Project but not open sourced at this stage.
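Since the MELLODDY implementation is not open sourced, here is a generic sketch of the standard additive-masking idea behind secure aggregation (not the MELLODDY code): each pair of partners agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum while individual updates stay hidden from the aggregator.

```python
import random

def masked_updates(updates, modulus=2**32):
    """Pairwise additive-mask sketch of secure aggregation.
    Illustrative only: real protocols add key agreement and
    dropout recovery on top of this cancellation trick."""
    n = len(updates)
    masked = [u % modulus for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = random.randrange(modulus)  # shared secret between i and j
            masked[i] = (masked[i] + mask) % modulus
            masked[j] = (masked[j] - mask) % modulus
    return masked

updates = [5, 11, 7]  # each partner's (integer-encoded) model update
masked = masked_updates(updates)
aggregate = sum(masked) % 2**32
print(aggregate)  # 23: the sum is recovered, individual updates are not
```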

Connect-to-partner

  • The feature is described on a dedicated issue.
  • [Eric] It’s enthralling and fun, but quite far away! We are still in an R&D, PoCs, first-usages phase. This feature idea is more for the long-term horizon of a steady growth of usage.

Pre-processing of Data

  • Today, data pre-processing is carried out outside Substra, between the different partners.
  • Example:
    • [Romain] Healthchain project: Going back and forth between hospitals. Very manual. No traceability.
  • An advantage of doing pre-processing through Substra is also that it gives the data scientist more freedom to explore and test different possibilities.
  • [Céline] What could be interesting too is to give general information about the database. For example, in a project with images, it could be useful to normalize the inputs but then the mean and the variance of the train dataset should be known.
  • [Nathanaël] The fake data functionality in the opener can be useful in some cases (to be further developed in the issue comments).
  • [Eric] In multi-partner ML projects the different partners always need to align on the data pre-processing pipeline. So in all cases there has to be a workgroup that meets and discusses the target format, normalization parameters, etc. It is difficult to imagine this happening differently. Knowing that, the benefit of integrating pre-processing capabilities into the Substra Framework is less obvious to me.
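Céline’s normalization point can be made concrete with a small sketch: all partners must normalize with the same train-set statistics, so those statistics have to be computed (or agreed on) somewhere. The functions below are illustrative; in practice the statistics would be per channel/feature and settled by the pre-processing workgroup Eric describes.

```python
import math

def train_statistics(values):
    """Mean and standard deviation of the training data, to be shared
    across partners so everyone normalizes inputs identically."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalize(value, mean, std):
    """Standard-score normalization using the agreed train statistics."""
    return (value - mean) / std

train_pixels = [0.0, 0.25, 0.5, 0.75, 1.0]  # toy image intensities
mean, std = train_statistics(train_pixels)
# Every partner applies the SAME train-set statistics, even to its own data.
normalized = [normalize(v, mean, std) for v in train_pixels]
```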

Other topics / Misc.

  • [Chris] What should be done to optimize the use of multiple GPUs (e.g. use a minikube addon?)
    • [Eric] I don’t remember projects or use cases where this has been done. The impact on duration / budget is complex (e.g. computing power increases, but so does the orchestration overhead). A resource to explore: https://github.com/horovod/horovod