
Updated design July 2023

Status of the prototype

As of now, our user interface is a CLI, which tends to over-constrain the use cases, requiring them to re-write most of their code to be compatible with our prototype. Since the proposal requires us to provide a Jupyter-like interface, we could dismiss the current CLI in favor of a more user-friendly interface supporting Jupyter notebooks.

From the proposal, the key contributions of our task are:

  1. Support for different ML frameworks: PyTorch and TensorFlow.
  2. Distributed ML.
  3. Distributed hyper-parameter optimization.
  4. Modular framework which allows for the integration of user-defined modules.
  5. Generalized models registry:
    1. Store trained models, allowing users to pull them for re-training, fine-tuning, and inference.
    2. Associate trained models with their validation metrics.
    3. Store training metadata for reproducibility (BONUS).
  6. Jupyter-like interface for code development and visualizations (e.g., *.py, *.ipynb files), and graphical pipelines (e.g., Elyra).

A possible solution to address the lack of flexibility of our current CLI and move towards the proposal consists of:

  • itwinai as a Python library that can be installed with pip install itwinai, providing an abstraction layer over different ML frameworks to address requirements 1 to 4. It provides easy-to-use Trainer classes, which distribute ML training/inference and HPO in a seamless way, hiding all the complexity from the use case. By extending these Trainer classes, the use cases will be able to customize them to their specific needs, inheriting the logic for distributed ML/HPO from the parent classes without needing to understand the technicalities. itwinai's Trainer classes will be accessed directly from within a notebook environment or from the use case's custom training script: itwinai will only provide the minimal amount of code necessary to accomplish the aforementioned tasks, requiring as few modifications as possible to the use cases' original code. itwinai will also provide an abstraction over logging, allowing different use cases to adopt their preferred logging framework (e.g., MLflow, WandB). A minimal sketch of this idea follows the list below.
  • Kubeflow for requirements 4 to 6. The CERN use case is already using this solution, employing Kubeflow also to manage Geant4 from within a container. Apparently, Kubeflow can easily "orchestrate" containers.
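
To make the first point more concrete, the snippet below is a minimal sketch of how a use case could extend an itwinai Trainer class and swap logging backends. The `TorchTrainer`, `Logger`, `ConsoleLogger`, and `MyUseCaseTrainer` names are illustrative assumptions, not the current itwinai API; in the envisioned design, distributed training and HPO would be hidden behind the same interface.

```python
# Sketch only: these classes illustrate the kind of API itwinai could expose;
# they are NOT the current itwinai interface.
from abc import ABC, abstractmethod

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class Logger(ABC):
    """Abstraction over ML loggers (e.g., MLflow, WandB backends)."""

    @abstractmethod
    def log_metric(self, name: str, value: float, step: int) -> None:
        ...


class ConsoleLogger(Logger):
    """Minimal stand-in backend; an MLflow or WandB logger would implement
    the same interface."""

    def log_metric(self, name: str, value: float, step: int) -> None:
        print(f"[step {step}] {name}={value:.4f}")


class TorchTrainer:
    """Hypothetical base class: in the envisioned design it would also set up
    distributed training and HPO, hiding that complexity from the use case."""

    def __init__(self, model: nn.Module, logger: Logger):
        self.model = model
        self.logger = logger

    def train(self, loader: DataLoader, epochs: int = 1) -> None:
        optimizer = torch.optim.Adam(self.model.parameters())
        loss_fn = nn.MSELoss()
        step = 0
        for _ in range(epochs):
            for x, y in loader:
                optimizer.zero_grad()
                loss = loss_fn(self.model(x), y)
                loss.backward()
                optimizer.step()
                self.logger.log_metric("loss", loss.item(), step)
                step += 1


class MyUseCaseTrainer(TorchTrainer):
    """A use case overrides only what it needs, inheriting the (distributed)
    training logic from the parent class."""

    def train(self, loader: DataLoader, epochs: int = 1) -> None:
        # Use-case-specific preparation or callbacks could go here.
        super().train(loader, epochs)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    trainer = MyUseCaseTrainer(nn.Linear(8, 1), logger=ConsoleLogger())
    trainer.train(DataLoader(data, batch_size=16), epochs=2)
```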

Challenges of the proposed solution:

  • Orchestrating containers with Kubeflow has not been directly tested yet. However, an example exists here, and a minimal Kubeflow Pipelines sketch follows this list.
  • How to offload computation to HPC systems from Kubeflow in a seamless way is still unknown. @Rakesh has some examples of how this is done at FZJ.
  • It is not clear whether Kubeflow can support different ML logging strategies to fulfill different use cases' needs (e.g., MLflow, WandB), nor how these alternative solutions integrate with Kubeflow's native model registry and ML logs.
  • How to integrate with a distributed data lake (shared code, ML logs, etc.) is not yet clear.
  • Is it possible to use a Docker container as a Jupyter kernel?
  • How to launch a Kubeflow pipeline from an external workflow manager is still an open question.
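
Regarding the first challenge, the sketch below shows how containers could be chained as steps of a Kubeflow pipeline with the v1 kfp SDK (dsl.ContainerOp was removed in kfp v2, where container components take its place). The image names and commands are placeholders, not a tested Geant4 setup.

```python
# Minimal Kubeflow Pipelines (kfp v1 SDK) sketch of orchestrating containers.
# Image names and commands are placeholders.
from kfp import compiler, dsl


@dsl.pipeline(name="container-orchestration-example")
def simulation_training_pipeline():
    # Each ContainerOp runs the given image as a pod in the cluster.
    simulate = dsl.ContainerOp(
        name="run-geant4",
        image="registry.example.com/geant4-sim:latest",  # placeholder image
        command=["bash", "-c"],
        arguments=["./run_simulation.sh"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="registry.example.com/itwinai-train:latest",  # placeholder image
        command=["python", "train.py"],
    )
    # Training starts only after the simulation step has finished.
    train.after(simulate)


if __name__ == "__main__":
    # Compile to a spec that can be uploaded to a Kubeflow instance.
    compiler.Compiler().compile(simulation_training_pipeline, "pipeline.yaml")
```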

Notes:
