Add docs for lakeFS<> mlflow integration #8599

talSofer · 2025-02-04T11:51:30Z

Closes #8598

github-actions · 2025-02-04T11:52:53Z

♻️ PR Preview 7d9d51f has been successfully destroyed since this PR has been closed.

🤖 By surge-preview

github-actions · 2025-02-04T11:56:32Z

E2E Test Results - DynamoDB Local - Local Block Adapter

github-actions · 2025-02-04T11:58:52Z

E2E Test Results - Quickstart

nopcoder · 2025-02-05T11:51:36Z

docs/integrations/mlflow.md

+1. **Experiment Reproducibility**: By leveraging MLflow's [input logging](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_input)
+capabilities alongside lakeFS's data versioning, you can precisely track the specific dataset version used in each experiment
+run. This ensures that experiments remain reproducible over time, even as datasets evolve.


I'm unsure whether using lakeFS as a versioning engine for logging would be beneficial.

Assume an MLflow tracking server where runs, metrics, params, and tags are stored in Postgres, while the models, images, configs, and logs are on lakeFS.

Talked with Tal: I got it wrong. The data that MLflow logs is the data source location, not the logging data I thought the experiment produces.
You can ignore my comment.

nopcoder · 2025-02-05T12:20:03Z

docs/integrations/mlflow.md

+To determine whether two distinct MLflow runs utilized the same input dataset, you can compare specific attributes of 
+their logged Dataset objects. The source attribute, which contains the versioned dataset's URI, is a common choice for 
+this comparison. Here's an example:


How does the following approach fit with how MLflow compares the results? https://mlflow.org/docs/latest/getting-started/quickstart-2/index.html#compare-the-results

This is orthogonal to their guide, in my opinion. We let MLflow users drill down into additional run details, which they can use to reproduce experiment results or troubleshoot.
Would it add clarity if I explained why you would compare run inputs?

Talked with Tal and discussed the option to use the SDK to compare the actual data, or to provide more information about the data source from the commit.
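
For readers of this thread, the comparison described in the quoted doc text can be sketched roughly as follows. This is not the example from `docs/integrations/mlflow.md`; it is a minimal illustration that assumes each run object looks like the result of `mlflow.get_run(run_id)`, where `run.inputs.dataset_inputs` is a list of items exposing `.dataset.source`. The `fake_run` helper and the source strings are hypothetical stand-ins so the sketch runs without a tracking server.

```python
from types import SimpleNamespace


def dataset_sources(run):
    """Collect the `source` attribute of every dataset logged as a run input.

    `run` is assumed to have the shape of an MLflow Run entity:
    run.inputs.dataset_inputs[i].dataset.source
    """
    return {di.dataset.source for di in run.inputs.dataset_inputs}


def same_inputs(run_a, run_b):
    """Return True when both runs logged exactly the same dataset sources."""
    return dataset_sources(run_a) == dataset_sources(run_b)


def fake_run(*sources):
    """Hypothetical stand-in for an MLflow run, used only for illustration."""
    dataset_inputs = [
        SimpleNamespace(dataset=SimpleNamespace(source=s)) for s in sources
    ]
    return SimpleNamespace(inputs=SimpleNamespace(dataset_inputs=dataset_inputs))


if __name__ == "__main__":
    # Placeholder source strings; real values would carry the versioned URI.
    run_a = fake_run("source-at-commit-1")
    run_b = fake_run("source-at-commit-1")
    run_c = fake_run("source-at-commit-2")
    print(same_inputs(run_a, run_b))
    print(same_inputs(run_a, run_c))
```

With a real tracking server, `fake_run(...)` would be replaced by `mlflow.get_run(run_id)`. Comparing by `source` only tells you whether the runs point at the same versioned location; comparing the underlying data itself, as discussed above, would be a separate step.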

nopcoder · 2025-02-05T13:49:34Z

docs/integrations/mlflow.md

+1. **Experiment Reproducibility**: By leveraging MLflow's [input logging](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_input)
+capabilities alongside lakeFS's data versioning, you can precisely track the specific dataset version used in each experiment
+run. This ensures that experiments remain reproducible over time, even as datasets evolve.


Talked with Tal: I got it wrong. The data that MLflow logs is the data source location, not the logging data I thought the experiment produces.
You can ignore my comment.

talSofer added 10 commits February 2, 2025 10:48

docs structure

2166046

tmp docs

3cd21be

tmp2

e71da19

working spark example

5009f59

complete spark example

57cb5cb

complete Python example

9557722

good structure

e4197a8

intro

58c14fe

how-to section

98a9cef

final with logo

d21c6ed

talSofer added the docs and exclude-changelog labels Feb 4, 2025

talSofer requested a review from ozkatz February 4, 2025 11:51

talSofer requested a review from nopcoder February 4, 2025 12:47

Merge branch 'master' into docs/mlflow-integration

025e83c

nopcoder requested changes Feb 5, 2025


nopcoder approved these changes Feb 5, 2025


add lakefs next steps to runs comparison

7d9d51f

talSofer merged commit 3f4af4f into master Feb 6, 2025
39 checks passed

talSofer deleted the docs/mlflow-integration branch February 6, 2025 17:02
