Add docs for lakeFS<> mlflow integration #8599

Merged: 12 commits, Feb 6, 2025

@talSofer talSofer commented Feb 4, 2025

Closes #8598

@talSofer added the docs (Improvements or additions to documentation) and exclude-changelog (PR description should not be included in next release changelog) labels Feb 4, 2025
@talSofer talSofer requested a review from ozkatz February 4, 2025 11:51

github-actions bot commented Feb 4, 2025

♻️ PR Preview 7d9d51f has been successfully destroyed since this PR has been closed.

🤖 By surge-preview

github-actions bot commented Feb 4, 2025

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed

github-actions bot commented Feb 4, 2025

E2E Test Results - Quickstart

11 passed

@talSofer talSofer requested a review from nopcoder February 4, 2025 12:47
Comment on lines +22 to +24
1. **Experiment Reproducibility**: By leveraging MLflow's [input logging](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_input)
capabilities alongside lakeFS's data versioning, you can precisely track the specific dataset version used in each experiment
run. This ensures that experiments remain reproducible over time, even as datasets evolve.
Contributor

I'm not sure that using lakeFS as a versioning engine for logging would be beneficial.

Consider an MLflow tracking server where runs, metrics, params, and tags are stored in Postgres, while the models, images, configs, and logs are stored on lakeFS.

Contributor

Talked with Tal: I got it wrong. What MLflow logs is the data source location, not the logging data I thought the experiment produces.
You can ignore my comment.
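
To make the point settled in this thread concrete, here is a minimal sketch of what gets recorded as a run input: a source location that identifies a dataset version, not the data itself. The helper name `versioned_source_uri`, the repository name, the refs, and the path are all hypothetical, and the `s3://<repository>/<ref>/<path>` shape assumes the data is read through the lakeFS S3 gateway.

```python
def versioned_source_uri(repo: str, ref: str, path: str) -> str:
    """Return an S3-gateway-style URI for `path` at a given lakeFS ref.

    Assumption: the repository acts as the bucket and the ref (branch,
    tag, or commit ID) is the first segment of the object key.
    """
    return f"s3://{repo}/{ref}/{path.lstrip('/')}"


# Hypothetical repository, refs, and path, for illustration only.
branch_uri = versioned_source_uri("example-repo", "main", "datasets/train.csv")
pinned_uri = versioned_source_uri("example-repo", "a1b2c3d4", "datasets/train.csv")

print(branch_uri)  # s3://example-repo/main/datasets/train.csv
print(pinned_uri)  # s3://example-repo/a1b2c3d4/datasets/train.csv
```

A value like `pinned_uri` is the kind of string that could be passed as the `source` argument of `mlflow.data.from_pandas` before calling `mlflow.log_input`. A branch name such as `main` keeps moving as new commits land, so only the commit-pinned form identifies a fixed dataset version.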

Comment on lines +230 to +232
To determine whether two distinct MLflow runs utilized the same input dataset, you can compare specific attributes of
their logged Dataset objects. The source attribute, which contains the versioned dataset's URI, is a common choice for
this comparison. Here's an example:
Contributor

How does the following approach fit with how MLflow compares results? https://mlflow.org/docs/latest/getting-started/quickstart-2/index.html#compare-the-results

Contributor Author

This is orthogonal to their guide, in my opinion. We let MLflow users drill down into additional run details, which they can use to reproduce experiment results or troubleshoot.
Would it add clarity if I explained why you would compare the inputs of two runs?

Contributor

Talked with Tal and discussed the option of using the SDK to compare the actual data, or of providing more information about the data source from the commit.
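
For readers following this thread, the source-based comparison under discussion could look roughly like the sketch below. It is not the example from the PR. It assumes the logged dataset source is available as a JSON string that carries its location under a `uri` or `url` key, which should be verified against the MLflow version in use. The function names and the sample URIs are hypothetical.

```python
import json
from typing import Optional


def source_location(source_json: str) -> Optional[str]:
    """Extract the location from a serialized dataset source.

    Assumption: the source is a JSON object with a "uri" or "url" key.
    """
    data = json.loads(source_json)
    return data.get("uri") or data.get("url")


def used_same_input(source_a: str, source_b: str) -> bool:
    """Return True when both sources point at the same versioned location."""
    loc_a = source_location(source_a)
    loc_b = source_location(source_b)
    return loc_a is not None and loc_a == loc_b


# Hypothetical serialized sources, as they might appear on three runs.
run_1 = json.dumps({"uri": "s3://example-repo/a1b2c3d4/datasets/train.csv"})
run_2 = json.dumps({"uri": "s3://example-repo/a1b2c3d4/datasets/train.csv"})
run_3 = json.dumps({"uri": "s3://example-repo/e5f6a7b8/datasets/train.csv"})

print(used_same_input(run_1, run_2))  # True
print(used_same_input(run_1, run_3))  # False
```

With MLflow, the serialized source of a run's logged dataset can be read from `mlflow.get_run(run_id).inputs.dataset_inputs[i].dataset.source`. Equal URIs only imply identical data when the ref in the URI is a commit ID. Comparing the actual data through the SDK, as raised above, is a separate and stronger check.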

@talSofer talSofer merged commit 3f4af4f into master Feb 6, 2025
39 checks passed
@talSofer talSofer deleted the docs/mlflow-integration branch February 6, 2025 17:02
Linked issue: Integration docs: lakeFS & MLflow
2 participants