Add docs for lakeFS <> MLflow integration #8599
Conversation
♻️ PR Preview 7d9d51f has been successfully destroyed since this PR has been closed. 🤖 By surge-preview
1. **Experiment Reproducibility**: By leveraging MLflow's [input logging](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_input) capabilities alongside lakeFS's data versioning, you can precisely track the specific dataset version used in each experiment run. This ensures that experiments remain reproducible over time, even as datasets evolve.
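For illustration, a minimal sketch of what such input logging could look like. The repository name, commit ID, and object path are placeholders, and reading `lakefs://` URIs with pandas assumes the lakefs-spec fsspec plugin is installed:

```python
import mlflow
import pandas as pd

# Placeholder lakeFS URI pinned to a specific commit, so the exact
# dataset version is recorded rather than a moving branch head.
uri = "lakefs://example-repo/commit-id-placeholder/datasets/train.parquet"

# Reading lakefs:// paths directly requires the lakefs-spec fsspec plugin.
df = pd.read_parquet(uri)

# Wrap the DataFrame as an MLflow Dataset whose source is the versioned URI,
# then attach it to the run as an input.
dataset = mlflow.data.from_pandas(df, source=uri, name="train")
with mlflow.start_run():
    mlflow.log_input(dataset, context="training")
```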
I'm unsure whether using lakeFS as a versioning engine for logging would be beneficial.
Assume an MLflow tracking server where runs, metrics, params, and tags are stored in Postgres, while the models, images, configs, and logs are on lakeFS.
Talked with Tal; I got it wrong. The data that MLflow logs is the data source location, not the output data I thought the experiment produced.
You can ignore my comment.
To determine whether two distinct MLflow runs utilized the same input dataset, you can compare specific attributes of their logged Dataset objects. The source attribute, which contains the versioned dataset's URI, is a common choice for this comparison. Here's an example:
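(The example itself isn't reproduced in this thread; the following is a rough sketch of the idea, assuming MLflow's tracking client, with placeholder run IDs.)

```python
import mlflow

client = mlflow.tracking.MlflowClient()

def dataset_sources(run_id: str) -> set:
    """Collect the `source` URIs of every dataset logged as an input to a run."""
    run = client.get_run(run_id)
    return {inp.dataset.source for inp in run.inputs.dataset_inputs}

# Placeholder run IDs for the two runs being compared.
sources_a = dataset_sources("run_id_a")
sources_b = dataset_sources("run_id_b")

# Because a lakeFS URI pins repository/commit/path, matching sources mean
# the runs consumed the exact same dataset version.
print("Same input dataset:", bool(sources_a & sources_b))
```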
How does this approach fit with how MLflow compares results? https://mlflow.org/docs/latest/getting-started/quickstart-2/index.html#compare-the-results
This is orthogonal to their guide, in my opinion. We let MLflow users drill down into additional run details, which they can use to reproduce experiment results or troubleshoot.
Would it add clarity if I explained why one would compare runs' inputs?
Talked with Tal and discussed the option of using the SDK to compare the actual data, or of providing more information about the data source from the commit.
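As a rough illustration of that second option, a sketch assuming the high-level `lakefs` Python SDK; the repository name and commit IDs are placeholders, and credentials are assumed to come from the usual lakectl/environment configuration:

```python
import lakefs

repo = lakefs.repository("example-repo")  # placeholder repository name

def commit_info(ref_id: str) -> dict:
    """Resolve a ref to its commit and surface metadata worth attaching to a run."""
    commit = repo.ref(ref_id).get_commit()
    return {
        "id": commit.id,
        "message": commit.message,
        "metadata": commit.metadata,
    }

# Placeholder commit IDs taken from two runs' logged dataset sources.
info_a = commit_info("commit_id_a")
info_b = commit_info("commit_id_b")

# Identical commit IDs imply the runs read the exact same data snapshot.
print("Same data version:", info_a["id"] == info_b["id"])
```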
Closes #8598