
Feature/Example for training KFP v1 #2118

Closed
votti wants to merge 26 commits from feature/kfpv1-metricscollector

Conversation


@votti votti commented Feb 15, 2023

What this PR does / why we need it:

Currently there is no example of how to do parameter tuning over whole Kubeflow pipelines (#1914).
This example shows how parameter tuning with Katib can be done on a Kubeflow pipeline built with the pipeline v1 SDK.
It adds a new Metrics Collector (kfpv1-metricscollector, #2019), which is used as a custom metrics collector in the pipeline.
The kfpv1-metricscollector image is also published alongside the other Katib Docker images.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1914, #2019

Checklist:

  • Docs included if any changes are user facing

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@google-cla

google-cla bot commented Feb 15, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@tenzen-y (Member) left a comment

@votti Thanks for creating this PR, and sorry for the late response.
I left a few review comments for the first pass.

“Currently the docker image with the custom metrics collector lives in my personal Docker Hub account. Maybe it would be better to add it to the kubeflow organisation? What would the steps be to achieve this?”

Yes, we should build and push the image to our container registry.
You can add the image name and dockerfile path to the following:

strategy:
  fail-fast: false
  matrix:
    include:
      - component-name: katib-controller
        dockerfile: cmd/katib-controller/v1beta1/Dockerfile
      - component-name: katib-db-manager
        dockerfile: cmd/db-manager/v1beta1/Dockerfile
      - component-name: katib-ui
        dockerfile: cmd/new-ui/v1beta1/Dockerfile
      - component-name: cert-generator
        dockerfile: cmd/cert-generator/v1beta1/Dockerfile
      - component-name: file-metrics-collector
        dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile
      - component-name: tfevent-metrics-collector
        dockerfile: cmd/metricscollector/v1beta1/tfevent-metricscollector/Dockerfile

Moreover, can you add build steps for the image to https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/build.sh and https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/push.sh?

@votti

votti commented Mar 16, 2023

@tenzen-y Thanks a lot for your comments!
I have now weaved the kfpv1-metricscollector docker image into the build process and addressed the minor comments.

@tenzen-y

@votti Thanks for your diligent work.

We must add an E2E test for this metrics collector. However, I think we could work on it in follow-up PRs.
Which do you prefer, adding E2E in this PR or in other PRs?

@kubeflow/wg-automl-leads Can you approve CI?

@votti

votti commented Jun 21, 2023

Hi Kubeflow team,

First, sorry for not finding time to work on this earlier.

"We must add an E2E test for this metrics collector. However, I think we could work on it in follow-up PRs.
Which do you prefer, adding E2E in this PR or in other PRs?"

As far as I know, nobody is actively using this (yet), so I think there would be time to add an E2E test together with this PR, making it easier to maintain long-term.
Would you have some pointers (e.g. existing tests) on how this could be achieved?

@tenzen-y

@votti Thanks for the updates! I'll check this PR again later.

As far as I know, nobody is actively using this (yet), so I think there would be time to add an E2E test together with this PR, making it easier to maintain long-term.
Would you have some pointers (e.g. existing tests) on how this could be achieved?

Maybe we need to update setup-katib.sh and run-e2e-experiment.py:

  • setup-katib.sh: Install Kubeflow Pipeline.
  • run-e2e-experiment.py: Verify the Kubeflow Pipelines integration.

@tenzen-y (Member) left a comment

Thanks for the updates! I left a few comments.
Also, once the tests are done, can you send me a ping?

scripts/v1beta1/build.sh (review thread resolved)
    pool_interval=opt.poll_interval,
    timout=opt.timeout,
    wait_all=wait_all_processes,
    completed_marked_dir=None,
@tenzen-y (Member):

Why do we set `completed_marked_dir` to None? Can we set opt.metrics_file_dir instead?

@votti (Author):

The documentation on this is a bit sparse, but if I understand the code right, this would require the Kubeflow pipeline to write a file <pid>.pid containing the TRAINING_COMPLETED marker into this directory, which it does not do:

if completed_marked_dir:
    mark_file = os.path.join(completed_marked_dir, "{}.pid".format(pid))
    # Check if file contains "completed" marker
    with open(mark_file) as file_obj:
        contents = file_obj.read()
        if contents.strip() != const.TRAINING_COMPLETED:
            raise Exception(
                "Unable to find marker: {} in file: {} with contents: {} for pid: {}".format(
                    const.TRAINING_COMPLETED, mark_file, contents, pid))
# Add main pid to finished pids set

So I think None is correct here.
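For illustration, a minimal sketch of the marker check discussed above (the helper name `check_completed_marker` is made up for this sketch; the marker string "completed" is assumed to match const.TRAINING_COMPLETED in the Katib metrics collector code):

```python
import os
import tempfile

TRAINING_COMPLETED = "completed"  # assumed value of const.TRAINING_COMPLETED

def check_completed_marker(completed_marked_dir, pid):
    # Mirrors the snippet above: when a marker dir is configured, a file
    # <pid>.pid must exist there and contain the completion marker.
    if not completed_marked_dir:
        return True  # no marker dir configured -> nothing to check
    mark_file = os.path.join(completed_marked_dir, "{}.pid".format(pid))
    with open(mark_file) as file_obj:
        contents = file_obj.read()
    return contents.strip() == TRAINING_COMPLETED

# A KFP v1 component never writes such a marker file, so passing None
# (which skips the check) is the behaviour the trial needs:
assert check_completed_marker(None, 1234)

# If a marker dir were configured, the trial process itself would have
# to write the marker file, e.g.:
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "1234.pid"), "w") as f:
        f.write(TRAINING_COMPLETED)
    assert check_completed_marker(d, 1234)
```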

@tenzen-y (Member):

Thanks for the explanation. Let me check.

votti and others added 15 commits July 18, 2023 21:58
Closely modelled after the tfevent-metricscollector.
Currently not yet working, as there are issues:
the arguments from the `injector_webhook`
are somehow not passed.

Addresses: kubeflow#2019
The TrialName can be parsed from the pod name.

This seems currently a good way to get the trial name. For
more discussion see: kubeflow#2109
This example illustrates how a full kfp pipeline can
be tuned using Katib.

It is based on a metrics collector to collect kubeflow
pipeline metrics (kubeflow#2019). This is used as a Custom Collector.

Addresses: kubeflow#1914, kubeflow#2019
Previously the notebook only worked with Python 3.11.
Now it is also tested with 3.10.

Also the experiment/run name is extended with a timestamp
for easier reruns.
Otherwise the image was binarized, leading to
artificially bad performance.
And remove an old comment
As suggested in the PR review, the generic case where multiple KFP pipeline
metrics files are present in the output folder is now supported.

Note that in the current KFP v1 implementation, only one data
file is ever present.
As per the suggestion, this should make it easier to handle the v2 metrics
collector in the future as well.
This installs Kubeflow Pipelines (KFP), if selected, in order to
run e2e tests where Katib and KFP interact.
This commit should be removed later
Vito Zanotelli added 3 commits July 18, 2023 21:58
These permissions are required such that the katib-controller can launch
argo workflows.
@votti votti force-pushed the feature/kfpv1-metricscollector branch from 25ac27f to 7d33b7b Compare July 18, 2023 20:15
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: votti
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@votti

votti commented Jul 18, 2023

Sorry for not finding more time to work on this.

Status of the e2e tests:
[x] extend setup-katib.sh to install Kubeflow Pipelines -> tested locally on k3s
[ ] build a minimal Experiment for e2e testing (the current one works but is way too extensive for a test)

@andreyvelich (Member) left a comment

Thank you for this great contribution @votti!

Could you please let me know why we need a separate Metrics Collector for KFP?
Why can't we use our default Metrics Collector in File mode, where you can specify the file from which we should parse the metrics?
In that case, the Metrics Collector doesn't parse StdOut but reads the data from the file.

@@ -0,0 +1,24 @@
FROM python:3.10-slim
@andreyvelich (Member):

Please let's use the same structure as for other metrics collectors:
cmd/metricscollector/v1beta1/kfp-metricscollector/Dockerfile

@tenzen-y (Member):

@andreyvelich My concern is where we will put the Dockerfile for KFP v2. So I would suggest we put the Dockerfile for KFP v1 here.
WDYT?

@andreyvelich (Member):

Oh, I see. Do we really need to support KFP v1 if, eventually, every Kubeflow user should migrate to KFP v2?

@tenzen-y (Member):

Because KFP v1 and KFP v2 aren't compatible, I think migrating from v1 to v2 is hard in production.
So I guess users will need a lot of time to update the version.

Hence, supporting KFP v1 in Katib would be useful. WDYT?

@andreyvelich (Member):

I see. In any case, I still have a question (#2118 (review)): why do we need a separate Metrics Collector for KFP if we just need to read the metrics from the metrics file?

@andreyvelich (Member) commented Jul 26, 2023:

Is there any reason for restricting the metrics file configuration to one line?

@zijianjoy Katib metrics collector parses metrics file line by line and expects metrics name and value to be located in a single line.

@votti From the log line I can see that metrics are written to the /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz file, aren't they?

Btw: if you are wondering why I don't just use the StdOut collector and additionally print the metrics to the log: this is because this also broke the argo command:

@votti Yeah, this could be an issue, since we override the start command to make sure we redirect StdOut to the /var/log/katib/metrics.log file, so the Katib Metrics Collector can parse this file. Otherwise, the Metrics Collector can't parse StdOut. The main difference between the StdOut and File metrics collectors is that StdOut tails the /var/log/katib/metrics.log file and prints logs.
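To illustrate the one-metric-per-line expectation mentioned above, here is a minimal sketch (the name=value pattern and the regex are simplified assumptions for illustration; Katib's actual file collector has its own parsing and configurable formats):

```python
import re

# One metric per line, e.g. "accuracy=0.93". A line-oriented collector can
# handle this, but not a KFP metrics JSON blob spread over the whole file.
METRIC_RE = re.compile(r"([\w-]+)\s*=\s*(-?\d+(?:\.\d+)?)")

def parse_metrics_lines(text):
    """Collect name=value metrics, one per line, ignoring other lines."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_RE.search(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics

log = "epoch 3 finished\naccuracy=0.93\nloss=0.21\n"
assert parse_metrics_lines(log) == {"accuracy": 0.93, "loss": 0.21}
```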

@votti (Author):

@metrics file:
One of the complexities Kubeflow Pipelines manages is handling output artifacts (usually compressing them and saving them to S3 storage). This is what seems to be broken when using the file collector: something goes wrong while compressing and copying the file to /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz.

After finding some time to look into it, I think the reason is very similar to the stdout collector:
the collector modifies the argo CMD/ARG in a way that I think causes these issues:

From the pod definition, unmodified (e.g. when using the KFP custom metrics collector):

...
      _outputs = train_e2e(**_parsed_args)
      
    Args:

      --input-nr
      /tmp/inputs/input_nr/data
      --lr
      0.0005293023468535503
      --optimizer
      Adam
      --loss
      categorical_crossentropy
      --epochs
      3
      --batch-size
      36
      --mlpipeline-metrics
      /tmp/outputs/mlpipeline_metrics/data

When using the filecollector as metrics collector:

...
      _outputs = train_e2e(**_parsed_args)
       --input-nr /tmp/inputs/input_nr/data --lr 0.00021802007326291811 --optimizer Adam --loss categorical_crossentropy --epochs 3 --batch-size 53 --mlpipeline-metrics /tmp/outputs/mlpipeline_metrics/data && echo completed > /tmp/outputs/mlpipeline_metrics/$$$$.pid

I think this could be solved by following this proposal: #2181
Until this is fixed, I think having a custom metrics collector that does not modify the command is a necessary workaround.

@andreyvelich (Member) commented Aug 29, 2023:

@votti I think this could also be solved with this feature: #577.
Basically, we can use the Katib SDK to implement an API for pushing metrics to the Katib DB, instead of using pull-based metrics collectors, which require changing the entrypoint.

Users would need to report metrics in their objective training function.

For example:

import kubeflow.katib as katib

client = katib.KatibClient()
client.report(metrics={"accuracy": 0.9, "loss": 0.01})

We might need to make additional changes to the Katib controller to verify that metrics were reported by the user.

@votti (Author):

@Push-based metrics collection: that sounds like a good potential solution!
So the KatibClient can automatically infer which trial these metrics are associated with?

@andreyvelich (Member) commented Oct 24, 2023:

@votti Currently, the user can get the Trial name using the ${trialSpec.Name} template in their Trial's Pod environment variables. Then, the user can call the KatibClient API with the appropriate Trial name to insert metrics into the Katib DB.
I think we should always add a TRIAL_NAME env var to the Trial pod, since it is useful for many use-cases (e.g. exporting the trained model to S3, saving Trial metrics to the DB, etc.).
WDYT @tenzen-y @johnugeorge @votti ?
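A sketch of what this could look like from inside a Trial container (the TRIAL_NAME variable and the report call follow the proposal above; neither is an existing Katib API at this point):

```python
import os

# Assumption: the controller injects the Trial name into the pod,
# e.g. via the ${trialSpec.Name} template mentioned above.
os.environ.setdefault("TRIAL_NAME", "kfp-example-trial-abc12")

trial_name = os.environ["TRIAL_NAME"]

# Hypothetical push-based reporting per #577 (not an existing API):
# import kubeflow.katib as katib
# katib.KatibClient().report(trial_name, metrics={"accuracy": 0.9})
assert trial_name  # the trial can now tag its metrics with its own name
```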

manifests/v1beta1/components/mysql/pvc.yaml (review thread resolved)
Vito Zanotelli added 4 commits July 20, 2023 22:41
This adds a dummy e2e example that can be used to test the main
functionality.
This could be used for e2e testing
pre-commit fix
Vito Zanotelli added 3 commits September 12, 2023 22:28
Otherwise the patching of the `katib-controller` cluster role would
not work.
This enables the user to set the KFP version, which should be useful
for using this script to install KFP v1 and v2 without additional
parameters.
This is required for Kubeflow Pipelines, as I found no easy way
to install Kubeflow Pipelines into the `default` namespace
that was previously hardcoded.

Now the namespace can be passed as a parameter.
@votti votti force-pushed the feature/kfpv1-metricscollector branch from 2254867 to cc90afd Compare October 21, 2023 15:51
This action should now run the Kubeflow Pipelines v1 e2e example.

This required extending the `template-e2e-test` to include
parameters to
a) install KFP
b) select the `kubeflow` namespace (instead of `default`) to run the tests
with.
@votti votti force-pushed the feature/kfpv1-metricscollector branch from cc90afd to 582a6a7 Compare October 21, 2023 16:07
@votti

votti commented Oct 21, 2023

I have now tried to implement an e2e test and checked that it runs locally (on a local k3s cluster).
What I don't know is how to trigger the CI to run the test. Could someone help me and check whether the config looks right?
Thanks already!

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions

This pull request has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@github-actions github-actions bot closed this Feb 11, 2024
Successfully merging this pull request may close these issues.

Add support for entire kubeflow pipelines as trial target (in addition to containers)