Feature/Example for training KFP v1 #2118
Conversation
@votti Thanks for creating this PR, and sorry for the late response.
I left a few review comments for the first pass.
Currently the Docker image with the custom metrics collector lives in my personal Docker registry. Maybe it would be better to add it to the kubeflow organisation? What would the steps be to achieve this?
Yes, we should build and push the image to our container registry.
You can add the image name and dockerfile path to the following:
katib/.github/workflows/publish-core-images.yaml
Lines 19 to 34 in b6afce7
strategy:
  fail-fast: false
  matrix:
    include:
      - component-name: katib-controller
        dockerfile: cmd/katib-controller/v1beta1/Dockerfile
      - component-name: katib-db-manager
        dockerfile: cmd/db-manager/v1beta1/Dockerfile
      - component-name: katib-ui
        dockerfile: cmd/new-ui/v1beta1/Dockerfile
      - component-name: cert-generator
        dockerfile: cmd/cert-generator/v1beta1/Dockerfile
      - component-name: file-metrics-collector
        dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile
      - component-name: tfevent-metrics-collector
        dockerfile: cmd/metricscollector/v1beta1/tfevent-metricscollector/Dockerfile
Moreover, can you add this image to the build and push scripts, https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/build.sh and https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/push.sh?
@tenzen-y Thanks a lot for your comments!
@votti Thanks for your diligent work. We must add an E2E test for this metrics collector. However, I think we can work on it on follow-up PRs. @kubeflow/wg-automl-leads Can you approve CI?
Hi Kubeflow team, first of all sorry for not finding time to work on this earlier. "We must add an E2E test for this metrics collector. However, I think we can work on it on follow-up PRs." As far as I know nobody is actively using this (yet), so I think there would be time to add an E2E test together with this PR, making it easier to maintain long-term.
@votti Thanks for the updates! I'll check this PR again later.
Maybe, we need to update
Thanks for the updates! I left a few comments.
Also, once the tests are added, can you ping me?
pool_interval=opt.poll_interval,
timout=opt.timeout,
wait_all=wait_all_processes,
completed_marked_dir=None,
Why do we set None to completed_marked_dir? Can we set opt.metrics_file_dir instead?
The documentation on this is a bit sparse, but if I understand the code right, this would require the Kubeflow pipeline to write a file <pid>.pid with some TRAINING_COMPLETED text into this directory, which it does not do:
katib/pkg/metricscollector/v1beta1/common/pns.py
Lines 95 to 104 in f740889
if completed_marked_dir:
    mark_file = os.path.join(completed_marked_dir, "{}.pid".format(pid))
    # Check if file contains "completed" marker
    with open(mark_file) as file_obj:
        contents = file_obj.read()
        if contents.strip() != const.TRAINING_COMPLETED:
            raise Exception(
                "Unable to find marker: {} in file: {} with contents: {} for pid: {}".format(
                    const.TRAINING_COMPLETED, mark_file, contents, pid))
# Add main pid to finished pids set
So I think None is correct here.
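For illustration only (not part of the PR): a minimal sketch of what the trial process itself would have to do for completed_marked_dir to be usable. The directory path and the literal marker value are assumptions here, since the KFP v1 components do not write such a file:

import os

def write_completion_marker(marked_dir):
    # Write "<pid>.pid" containing the completion marker into marked_dir,
    # which is what pns.py checks for when completed_marked_dir is set.
    os.makedirs(marked_dir, exist_ok=True)
    mark_file = os.path.join(marked_dir, "{}.pid".format(os.getpid()))
    with open(mark_file, "w") as f:
        f.write("completed")  # assumed value of const.TRAINING_COMPLETED

# Hypothetical usage inside the pipeline step; the path is illustrative:
# write_completion_marker("/tmp/outputs/mlpipeline_metrics")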
Thanks for the explanation. Let me check.
Closely modelled after the tfevent-metricscollector. Currently not yet working, as the arguments from the `injector_webhook` are somehow not passed. Addresses: kubeflow#2019
The TrialName can be parsed from the pod name. This currently seems like a good way to get the trial name. For more discussion see: kubeflow#2109
This example illustrates how a full KFP pipeline can be tuned using Katib. It is based on a metrics collector that collects Kubeflow pipeline metrics (kubeflow#2019), used as a Custom Collector. Addresses: kubeflow#1914, kubeflow#2019
Before, the notebook only worked with Python 3.11. Now it is also tested with 3.10. Also, the experiment/run name is extended with a timestamp for easier reruns.
Otherwise the image was binarized, leading to artificially bad performance.
And remove an old comment
Co-authored-by: axel7083 <[email protected]>
As per suggestion
As suggested in the PR review, the generic case where multiple KFP pipeline metrics files are present in the output folder is supported; see the sketch after these commit notes. Note that the current KFP v1 implementation always produces only one data file.
As per suggestion; this should make it easier to handle the v2 metrics collector in the future as well.
This installs Kubeflow Pipelines (KFP), if selected, in order to run e2e tests where Katib and KFP interact.
This commit should be removed later
These permissions are required such that the katib-controller can launch argo workflows.
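Regarding the commit above about supporting multiple KFP metrics files: a rough sketch of how several KFP v1 metrics files in an output folder could be merged, assuming the KFP v1 mlpipeline-metrics JSON layout. The function name and directory argument are illustrative, not the PR's actual code:

import glob
import json
import os

def collect_kfp_v1_metrics(metrics_dir):
    # Merge {name: value} pairs from every KFP v1 metrics JSON file in metrics_dir.
    # Each file is expected to look like {"metrics": [{"name": ..., "numberValue": ...}]}.
    metrics = {}
    for path in sorted(glob.glob(os.path.join(metrics_dir, "*"))):
        with open(path) as f:
            doc = json.load(f)
        for entry in doc.get("metrics", []):
            metrics[entry["name"]] = entry["numberValue"]
    return metrics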
Sorry for not finding more time to work on this. Status of the e2e tests:
Thank you for this great contribution @votti!
Please can you let me know why we need a separate Metrics Collector for KFP?
Why can't we use our default Metrics Collector in File mode, where you can specify the file from which we should parse the metrics?
In that case, the Metrics Collector doesn't parse StdOut and reads the data from the file.
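For context, the File metrics collector mentioned here is configured on the Experiment through metricsCollectorSpec; a rough sketch of its shape, written as a Python dict mirroring the YAML spec (the file path is an assumption):

# Rough shape of a File metrics collector configuration on a Katib Experiment,
# expressed as a Python dict that mirrors the YAML spec; the path is illustrative.
metrics_collector_spec = {
    "collector": {"kind": "File"},
    "source": {
        "fileSystemPath": {
            "kind": "File",
            "path": "/tmp/outputs/mlpipeline_metrics/data",  # assumed metrics file location
        }
    },
}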
@@ -0,0 +1,24 @@
FROM python:3.10-slim
Please let's use the same structure as for other metrics collectors:
cmd/metricscollector/v1beta1/kfp-metricscollector/Dockerfile
@andreyvelich My concern is where we will put the Dockerfile for KFP v2. So I would suggest we put the Dockerfile for KFP v1 here.
WDYT?
Oh, I see. Do we really need to support KFP v1 if, eventually, every Kubeflow user should migrate to KFP v2?
Because KFP v1 and KFP v2 aren't compatible, I think migrating from v1 to v2 is hard in production.
So I guess users will need a lot of time to update the version.
Hence, supporting KFP v1 in Katib would be useful. WDYT?
I see. In any case, I still have a question (#2118 (review)): why do we need a separate Metrics Collector for KFP if we just need to read the logs from the metrics file?
Is there any reason for restricting the metrics file configuration to one line?
@zijianjoy The Katib metrics collector parses the metrics file line by line and expects the metric name and value to be located on a single line.
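For reference, a small sketch of a metrics file the default collector could parse, assuming the default <name>=<value> text format with one metric per line (the path is illustrative):

# Illustrative only: each metric sits on its own line in <name>=<value> form,
# which is what the default text parser expects; the path is an assumption.
with open("/tmp/outputs/mlpipeline_metrics/data", "w") as f:
    f.write("accuracy=0.93\n")
    f.write("loss=0.21\n")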
@votti From the log line I can see that metrics are written to the /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz file, isn't it?
Btw: if you are wondering why I don't just use the Stdout collector and additionally print the metrics to the log: this is because this also broke the argo command:
@votti Yeah, this could be an issue since we override the start command to make sure we redirect StdOut to the /var/log/katib/metrics.log file, so the Katib Metrics Collector can parse this file. Otherwise, the Metrics Collector can't parse the StdOut. The main difference between the StdOut and File metrics collectors is that StdOut tails the /var/log/katib/metrics.log file and prints the logs.
Re the metrics file: one of the complexities Kubeflow Pipelines manages is handling output artifacts (usually compressing them and saving them to S3 storage). This is what seems to be broken when using the file collector, as something seems to go wrong while compressing and copying the file to /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz.
After finding some time to look into it, I think the reason is very similar to the stdout collector: the collector modifies the Argo CMD/ARGS in a way that I think causes these issues.
From the pod definition, unmodified (e.g. when using the custom KFP metrics collector):
...
_outputs = train_e2e(**_parsed_args)
Args:
--input-nr
/tmp/inputs/input_nr/data
--lr
0.0005293023468535503
--optimizer
Adam
--loss
categorical_crossentropy
--epochs
3
--batch-size
36
--mlpipeline-metrics
/tmp/outputs/mlpipeline_metrics/data
When using the filecollector as the metrics collector:
...
_outputs = train_e2e(**_parsed_args)
--input-nr /tmp/inputs/input_nr/data --lr 0.00021802007326291811 --optimizer Adam --loss categorical_crossentropy --epochs 3 --batch-size 53 --mlpipeline-metrics /tmp/outputs/mlpipeline_metrics/data && echo completed > /tmp/outputs/mlpipeline_metrics/$$$$.pid
I think this could be solved by following this proposal: #2181
Until this is fixed, I think having a custom metrics collector that does not modify the command is a necessary workaround.
@votti I think this could also be solved with this feature, couldn't it: #577?
Basically, we can use the Katib SDK to implement an API for pushing metrics to the Katib DB instead of using pull-based metrics collectors, which require changing the entrypoint.
Users would need to report metrics in their objective training function.
For example:
import kubeflow.katib as katib
client = katib.KatibClient()
client.report(metrics={"accuracy": 0.9, "loss": 0.01})
We might need to make additional changes to the Katib controller to verify that the metrics were reported by the user.
Re push-based metrics collection: that sounds like a good potential solution!
So the KatibClient can automatically infer which trial the metrics are associated with?
@votti Currently, the user can get the Trial name using the ${trialSpec.Name} template in their Trial's Pod environment variables. Then, the user can call the KatibClient API with the appropriate Trial name to insert metrics into the Katib DB.
I think we should always add a TRIAL_NAME env var to the Trial pod since it is useful for many use cases (e.g. for exporting the trained model to S3, saving Trial metrics to the DB, etc.).
WDYT @tenzen-y @johnugeorge @votti ?
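A hedged sketch of how the two ideas could combine on the trial side, assuming the proposed report API from #577 and a TRIAL_NAME environment variable exist in roughly this form (neither is current Katib functionality):

import os

import kubeflow.katib as katib

# Hypothetical: both the TRIAL_NAME env var and the report() call are only
# proposed in this discussion, not an existing Katib API.
trial_name = os.environ["TRIAL_NAME"]
client = katib.KatibClient()
client.report(trial_name=trial_name, metrics={"accuracy": 0.9, "loss": 0.01})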
This adds a dummy e2e example that can be used to test the main functionality.
This reverts commit 36ed372.
This could be used for e2e testing
Otherwise the patching of the `katib-controller` cluster role would not work.
This enables the user to set the KFP version, which makes it possible to use this script to install KFP v1 and v2 without additional parameters.
This is required for Kubeflow Pipelines, as I found no easy way to install it into the `default` namespace that was previously hardcoded. Now the namespace can be passed as a parameter.
This action should now run the Kubeflow Pipelines v1 e2e example. This required extending the `template-e2e-test` to include parameters to a) install KFP and b) select the `kubeflow` namespace (instead of `default`) to run the tests in.
I tried now to implement an
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This pull request has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
What this PR does / why we need it:
Currently there is no example of how to do parameter tuning over whole Kubeflow pipelines (#1914).
This example shows how parameter tuning with Katib can be done on a Kubeflow pipeline based on the pipeline v1 SDK.
It adds and uses a new Metrics Collector (kfpv1-metricscollector, #2019) that is used as a custom metrics collector in the pipeline. The kfpv1-metricscollector image is also published to the katib Docker environment.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #1914, #2019
Checklist: