doc(katib): update push-based metrics collector. #3844

Open
wants to merge 2 commits into
base: master
65 changes: 56 additions & 9 deletions content/en/docs/components/katib/user-guides/metrics-collector.md
Expand Up @@ -6,16 +6,23 @@ weight = 40

This guide describes how the Katib metrics collector works.

## Metrics Collector
## Overview

There are two ways to collect metrics:

1. Pull-based: Katib collects the metrics using a _sidecar_ container. A sidecar is a utility container that supports
the main container in the Kubernetes Pod.

2. Push-based: the training script pushes the metrics directly to Katib DB.

In the `metricsCollectorSpec` section of the Experiment YAML configuration file, you can
define how Katib should collect the metrics from each Trial, such as the accuracy and loss metrics.

Your training code can record the metrics into `StdOut` or into arbitrary output files. Katib
collects the metrics using a _sidecar_ container. A sidecar is a utility container that supports
the main container in the Kubernetes Pod.
## Pull-based Metrics Collector

To define the metrics collector for your Experiment:
Your training code can record the metrics into `StdOut` or into arbitrary output files.

To define the pull-based metrics collector for your Experiment:

1. Specify the collector type in the `.collector.kind` field.
Katib's metrics collector supports the following collector types:
Expand All @@ -29,7 +36,7 @@ To define the metrics collector for your Experiment:
metrics must be line-separated by `epoch` or `step` as follows, and the key for timestamp must
be `timestamp`:

```
```json
{"epoch": 0, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:51"}
{"epoch": 1, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:52"}
{"epoch": 2, "foo": "bar", "fizz": "buzz", "timestamp": "2021-12-02T14:27:53"}
Expand All @@ -51,9 +58,6 @@ To define the metrics collector for your Experiment:
in the `.collector.customCollector` field. Check the
[custom metrics collector example](https://github.com/kubeflow/katib/blob/ea46a7f2b73b2d316b6b7619f99eb440ede1909b/examples/v1beta1/metrics-collector/custom-metrics-collector.yaml#L14-L36).

- `None`: Specify this value if you don't need to use Katib's metrics collector. For example,
your training code may handle the persistent storage of its own metrics.

2. Write code in your training container to print the metrics, or save them to a file, in the format
specified in the `.source.filter.metricsFormat` field. The default metrics format value is:

Expand All @@ -79,3 +83,46 @@ To define the metrics collector for your Experiment:
recall=0.55
precision=.5
```
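
As a rough illustration of the pull-based flow, the training code only has to emit metrics in a shape the sidecar can parse. The sketch below is not part of Katib: `text_metric_lines` and `json_metric_line` are hypothetical helpers that produce `name=value` lines like the ones shown above and a JSON line carrying the required `timestamp` key. Writing the JSON values as strings mirrors the sample earlier in this guide and is an assumption, not a documented requirement.

```python
import json
from datetime import datetime, timezone


def text_metric_lines(metrics):
    """Hypothetical helper: render metrics as `name=value` lines, one per metric."""
    return [f"{name}={value}" for name, value in metrics.items()]


def json_metric_line(epoch, metrics, timestamp=None):
    """Hypothetical helper: render one epoch's metrics as a single JSON line.

    The timestamp key is named `timestamp`, as the guide requires.
    Values are written as strings (an assumption based on the guide's sample).
    """
    if timestamp is None:
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    record = {"epoch": epoch}
    record.update({name: str(value) for name, value in metrics.items()})
    record["timestamp"] = timestamp
    return json.dumps(record)


if __name__ == "__main__":
    # Text format: print to standard output so a StdOut collector can read it.
    for line in text_metric_lines({"recall": 0.55, "precision": 0.5}):
        print(line)

    # JSON format: one line per epoch; in practice this would be appended to the metrics file.
    print(json_metric_line(0, {"accuracy": 0.91}, "2021-12-02T14:27:51"))
```

Keeping the formatting in small functions like these makes it easy to switch between printing and writing to a file without touching the training loop.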

## Push-based Metrics Collector

Your training code needs to call the [`report_metrics()`](https://github.com/kubeflow/katib/blob/e251a07cb9491e2d892db306d925dddf51cb0930/sdk/python/v1beta1/kubeflow/katib/api/report_metrics.py#L26) function from the Python SDK to record metrics.
The `report_metrics()` function parses the values passed in its `metrics` argument into a gRPC request, automatically adds the current timestamp, and sends the request to Katib DB Manager.

Before calling it, install the `kubeflow-katib` package in your training container.
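
To make the timestamping step concrete, here is a minimal sketch of the idea; it is not the SDK's implementation. `build_metric_logs` and the dictionary keys it returns are hypothetical names chosen for illustration. The real function builds a gRPC request and sends it to Katib DB Manager, which this sketch leaves out.

```python
from datetime import datetime, timezone


def build_metric_logs(metrics, now=None):
    """Hypothetical helper: attach one shared UTC timestamp to every metric.

    `metrics` is a plain {"name": value} dictionary, the same shape the
    guide passes to `report_metrics()`.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        {"time_stamp": timestamp, "metric": {"name": name, "value": str(value)}}
        for name, value in metrics.items()
    ]


if __name__ == "__main__":
    # The caller supplies only names and values; the timestamp is added for them.
    for entry in build_metric_logs({"result": 40}):
        print(entry)
```

The point of the sketch is that the caller never supplies a timestamp: it is attached at the moment the metrics are reported.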

To define the push-based metrics collector for your Experiment, you have two options:

- YAML File

1. Specify the collector type `Push` in the `.collector.kind` field.

2. Write code in your training container to call `report_metrics()` to report metrics.

- [`tune`](https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L166) function

  Use the `tune` function and specify the `metrics_collector_config` parameter. You can refer to the following example:
Member

You might need to explain separately how the report_metrics() function works, e.g. the user needs to install the kubeflow-katib SDK in their environment, and report_metrics() automatically adds the timestamp to the metrics.

Member Author

SGTM. I'll add the explanation to report_metrics().


```
import kubeflow.katib as katib
Member

Let's use simple example here and remove all unnecessary parameters and function calls.

Member Author

@Electronic-Waste Electronic-Waste Sep 4, 2024

@andreyvelich Sorry, I may not fully understand what you mean by "unnecessary parameters and function calls".

I copied the example in the get-started chapter and replaced print with report_metrics(). Do you mean that I don't need to call the tune() function and just define a main function in the example?

Member

I think we should just show an example of how a user can use the tune API with the push-based metrics collector, without any additional parameters they would need to set. For example:

import kubeflow.katib as katib

def objective(parameters):
  import time
  import kubeflow.katib as katib
  time.sleep(5)
  result = 4 * int(parameters["a"])
  # Push metrics to Katib DB.
  katib.report_metrics({"result": result})

katib.KatibClient().tune(
  name="push-metrics-exp",
  objective=objective,
  parameters={"a": katib.search.int(min=10, max=20)},
  objective_metric_name="result",
  max_trial_count=2,
  metrics_collector_config={"kind": "Push"},
  # When SDK is released, replace it with packages_to_install=["kubeflow-katib==0.18.0"] 
  packages_to_install=["git+https://github.com/kubeflow/katib.git@master#subdirectory=sdk/python/v1beta1"],
)

That should allow user to focus on important changes they need to make to try this out.

Member Author

@andreyvelich Thanks for your clarification! I'll update the blog.


def objective(parameters):
    # Imports live inside the function because it runs in the Trial container.
    import time
    import kubeflow.katib as katib
    time.sleep(5)
    result = 4 * int(parameters["a"])
    # Push metrics to Katib DB.
    katib.report_metrics({"result": result})

katib.KatibClient(namespace="kubeflow").tune(
    name="push-metrics-exp",
    objective=objective,
    parameters={"a": katib.search.int(min=10, max=20)},
    objective_metric_name="result",
    max_trial_count=2,
    metrics_collector_config={"kind": "Push"},
    # When the SDK is released, replace this with packages_to_install=["kubeflow-katib==0.18.0"].
    # Until then, the training container needs the `git` package to install this SDK.
    packages_to_install=["git+https://github.com/kubeflow/katib.git@master#subdirectory=sdk/python/v1beta1"],
)
```
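
For the YAML option, the relevant part of the Experiment might look like the fragment below, written as a Python dictionary so it can be checked or merged into a spec that is built in code. Only the `metricsCollectorSpec` section, the `.collector.kind` field, and the `Push` value come from this guide; the variable name is hypothetical and the rest of the Experiment is omitted.

```python
# Hypothetical fragment of an Experiment spec: only the metrics collector section.
experiment_spec_fragment = {
    "metricsCollectorSpec": {
        "collector": {
            # Selects the push-based collector; the training code must then
            # call report_metrics() itself.
            "kind": "Push",
        },
    },
}

if __name__ == "__main__":
    kind = experiment_spec_fragment["metricsCollectorSpec"]["collector"]["kind"]
    print(f"collector kind: {kind}")
```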