Create Kubeflow Integration (#18391)
* inital set kubeflow

* set config file kubeflow

* implmentation

* metrics

* validation

* labeler

* dash

* monitor

* check

* test typo

* linter

* kubeflow assets

* add units

* fix unit

* fix labler

* fix labler

* fix labler
HadhemiDD authored Sep 11, 2024
1 parent 0ae2279 commit fe2c262
Showing 33 changed files with 3,186 additions and 0 deletions.
9 changes: 9 additions & 0 deletions .codecov.yml
@@ -326,6 +326,10 @@ coverage:
        target: 75
        flags:
        - kube_metrics_server
      Kubeflow:
        target: 75
        flags:
        - kubeflow
      Kubelet:
        target: 75
        flags:
@@ -1138,6 +1142,11 @@ flags:
    paths:
    - kube_scheduler/datadog_checks/kube_scheduler
    - kube_scheduler/tests
  kubeflow:
    carryforward: true
    paths:
    - kubeflow/datadog_checks/kubeflow
    - kubeflow/tests
  kubelet:
    carryforward: true
    paths:
2 changes: 2 additions & 0 deletions .github/workflows/config/labeler.yml
@@ -279,6 +279,8 @@ integration/kube_proxy:
- kube_proxy/**/*
integration/kube_scheduler:
- kube_scheduler/**/*
integration/kubeflow:
- kubeflow/**/*
integration/kubelet:
- kubelet/**/*
integration/kubernetes:
20 changes: 20 additions & 0 deletions .github/workflows/test-all.yml
@@ -2094,6 +2094,26 @@ jobs:
      minimum-base-package: ${{ inputs.minimum-base-package }}
      pytest-args: ${{ inputs.pytest-args }}
    secrets: inherit
  j89c297c:
    uses: ./.github/workflows/test-target.yml
    with:
      job-name: Kubeflow
      target: kubeflow
      platform: linux
      runner: '["ubuntu-22.04"]'
      repo: "${{ inputs.repo }}"
      python-version: "${{ inputs.python-version }}"
      standard: ${{ inputs.standard }}
      latest: ${{ inputs.latest }}
      agent-image: "${{ inputs.agent-image }}"
      agent-image-py2: "${{ inputs.agent-image-py2 }}"
      agent-image-windows: "${{ inputs.agent-image-windows }}"
      agent-image-windows-py2: "${{ inputs.agent-image-windows-py2 }}"
      test-py2: ${{ inputs.test-py2 }}
      test-py3: ${{ inputs.test-py3 }}
      minimum-base-package: ${{ inputs.minimum-base-package }}
      pytest-args: ${{ inputs.pytest-args }}
    secrets: inherit
  j24a5cff:
    uses: ./.github/workflows/test-target.yml
    with:
4 changes: 4 additions & 0 deletions kubeflow/CHANGELOG.md
@@ -0,0 +1,4 @@
# CHANGELOG - Kubeflow

<!-- towncrier release notes start -->

92 changes: 92 additions & 0 deletions kubeflow/README.md
@@ -0,0 +1,92 @@
# Agent Check: Kubeflow

## Overview

This check monitors [Kubeflow][1] through the Datadog Agent.


## Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates][3] for guidance on applying these instructions.

### Installation

The Kubeflow check is included in the [Datadog Agent][2] package.
No additional installation is needed on your server.

### Configuration

1. Edit the `kubeflow.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory, to start collecting your Kubeflow performance data. See the [sample kubeflow.d/conf.yaml][4] for all available configuration options.

2. [Restart the Agent][5].

#### Metric collection

Make sure that the Prometheus-formatted metrics are exposed for your `kubeflow` component.
For the Agent to start collecting metrics, the `kubeflow` pods need to be annotated.

Kubeflow has metrics endpoints that can be accessed on port `9090`.

**Note**: The listed metrics can only be collected if they are available, which depends on the Kubeflow version. Some metrics are generated only when certain actions are performed.

The only parameter required for configuring the `kubeflow` check is `openmetrics_endpoint`. This parameter should be set to the location where the Prometheus-formatted metrics are exposed. The default port is `9090`. In containerized environments, `%%host%%` should be used for [host autodetection][3].

```yaml
apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "kubeflow": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9090/metrics"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'controller'
# (...)
```
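
For an Agent running directly on a host, the same setting goes in `kubeflow.d/conf.yaml` instead of a pod annotation. The following is a minimal sketch, assuming the Kubeflow metrics endpoint is reachable from the Agent at `localhost` on the default port `9090`. The hostname is an assumption for illustration; substitute the address where your component actually exposes its metrics.

```yaml
init_config:

instances:
    # Assumed address: replace localhost with the host that serves the metrics.
  - openmetrics_endpoint: http://localhost:9090/metrics
```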

### Validation

[Run the Agent's status subcommand][6] and look for `kubeflow` under the Checks section.

## Data Collected

### Metrics

See [metadata.csv][7] for a list of metrics provided by this integration.

### Events

The Kubeflow integration does not include any events.

### Service Checks

The Kubeflow integration does not include any service checks.

See [service_checks.json][8] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][9].


[1]: **LINK_TO_INTEGRATION_SITE**
[2]: https://app.datadoghq.com/account/settings/agent/latest
[3]: https://docs.datadoghq.com/agent/kubernetes/integrations/
[4]: https://github.com/DataDog/integrations-core/blob/master/kubeflow/datadog_checks/kubeflow/data/conf.yaml.example
[5]: https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent
[6]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
[7]: https://github.com/DataDog/integrations-core/blob/master/kubeflow/metadata.csv
[8]: https://github.com/DataDog/integrations-core/blob/master/kubeflow/assets/service_checks.json
[9]: https://docs.datadoghq.com/help/
16 changes: 16 additions & 0 deletions kubeflow/assets/configuration/spec.yaml
@@ -0,0 +1,16 @@
name: Kubeflow
files:
- name: kubeflow.yaml
  options:
  - template: init_config
    options:
    - template: init_config/openmetrics
  - template: instances
    options:
    - template: instances/openmetrics
      overrides:
        openmetrics_endpoint.required: true
        openmetrics_endpoint.value.example: http://<prometheus-service>:9090/metrics
        openmetrics_endpoint.description: |
          Endpoint exposing the Kubeflow's Prometheus metrics.