diff --git a/docs/deployment/configuration/monitoring.rst b/docs/deployment/configuration/monitoring.rst
index 75bc89adc4..7b0d9ddc0b 100644
--- a/docs/deployment/configuration/monitoring.rst
+++ b/docs/deployment/configuration/monitoring.rst
@@ -85,15 +85,55 @@ Use Published Dashboards to Monitor Flyte Deployment
 
 Flyte Backend is written in Golang and exposes stats using Prometheus. The stats are labeled with workflow, task, project & domain, wherever appropriate.
 
-The dashboards are divided into two types:
+Both ``flyteadmin`` and ``flytepropeller`` are instrumented to expose metrics. To visualize these metrics, Flyte provides three Grafana dashboards, each with a different focus:
 
 - **User-facing dashboards**: Dashboards that can be used to triage/investigate/observe performance and characteristics of workflows and tasks.
-  The user-facing dashboard is published under Grafana marketplace ID `13980 `__.
+  The user-facing dashboard is published under ID `13980 `__ in the Grafana marketplace.
 
-- **System Dashboards**: Dashboards that are useful for the system maintainer to maintain their Flyte deployments. These are further divided into:
-  - DataPlane/FlytePropeller dashboards published @ `13979 `__
-  - ControlPlane/Flyteadmin dashboards published @ `13981 `__
+- **System Dashboards**: Dashboards that are useful for the system maintainer to investigate the status and performance of their Flyte deployments. These are further divided into:
+  - `DataPlane/FlytePropeller `__: execution engine status and performance.
+  - `ControlPlane/Flyteadmin `__: API-level monitoring.
+
+The corresponding JSON files for each dashboard are also located at ``deployment/stats/prometheus``.
+
+.. note::
+
+   The dashboards are basic dashboards and do not include all the metrics exposed by Flyte.
+   Feel free to use the scripts provided `here `__ to improve and -hopefully- contribute the improved dashboards.
+
+How to use the dashboards
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. We recommend installing and configuring the Prometheus operator as described in `their docs `__.
+This is especially true if you plan to use the Service Monitors provided by the `flyte-core `__ Helm chart.
+
+2. Enable the Prometheus instance to use Service Monitors in the namespace where Flyte is running, configuring the following keys in the ``prometheus`` resource:
+
+.. code-block:: yaml
+
+   spec:
+     serviceMonitorSelector: {}
+     serviceMonitorNamespaceSelector: {}
+
+.. note::
+
+   The above example configuration lets Prometheus use any ``ServiceMonitor`` in any namespace in the cluster. Adjust the configuration to reduce the scope if needed.
+
+3. Once you have installed and configured the Prometheus operator, enable the Service Monitors in the Helm chart by configuring the following keys in your ``values`` file:
+
+.. code-block:: yaml
+
+   flyteadmin:
+     serviceMonitor:
+       enabled: true
+
+   flytepropeller:
+     serviceMonitor:
+       enabled: true
+
+.. note::
+
+   By default, the ``ServiceMonitor`` is configured with a ``scrapeTimeout`` of 30s and an ``interval`` of 60s. You can customize these values if needed.
+
+With the above configuration in place you should be able to import the dashboards in your Grafana instance.
 
-The above mentioned are basic dashboards and do no include all the metrics exposed by Flyte.
-Please help us improve the dashboards by contributing to them 🙏.
-Refer to the build scripts `here `__.
\ No newline at end of file
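
The note under step 2 says the selectors can be narrowed but does not show how. The following is a minimal sketch of a more restrictive ``prometheus`` resource, assuming Flyte runs in a namespace named ``flyte`` and that its ``ServiceMonitor`` objects carry a ``release: flyte`` label. Both the namespace name and the label are hypothetical placeholders, not values taken from the Helm chart, so check the labels actually present in your deployment before using them. Both fields are standard Kubernetes label selectors.

.. code-block:: yaml

   spec:
     # Select only ServiceMonitor objects carrying this label
     # (hypothetical label; replace with one your ServiceMonitors actually have).
     serviceMonitorSelector:
       matchLabels:
         release: flyte
     # Look for ServiceMonitors only in the namespace named "flyte"
     # (hypothetical name). Recent Kubernetes versions set the
     # kubernetes.io/metadata.name label on every namespace automatically.
     serviceMonitorNamespaceSelector:
       matchLabels:
         kubernetes.io/metadata.name: flyte

With selectors like these, Prometheus ignores any ``ServiceMonitor`` outside the chosen namespace or without the chosen label, which may be preferable on clusters shared with other workloads.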