
Monitoring a cluster using round robin DNS

znerol edited this page Feb 19, 2021 · 2 revisions

Prometheus PVE exporter is designed to return identical metrics no matter which node in a cluster is scraped, so for simple setups it is enough to scrape a single cluster node. In production deployments, however, metrics should remain available even when a cluster node is down.

The simplest way to provide a fallback for when Prometheus PVE exporter or a whole cluster node is down is round-robin DNS.

We assume the following cluster configuration with three PVE nodes:

  • pve-a.example.org: 2001:db8::a
  • pve-b.example.org: 2001:db8::b
  • pve-c.example.org: 2001:db8::c

Set up round-robin DNS

To implement round-robin DNS, add one additional AAAA record per PVE node, all under a common label (here assumed to be in zone example.org):

pve  300 IN AAAA 2001:db8::a
pve  300 IN AAAA 2001:db8::b
pve  300 IN AAAA 2001:db8::c
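
Before pointing Prometheus at the new label, it can be worth checking that the label really resolves to every node. The following is a minimal sketch of such a check, not part of the exporter: the helper names `resolve_aaaa` and `missing_nodes` are made up for illustration, and the `resolver` parameter exists only so that the lookup can be replaced in tests.

```python
import ipaddress
import socket


def resolve_aaaa(hostname, port=9221, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv6 addresses found for hostname."""
    infos = resolver(hostname, port, family=socket.AF_INET6,
                     type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the textual address.
    addresses = {ipaddress.ip_address(info[4][0]) for info in infos}
    return sorted(str(address) for address in addresses)


def missing_nodes(hostname, expected, resolver=socket.getaddrinfo):
    """Return the expected node addresses that the label does not resolve to."""
    found = {ipaddress.ip_address(a)
             for a in resolve_aaaa(hostname, resolver=resolver)}
    wanted = {ipaddress.ip_address(a) for a in expected}
    return sorted(str(address) for address in wanted - found)
```

Calling `missing_nodes("pve.example.org", ["2001:db8::a", "2001:db8::b", "2001:db8::c"])` from a host that uses the same resolver as Prometheus should return an empty list once all three records are published.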

Set up scrape config

Assuming that Prometheus PVE exporter is running on every node, the targets parameter of the PVE job can now be set to pve.example.org:

scrape_configs:
  - job_name: 'pve'
    static_configs:
      - targets:
        - pve.example.org:9221
    metrics_path: /pve
    params:
      module: [default]

Whenever Prometheus tries to scrape a node that is not available, it will retry with another IP from the pve.example.org record after a short timeout.
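
The fallback behaviour described above can be illustrated with a small probe that tries each address in turn and stops at the first one that accepts a connection. This is only a sketch of the idea, not Prometheus's actual connection logic; the function name `connect_first_available`, the 2-second timeout, and the injectable `connect` parameter are assumptions made for the example.

```python
import socket


def connect_first_available(addresses, port=9221, timeout=2.0,
                            connect=socket.create_connection):
    """Try each address in order; return (address, connection) for the first success."""
    last_error = None
    for address in addresses:
        try:
            # A node that is down fails here (refused or timed out),
            # and the loop moves on to the next address.
            return address, connect((address, port), timeout=timeout)
        except OSError as exc:
            last_error = exc
    raise ConnectionError(f"no address reachable (last error: {last_error})")
```

Running it against the three node addresses while one node is shut down shows which node would end up serving the request; remember to close the returned connection afterwards.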