Add VictoriaMetrics switch guide for TiUP cluster #20335

Open
wants to merge 2 commits into
base: master
170 changes: 170 additions & 0 deletions maintain-tidb-using-tiup.md
@@ -263,6 +263,176 @@ tiup cluster clean ${cluster-name} --all --ignore-node 172.16.13.11:9000
tiup cluster clean ${cluster-name} --all --ignore-node 172.16.13.12
```

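## Switch from Prometheus to VictoriaMetrics

In large clusters, Prometheus can face efficiency challenges, especially when the cluster has a large number of instances. Starting from tiup 1.16.3, TiUP supports switching the metrics server from Prometheus to VictoriaMetrics (VM) to provide better scalability, higher performance, and lower resource consumption.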
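
Contributor suggested change:
- In large clusters, Prometheus can face efficiency challenges, especially when the cluster has a large number of instances. Starting from tiup 1.16.3, TiUP supports switching the metrics server from Prometheus to VictoriaMetrics (VM) to provide better scalability, higher performance, and lower resource consumption.
+ In large clusters, Prometheus might hit performance bottlenecks when handling a large number of instances. Starting from tiup 1.16.3, TiUP supports switching the metrics server from Prometheus to VictoriaMetrics (VM) to provide better scalability, higher performance, and lower resource consumption.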
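
The minimum version stated above (tiup 1.16.3) can be checked before any configuration is edited. The following is a minimal sketch, not part of TiUP: `version_ge` is a hypothetical helper that compares two version strings with `sort -V`, which assumes GNU coreutils or a compatible `sort` is available.

```bash
# Hypothetical helper (not a TiUP command).
# Succeeds (exit 0) if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort).
version_ge() {
  # Strip an optional leading "v" so "v1.16.3" and "1.16.3" compare equally.
  have="${1#v}"
  want="${2#v}"
  # After a version sort, the first line is the lower of the two versions.
  lowest="$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
}
```

For example, `version_ge "1.17.0" "1.16.3"` succeeds, while `version_ge "1.16.2" "1.16.3"` fails. The installed version is reported by `tiup --version`; the sketch takes the version as an argument rather than parsing that output, because the exact output format is not assumed here.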


### Set up VictoriaMetrics for a new deployment

Contributor suggested change:
- ### Set up VictoriaMetrics for a new deployment
+ ### Enable VictoriaMetrics in a new deployment


By default, TiUP uses Prometheus as the metrics server. To use VictoriaMetrics instead of Prometheus in a new deployment, configure the topology file as follows:

```yaml
# Monitoring server configuration
monitoring_servers:
  # IP address of the monitoring server
  - host: ip_address
    ...
    prom_remote_write_to_vm: true
    enable_prom_agent_mode: true

# Grafana server configuration
grafana_servers:
  # IP address of the Grafana server
  - host: ip_address
    ...
    use_vm_as_datasource: true
```
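
Before deploying, a quick text-level check can confirm that `prom_remote_write_to_vm`, `enable_prom_agent_mode`, and `use_vm_as_datasource` are present and uncommented in the topology file. This is only a sketch: `check_vm_flags` is a hypothetical helper, the file path is whatever you pass in, and the check does not verify that each key sits under the correct section.

```bash
# Hypothetical helper (not a TiUP command).
# Reports whether a topology file sets all three VictoriaMetrics flags to true.
# Usage: check_vm_flags ./topology.yaml
check_vm_flags() {
  topology_file="$1"
  missing=0
  for key in prom_remote_write_to_vm enable_prom_agent_mode use_vm_as_datasource; do
    # Match an uncommented "key: true" line, allowing leading indentation
    # and an optional trailing comment.
    if ! grep -Eq "^[[:space:]]*${key}:[[:space:]]*true[[:space:]]*(#.*)?$" "$topology_file"; then
      echo "not set to true: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The function exits with status 0 only when all three keys are set to `true`; a commented-out key such as `# use_vm_as_datasource: true` is reported on standard error as not set.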

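### Migrate an existing deployment to VictoriaMetrics

The migration can be performed without affecting running instances. Existing metrics remain in Prometheus, while the latest metrics are written to VictoriaMetrics.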

Contributor suggested change:
- The migration can be performed without affecting running instances. Existing metrics remain in Prometheus, while the latest metrics are written to VictoriaMetrics.
+ The migration can be performed without interrupting service: existing historical metrics remain in Prometheus, while new metrics are written to VictoriaMetrics.


#### Enable VictoriaMetrics remote write

Contributor suggested change:
- #### Enable VictoriaMetrics remote write
+ #### Enable remote write from Prometheus to VictoriaMetrics

1. Edit the cluster configuration:

{{< copyable "shell-regular" >}}

```bash
tiup cluster edit-config ${cluster-name}
```

2. Under `monitoring_servers`, set `prom_remote_write_to_vm` to `true`:

Contributor suggested change:
- 2. Under `monitoring_servers`, set `prom_remote_write_to_vm` to `true`:
+ 2. In the `monitoring_servers` configuration, add `prom_remote_write_to_vm`: `true`


```yaml
monitoring_servers:
  - host: ip_address
    ...
    prom_remote_write_to_vm: true
```

3. Reload the updated configuration:

Contributor suggested change:
- 3. Reload the updated configuration
+ 3. Reload the configuration for it to take effect


{{< copyable "shell-regular" >}}

```bash
tiup cluster reload ${cluster-name} -R prometheus
```

#### Switch the default data source to VictoriaMetrics

Contributor suggested change:
- #### Switch the default data source to VictoriaMetrics
+ #### Switch the Grafana default data source to VictoriaMetrics

1. Edit the cluster configuration:

{{< copyable "shell-regular" >}}

```bash
tiup cluster edit-config ${cluster-name}
```

2. Under `grafana_servers`, set `use_vm_as_datasource` to `true`:

Contributor suggested change:
- 2. Under `grafana_servers`, set `use_vm_as_datasource` to `true`:
+ 2. In the `grafana_servers` configuration, add `use_vm_as_datasource`: `true`


```yaml
grafana_servers:
  - host: ip_address
    ...
    use_vm_as_datasource: true
```

3. Reload the updated configuration:

Contributor suggested change:
- 3. Reload the updated configuration
+ 3. Reload the configuration for it to take effect


{{< copyable "shell-regular" >}}

```bash
tiup cluster reload ${cluster-name} -R grafana
```
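
Once both settings have been saved with `tiup cluster edit-config`, the two reloads described above could be run back to back. The following sketch follows the order of the steps in this section (Prometheus first, then Grafana). The `run` wrapper, the `DRY_RUN` variable, and the `switch_to_vm` function are hypothetical names introduced for illustration; only the `tiup cluster reload` commands come from this guide.

```bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: reload Prometheus, then Grafana, stopping on the first failure.
# Assumes prom_remote_write_to_vm and use_vm_as_datasource were already saved
# with `tiup cluster edit-config`.
switch_to_vm() {
  cluster_name="$1"
  run tiup cluster reload "$cluster_name" -R prometheus &&
    run tiup cluster reload "$cluster_name" -R grafana
}
```

Setting `DRY_RUN=1` before calling `switch_to_vm ${cluster-name}` prints the two commands without executing them, which makes it possible to review them first.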

#### (Optional) View metrics generated before the switch

Contributor suggested change:
- #### (Optional) View metrics generated before the switch
+ #### (Optional) View historical metrics from before the switch

To view historical metrics generated before the switch, change the Grafana data source as follows:

1. Edit the cluster configuration:

{{< copyable "shell-regular" >}}

```bash
tiup cluster edit-config ${cluster-name}
```

2. Comment out `use_vm_as_datasource` under `grafana_servers`:

```yaml
grafana_servers:
  - host: ip_address
    ...
    # use_vm_as_datasource: true
```

3. Reload the updated configuration:

Contributor suggested change:
- 3. Reload the updated configuration
+ 3. Reload the configuration for it to take effect


{{< copyable "shell-regular" >}}

```bash
tiup cluster reload ${cluster-name} -R grafana
```

4. To switch back to VictoriaMetrics, repeat the steps in "Switch the default data source to VictoriaMetrics".

Contributor suggested change:
- 4. To switch back to VictoriaMetrics, repeat the steps in "Switch the default data source to VictoriaMetrics".
+ 4. If you need to switch back to VictoriaMetrics, repeat the steps in "Switch the Grafana default data source to VictoriaMetrics".


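### Clean up old metrics and services

After confirming that the old metrics have expired, you can remove the redundant services and files with the following steps. This does not affect the running cluster.

Contributor suggested change:
- After confirming that the old metrics have expired, you can remove the redundant services and files with the following steps. This does not affect the running cluster
+ Once you have confirmed that the old metrics have expired, you can remove the related redundant services and files with the following steps. This does not affect the normal operation of the cluster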


#### Set Prometheus to agent mode

1. Edit the cluster configuration:

{{< copyable "shell-regular" >}}

```bash
tiup cluster edit-config ${cluster-name}
```

2. Under `monitoring_servers`, set `enable_prom_agent_mode` to `true`, and make sure `prom_remote_write_to_vm` and `use_vm_as_datasource` (the latter under `grafana_servers`) are also set correctly:

Contributor suggested change:
- 2. Under `monitoring_servers`, set `enable_prom_agent_mode` to `true`, and make sure `prom_remote_write_to_vm` and `use_vm_as_datasource` are also set correctly:
+ 2. Set agent mode and make sure the related parameters are configured correctly
+ Under `monitoring_servers`, set `enable_prom_agent_mode` to `true`, and make sure `prom_remote_write_to_vm` and `use_vm_as_datasource` are also set correctly:


```yaml
monitoring_servers:
  - host: ip_address
    ...
    prom_remote_write_to_vm: true
    enable_prom_agent_mode: true

grafana_servers:
  - host: ip_address
    ...
    use_vm_as_datasource: true
```

3. Reload the updated configuration:

Contributor suggested change:
- 3. Reload the updated configuration
+ 3. Reload the configuration for it to take effect


{{< copyable "shell-regular" >}}

```bash
tiup cluster reload ${cluster-name} -R prometheus
```

#### Remove the expired data directory

Contributor suggested change:
- #### Remove the expired data directory
+ #### Delete the old Prometheus data directory

1. Find the `data_dir` of the monitoring server in the configuration file:

Contributor suggested change:
- 1. Find the `data_dir` of the monitoring server in the configuration file
+ 1. Find the data directory path `data_dir` of the monitoring server in the configuration file


```yaml
monitoring_servers:
  - host: ip_address
    ...
    data_dir: "/tidb-data/prometheus-8249"
```

2. Remove the data directory:

Contributor suggested change:
- 2. Remove the data directory
+ 2. Manually delete the data directory


{{< copyable "shell-regular" >}}

```bash
rm -rf /tidb-data/prometheus-8249
```
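
Because `rm -rf` cannot be undone, it may be worth guarding the removal against an empty or mistyped path. The following is a minimal sketch under stated assumptions: `remove_prometheus_data_dir` is a hypothetical helper, it is meant to be run on the monitoring host, and the path you pass in is the `data_dir` value found in the previous step.

```bash
# Hypothetical guard around the removal (not a TiUP command).
# Refuses empty paths, the root directory, and paths that are not directories.
remove_prometheus_data_dir() {
  data_dir="$1"
  case "$data_dir" in
    ""|"/")
      echo "refusing to remove '${data_dir}'" >&2
      return 1
      ;;
  esac
  if [ ! -d "$data_dir" ]; then
    echo "not a directory: ${data_dir}" >&2
    return 1
  fi
  # "--" stops option parsing, so a path starting with "-" is not read as a flag.
  rm -rf -- "$data_dir"
}
```

With the example path from this section, the call would be `remove_prometheus_data_dir /tidb-data/prometheus-8249`.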

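## Destroy a cluster

Destroying a cluster stops the services and clears the data directory and the deployment directory. This operation cannot be undone, so **proceed with caution**.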