INFRA-709 Rated dwpd alerts #1077

Open
wants to merge 3 commits into
base: stackhpc/2023.1
97 changes: 97 additions & 0 deletions etc/kayobe/ansible/get-nvme-drives.yml
@@ -0,0 +1,97 @@
+---
+- name: Gather unique NVMe disk models and generate a prepopulated variable template
+  hosts: overcloud
+  gather_facts: no
+  tasks:
+    - name: Get NVMe device information
+      command: "nvme list -o json"
+      register: nvme_list
+      changed_when: false
+      become: true
Member

nit: Is root required? At least on Fedora 40 it doesn't appear to be. Feel free to ignore if you're not sure.

Contributor Author

I'm not sure whether all disks are listed when the command is not run as root.


+    - name: Parse NVMe device model names
+      set_fact:
+        nvme_models: "{{ nvme_models | default([]) + [item.ModelNumber] }}"
+      loop: "{{ nvme_list.stdout | from_json | json_query('Devices[].{ModelNumber: ModelNumber}') }}"
+      changed_when: false
+
+    - name: Set gathered NVMe models as host facts
+      set_fact:
+        unique_nvme_models: "{{ nvme_models | unique }}"
+      run_once: true
+
+- name: Update stackhpc-monitoring.yml with DWPD ratings
+  hosts: localhost
+  gather_facts: no
+  tasks:
+    - name: Aggregate unique NVMe models from all hosts
+      set_fact:
+        all_nvme_models: "{{ all_nvme_models | default([]) | union(hostvars[item].unique_nvme_models | default([])) }}"
+      with_items: "{{ groups['overcloud'] }}"
+      run_once: true
+
+    - name: Ensure unique NVMe models
+      set_fact:
+        all_nvme_models: "{{ all_nvme_models | unique }}"
+      run_once: true
+
+    - name: Create a dictionary for quick lookup of DWPD ratings
+      set_fact:
+        dwpd_lookup: "{{ stackhpc_dwpd_ratings | items2dict(key_name='model_name', value_name='rated_dwpd') }}"
+      when: stackhpc_dwpd_ratings is defined and stackhpc_dwpd_ratings | length > 0
+      run_once: true
+
+    - name: Generate new DWPD ratings section
+      set_fact:
+        new_dwpd_section: |
+          stackhpc_dwpd_ratings:
+          {% for model in all_nvme_models %}
+            - model_name: "{{ model }}"
+              rated_dwpd: "{{ dwpd_lookup[model] if model in dwpd_lookup else '#FILL ME IN' }}"
+          {% endfor %}
+      run_once: true
+
+    - name: Read the current stackhpc-monitoring.yml file
+      slurp:
+        src: "{{ playbook_dir }}/../stackhpc-monitoring.yml"
+      register: monitoring_file_content
+
+    - name: Ensure markers exist in the file
+      set_fact:
+        markers_exist: "{{ ('# BEGIN DWPD Ratings' in old_content) and ('# END DWPD Ratings' in old_content) }}"
+      vars:
+        old_content: "{{ monitoring_file_content.content | b64decode }}"
+      run_once: true
+
+    - name: Fail if markers do not exist
+      fail:
+        msg: "The stackhpc-monitoring.yml file does not contain the required markers: # BEGIN DWPD Ratings and # END DWPD Ratings"
+      when: not markers_exist
+      run_once: true
+
+    - name: Update the content with new DWPD ratings section
+      set_fact:
+        updated_monitoring_content: |
+          {% set old_content = monitoring_file_content.content | b64decode %}
+          {% set before_section = old_content.split('# BEGIN DWPD Ratings')[0] %}
+          {% set after_section = old_content.split('# END DWPD Ratings')[1] %}
+          {{ before_section }}# BEGIN DWPD Ratings
+          {{ new_dwpd_section }}
+          # END DWPD Ratings{{ after_section }}
+      when: markers_exist
+      run_once: true
+
+    - name: Write the updated content back to stackhpc-monitoring.yml
+      copy:
+        content: "{{ updated_monitoring_content }}"
+        dest: "{{ playbook_dir }}/../stackhpc-monitoring.yml"
+        backup: yes
+      when: markers_exist
+      run_once: true
+
+    - name: Print new DWPD ratings section
+      debug:
+        msg:
+          - "{{ new_dwpd_section }}"
+          - "PLEASE UPDATE stackhpc-monitoring.yml IF NEEDED AND REMEMBER TO COMMIT THE FILE TO GIT"
+      run_once: true
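
The playbook rewrites only the text between the `# BEGIN DWPD Ratings` and `# END DWPD Ratings` markers and leaves the rest of the file alone. The same idea can be sketched in shell for checking the behaviour by hand. This is an illustration only: `replace_between_markers` is a hypothetical helper that is not part of this PR, and it assumes each marker sits on a line of its own.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): read a file on stdin and print it
# with everything between the two marker lines replaced by new_body.
# The marker lines themselves are kept. If the begin marker is absent,
# the input passes through unchanged.
replace_between_markers() {
  local begin="$1" end="$2" new_body="$3"
  # Pass the body through the environment so awk does not interpret
  # backslash escapes in it, as it would with -v.
  NEW_BODY="$new_body" awk -v begin="$begin" -v end="$end" '
    $0 == begin { print; print ENVIRON["NEW_BODY"]; skipping = 1; next }
    $0 == end   { skipping = 0 }
    !skipping   { print }
  '
}

# Usage (prints the rewritten file to stdout):
#   replace_between_markers '# BEGIN DWPD Ratings' '# END DWPD Ratings' \
#     "$new_section" < stackhpc-monitoring.yml
```

Printing to stdout rather than editing in place keeps the original file intact until the result has been reviewed, which mirrors the `backup: yes` safeguard on the playbook's `copy` task.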
73 changes: 53 additions & 20 deletions etc/kayobe/ansible/scripts/nvmemon.sh
@@ -21,6 +21,28 @@ if ! command -v nvme >/dev/null 2>&1; then
   exit 1
 fi

+# Set path to the DWPD ratings file
+dwpd_file="/opt/kayobe/etc/monitoring/dwpd_ratings.yml"
+
+# Function to load rated DWPD values from the YML file
+load_dwpd_ratings() {
+  declare -gA rated_dwpd
+  if [[ -f "$dwpd_file" ]]; then
+    while IFS= read -r line; do
+      key="$(echo "$line" | jq -r '.model_name')"
+      value="$(echo "$line" | jq -r '.rated_dwpd')"
+      # Strip trailing spaces
+      key="$(echo "$key" | sed 's/[[:space:]]*$//')"
+      value="$(echo "$value" | sed 's/[[:space:]]*$//')"
+      rated_dwpd["$key"]="$value"
+    done < <(jq -c '.[]' "$dwpd_file")
+  else
+    echo "Warning: DWPD ratings file not found at $dwpd_file. Defaulting to 1 DWPD."
+  fi
+}
+
+load_dwpd_ratings
+
 output_format_awk="$(
 cat <<'OUTPUTAWK'
 BEGIN { v = "" }
@@ -44,58 +66,69 @@ format_output() {
 nvme_version="$(nvme version | awk '$1 == "nvme" {print $3}')"
 echo "nvmecli{version=\"${nvme_version}\"} 1" | format_output

-# Get devices (DevicePath and PhysicalSize)
-device_info="$(nvme list -o json | jq -c '.Devices[] | {DevicePath: .DevicePath, PhysicalSize: .PhysicalSize}')"
+# Get devices (DevicePath, PhysicalSize and ModelNumber)
+device_info="$(nvme list -o json | jq -c '.Devices[] | {DevicePath, PhysicalSize, ModelNumber}')"

+# Convert device_info to an array
+device_info_array=()
+while IFS= read -r line; do
+  device_info_array+=("$line")
+done <<< "$device_info"
+
 # Loop through the NVMe devices
-echo "$device_info" | while read -r device_data; do
-  device=$(echo "$device_data" | jq -r '.DevicePath')
+for device_data in "${device_info_array[@]}"; do
+  device="$(echo "$device_data" | jq -r '.DevicePath')"
   json_check="$(nvme smart-log -o json "${device}")"
   disk="${device##*/}"
+  model_name="$(echo "$device_data" | jq -r '.ModelNumber')"

-  physical_size=$(echo "$device_data" | jq -r '.PhysicalSize')
-  echo "physical_size_bytes{device=\"${disk}\"} ${physical_size}"
+  physical_size="$(echo "$device_data" | jq -r '.PhysicalSize')"
+  echo "physical_size_bytes{device=\"${disk}\",model=\"${model_name}\"} ${physical_size}"

   # The temperature value in JSON is in Kelvin, we want Celsius
   value_temperature="$(echo "$json_check" | jq '.temperature - 273')"
-  echo "temperature_celsius{device=\"${disk}\"} ${value_temperature}"
+  echo "temperature_celsius{device=\"${disk}\",model=\"${model_name}\"} ${value_temperature}"

+  # Get the rated DWPD from the dictionary or default to 1 if not found
+  value_rated_dwpd="${rated_dwpd[$model_name]:-1}"
+  echo "rated_dwpd{device=\"${disk}\",model=\"${model_name}\"} ${value_rated_dwpd}"
+
   value_available_spare="$(echo "$json_check" | jq '.avail_spare / 100')"
-  echo "available_spare_ratio{device=\"${disk}\"} ${value_available_spare}"
+  echo "available_spare_ratio{device=\"${disk}\",model=\"${model_name}\"} ${value_available_spare}"

   value_available_spare_threshold="$(echo "$json_check" | jq '.spare_thresh / 100')"
-  echo "available_spare_threshold_ratio{device=\"${disk}\"} ${value_available_spare_threshold}"
+  echo "available_spare_threshold_ratio{device=\"${disk}\",model=\"${model_name}\"} ${value_available_spare_threshold}"

   value_percentage_used="$(echo "$json_check" | jq '.percent_used / 100')"
-  echo "percentage_used_ratio{device=\"${disk}\"} ${value_percentage_used}"
+  echo "percentage_used_ratio{device=\"${disk}\",model=\"${model_name}\"} ${value_percentage_used}"

   value_critical_warning="$(echo "$json_check" | jq '.critical_warning')"
-  echo "critical_warning_total{device=\"${disk}\"} ${value_critical_warning}"
+  echo "critical_warning_total{device=\"${disk}\",model=\"${model_name}\"} ${value_critical_warning}"

   value_media_errors="$(echo "$json_check" | jq '.media_errors')"
-  echo "media_errors_total{device=\"${disk}\"} ${value_media_errors}"
+  echo "media_errors_total{device=\"${disk}\",model=\"${model_name}\"} ${value_media_errors}"

   value_num_err_log_entries="$(echo "$json_check" | jq '.num_err_log_entries')"
-  echo "num_err_log_entries_total{device=\"${disk}\"} ${value_num_err_log_entries}"
+  echo "num_err_log_entries_total{device=\"${disk}\",model=\"${model_name}\"} ${value_num_err_log_entries}"

   value_power_cycles="$(echo "$json_check" | jq '.power_cycles')"
-  echo "power_cycles_total{device=\"${disk}\"} ${value_power_cycles}"
+  echo "power_cycles_total{device=\"${disk}\",model=\"${model_name}\"} ${value_power_cycles}"

   value_power_on_hours="$(echo "$json_check" | jq '.power_on_hours')"
-  echo "power_on_hours_total{device=\"${disk}\"} ${value_power_on_hours}"
+  echo "power_on_hours_total{device=\"${disk}\",model=\"${model_name}\"} ${value_power_on_hours}"

   value_controller_busy_time="$(echo "$json_check" | jq '.controller_busy_time')"
-  echo "controller_busy_time_seconds{device=\"${disk}\"} ${value_controller_busy_time}"
+  echo "controller_busy_time_seconds{device=\"${disk}\",model=\"${model_name}\"} ${value_controller_busy_time}"

   value_data_units_written="$(echo "$json_check" | jq '.data_units_written')"
-  echo "data_units_written_total{device=\"${disk}\"} ${value_data_units_written}"
+  echo "data_units_written_total{device=\"${disk}\",model=\"${model_name}\"} ${value_data_units_written}"

   value_data_units_read="$(echo "$json_check" | jq '.data_units_read')"
-  echo "data_units_read_total{device=\"${disk}\"} ${value_data_units_read}"
+  echo "data_units_read_total{device=\"${disk}\",model=\"${model_name}\"} ${value_data_units_read}"

   value_host_read_commands="$(echo "$json_check" | jq '.host_read_commands')"
-  echo "host_read_commands_total{device=\"${disk}\"} ${value_host_read_commands}"
+  echo "host_read_commands_total{device=\"${disk}\",model=\"${model_name}\"} ${value_host_read_commands}"

   value_host_write_commands="$(echo "$json_check" | jq '.host_write_commands')"
-  echo "host_write_commands_total{device=\"${disk}\"} ${value_host_write_commands}"
+  echo "host_write_commands_total{device=\"${disk}\",model=\"${model_name}\"} ${value_host_write_commands}"
 done | format_output
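
The per-device lookup in the script relies on a bash associative array with a `:-1` fallback, so a drive whose model is not in the ratings file is treated as 1 DWPD. A minimal sketch of that lookup in isolation follows. The model names and ratings are made up, `lookup_rated_dwpd` is a hypothetical helper, and trimming the model name before the lookup is an assumption added here (the script only trims the keys read from the ratings file).

```shell
#!/usr/bin/env bash
# Made-up ratings table for illustration; real values come from the ratings file.
declare -gA rated_dwpd=(
  ["EXAMPLE MODEL A"]="3"
  ["EXAMPLE MODEL B"]="0.3"
)

# Hypothetical helper: print the rated DWPD for a model, or 1 if unknown.
lookup_rated_dwpd() {
  local model="$1"
  # Strip trailing whitespace using parameter expansion (no sed subprocess).
  model="${model%"${model##*[![:space:]]}"}"
  # An empty key is not a valid associative-array subscript, so guard it.
  if [[ -z "$model" ]]; then
    echo 1
    return
  fi
  # Same fallback as the script: default to 1 when the model is not listed.
  echo "${rated_dwpd[$model]:-1}"
}

# Usage:
#   lookup_rated_dwpd "EXAMPLE MODEL B   "   # trailing padding is ignored
#   lookup_rated_dwpd "SOME OTHER MODEL"     # falls back to the default
```

If the model string reported by `nvme list` and the key stored in the ratings file differ only in trailing padding, a lookup without trimming would silently fall back to the default, which is why the sketch normalises both sides the same way.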
19 changes: 18 additions & 1 deletion etc/kayobe/ansible/smartmon-tools.yml
@@ -1,6 +1,5 @@
 ---
 - hosts: overcloud
-
   tasks:
     - name: Ensure smartmontools, jq, nvme-cli and cron/cronie are installed
       package:
@@ -49,3 +48,21 @@
         - smartmon
         - nvmemon
       become: yes
+
+    - name: Ensure the DWPD Ratings directory exists
+      file:
+        path: /opt/kayobe/etc/monitoring
+        state: directory
+        mode: '0755'
+      when: stackhpc_dwpd_ratings is defined
+      become: true
+
+    - name: Create a DWPD ratings file
+      copy:
+        content: |
Member

This could just be the raw file that you write out above when running the playbook?

+          {% for drive in stackhpc_dwpd_ratings %}
+          {{ drive.model_name }}: {{ drive.rated_dwpd }}
+          {% endfor %}
+        dest: /opt/kayobe/etc/monitoring/dwpd_ratings.yml
+      when: stackhpc_dwpd_ratings is defined
+      become: true
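
The template in this task writes one `model_name: rated_dwpd` line per drive, whereas `load_dwpd_ratings` in `nvmemon.sh` iterates the file with `jq -c '.[]'` and reads `.model_name` and `.rated_dwpd` from each element. In case the line-per-drive form is the one that is kept, a minimal sketch of reading it in plain bash could look like the following. `load_ratings_from_lines` and the `ratings` array are hypothetical names, not part of this PR, and the sketch assumes the rating itself never contains a colon.

```shell
#!/usr/bin/env bash
# Hypothetical reader for "model name: rating" lines (not the PR's parser).
declare -gA ratings

load_ratings_from_lines() {
  local line key value
  while IFS= read -r line; do
    # Skip blank lines, comments, and lines without a colon.
    [[ -z "$line" || "$line" == \#* || "$line" != *:* ]] && continue
    # Split at the last colon so a colon inside the model name survives.
    key="${line%:*}"
    value="${line##*:}"
    # Trim trailing whitespace from the key and surrounding whitespace
    # from the value.
    key="${key%"${key##*[![:space:]]}"}"
    value="${value#"${value%%[![:space:]]*}"}"
    value="${value%"${value##*[![:space:]]}"}"
    if [[ -n "$key" ]]; then
      ratings["$key"]="$value"
    fi
  done
}

# Usage (redirect rather than pipe, so the array is set in the current shell):
#   load_ratings_from_lines < /opt/kayobe/etc/monitoring/dwpd_ratings.yml
```

Feeding the function through a redirect instead of a pipe matters in bash: a pipeline would run the loop in a subshell and the populated array would be lost when it exits.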
8 changes: 4 additions & 4 deletions etc/kayobe/kolla/config/prometheus/smart.rules
@@ -14,19 +14,19 @@ groups:
           description: "{{ $labels.instance }} is reporting unhealthy for the disk at {{ $labels.disk }}. Disk serial number is: {{ $labels.serial_number }}"

       - alert: DWPDTooHigh
-        expr: (delta(nvme_data_units_written_total[30d])*512000 / nvme_physical_size_bytes) / 30 > 1
+        expr: (delta(nvme_data_units_written_total[30d])*512000 / nvme_physical_size_bytes) / 30 > nvme_rated_dwpd
         labels:
           severity: alert
         annotations:
           summary: "High 30-Day Average DWPD for {{ $labels.instance }}"
-          description: "The 30-Day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds 1 DWPD"
+          description: "The 30-Day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds the rated DWPD"

       - alert: DWPDTooHighWarning
-        expr: (delta(nvme_data_units_written_total[7d])*512000 / nvme_physical_size_bytes) / 7 > 1
+        expr: (delta(nvme_data_units_written_total[7d])*512000 / nvme_physical_size_bytes) / 7 > nvme_rated_dwpd
         labels:
           severity: warning
         annotations:
           summary: "High 7-Day Average DWPD for {{ $labels.instance }}"
-          description: "The 7-day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds 1 DWPD"
+          description: "The 7-day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds the rated DWPD"

 {% endraw %}
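
Both alert expressions compute an average number of full-drive writes per day: the increase in data units written over the window, times 512000 bytes per unit, divided by the drive's physical size and by the number of days. The arithmetic can be checked by hand with a small sketch. `compute_dwpd` is a hypothetical helper and the figures in the usage note are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the alert expression:
#   (units_delta * 512000 / size_bytes) / days
compute_dwpd() {
  local units_delta="$1" size_bytes="$2" days="$3"
  awk -v u="$units_delta" -v s="$size_bytes" -v d="$days" \
    'BEGIN { printf "%.3f\n", (u * 512000 / s) / d }'
}

# Usage, for an invented 1.92 TB drive that wrote 225000000 data units in 30 days:
#   compute_dwpd 225000000 1920000000000 30
# 225000000 * 512000 bytes = 115.2 TB written, which is 60 full-drive writes,
# or 2 per day. That would fire the alert for a drive rated at 1 DWPD but not
# for one rated at 3 DWPD.
```

This is why comparing against `nvme_rated_dwpd` instead of the constant 1 matters: the same write volume can be within specification for one model and excessive for another.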
3 changes: 3 additions & 0 deletions etc/kayobe/stackhpc-monitoring.yml
@@ -23,3 +23,6 @@ stackhpc_enable_os_capacity: true
 # Whether TLS certificate verification is enabled for the OpenStack Capacity
 # exporter during Keystone authentication.
 stackhpc_os_capacity_openstack_verify: true
+
+# BEGIN DWPD Ratings
+# END DWPD Ratings
8 changes: 8 additions & 0 deletions releasenotes/notes/rated-dwpd-40526e85e24ef7ea.yaml
@@ -0,0 +1,8 @@
+---
+features:
+  - |
+    Add support for the operator supplying the rated DWPD value for NVMe
+    drives. A new playbook, ``get-nvme-drives.yml``, populates a new section
+    in the ``stackhpc-monitoring.yml`` file with the model names of the NVMe
+    drives in the cloud. The operator can then fill in the rated DWPD value
+    for each drive.