First pass at resolving the snmp test failures
The failures may have been caused by the observability strategy tests installing both the community and Red Hat monitoring components, which shadowed the monitoring component definitions (Prometheus, Alertmanager, etc.).
elfiesmelfie committed Oct 25, 2024
1 parent 35c1956 commit cf0b8ec
Showing 1 changed file with 62 additions and 19 deletions.
81 changes: 62 additions & 19 deletions roles/test_snmp_traps/tasks/main.yml
@@ -3,34 +3,77 @@
# Following procedure on https://infrawatch.github.io/documentation/#configuring-snmp-traps_assembly-advanced-features
# Assuming we're in the right project already...

- name: "RHELOSP-144987"
# description: "Set the alerting.alertmanager.receivers.snmpTraps parameters"
# I think that messing with the observability strategy might have affected the results of all the tests.
# I'm going to re-run the job and see if the results are the same now that the observability strategy role is not being run.
# The other roles likely need a similar workaround to this.
- name: Get the observability strategy

Check failure on line 9 in roles/test_snmp_traps/tasks/main.yml (GitHub Actions / build): no-changed-when: Commands should not change things if nothing needs doing.
ansible.builtin.shell:
cmd: |
oc get stf default -ojsonpath='{.spec.observabilityStrategy}'
register: observability_strategy
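
The `no-changed-when` rule applies to this task because it runs a command without saying whether it changes anything. A minimal sketch of one way to satisfy the rule, assuming the lookup is read-only and needs no shell features (so `ansible.builtin.command` is enough); this is an illustration, not the form used in the commit:

```yaml
# Sketch only: a read-only lookup that never reports a change.
- name: Get the observability strategy
  ansible.builtin.command:
    cmd: oc get stf default -ojsonpath='{.spec.observabilityStrategy}'
  register: observability_strategy
  changed_when: false
```

The registered `observability_strategy.stdout` is used the same way by the `set_fact` task that follows.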

- name: "Set the observability api based on the observability strategy"
ansible.builtin.set_fact:
observability_api: "{{ 'monitoring.rhobs' if observability_strategy.stdout == 'use_redhat' else 'monitoring.coreos.com' }}"

- name: "RHELOSP-144987 Set the alerting.alertmanager.receivers.snmpTraps parameters"
ansible.builtin.shell:
cmd: |
oc patch stf/default --type merge -p '{"spec": {"alerting": {"alertmanager": {"receivers": {"snmpTraps": {"enabled": true, "target": "10.10.10.10" }}}}}}'
changed_when: false
register: cmd_output
failed_when: cmd_output.rc != 0


- name: "RHELOSP-144966"
# description: "Interrupt metrics flow by preventing the QDR from running"
# Note: the apiversion used depends on the observability strategy.
# There should be some parameter passed here to select the api based on observability strategy
- name: "Create an alert for an interruption to metrics"

Check failure on line 29 in roles/test_snmp_traps/tasks/main.yml (GitHub Actions / build): no-changed-when: Commands should not change things if nothing needs doing.
ansible.builtin.shell:
cmd: |
for i in {1..30}; do oc delete po -l application=default-interconnect; sleep 1; done
changed_when: false
oc apply -f - <<EOF
apiVersion: {{ observability_api }}/v1
kind: PrometheusRule
metadata:
creationTimestamp: null
labels:
prometheus: default
role: alert-rules
name: test-prometheus-alarm-rules-snmp
namespace: service-telemetry
spec:
groups:
- name: ./openstack.rules
rules:
- alert: Collectd metrics receive rate is zero
expr: rate(sg_total_collectd_msg_received_count[1m]) == 0
labels:
oid: 1.3.6.1.4.1.50495.15.1.2.1
severity: critical
EOF
- name: "RHELOSP-144481"
# description: "Check for snmpTraps logs"
ansible.builtin.shell:
cmd: |
oc logs -l "app=default-snmp-webhook" | grep "Sending SNMP trap" | wc -l
register: cmd_output
changed_when: false
failed_when: "cmd_output.stdout|int == 0"
- name: "Run the test"
block:
- name: "RHELOSP-144966 Interrupt metrics flow by preventing the QDR from running"
ansible.builtin.shell:
cmd: |
for i in {1..30}; do oc delete po -l application=default-interconnect; sleep 1; done
changed_when: false

- name: "RHELOSP-144481 Check for snmpTraps logs"
ansible.builtin.shell:
cmd: |
oc logs -l "app=default-snmp-webhook" | grep "Sending SNMP trap"
register: cmd_output
changed_when: false
failed_when: "cmd_output.stdout_lines | length == 0"

- name: "Wait 2 minutes to make sure all SG pods are back to normal"
ansible.builtin.pause:
minutes: 2
changed_when: false
always:
- name: "Delete the alert"

Check failure on line 70 in roles/test_snmp_traps/tasks/main.yml (GitHub Actions / build): no-changed-when: Commands should not change things if nothing needs doing.
ansible.builtin.shell:
cmd: |
oc delete prometheusrules.{{ observability_api }} test-prometheus-alarm-rules-snmp
# TODO: update the test to check for the SG pods being recreated
- name: "Wait 2 minutes to make sure all SG pods are back to normal"
ansible.builtin.pause:
minutes: 2
changed_when: false
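
The TODO about checking that the SG pods are recreated could be handled by polling for pod readiness rather than pausing for a fixed two minutes. A minimal sketch, assuming `oc wait` is available on the test host; the label selector for the smart gateway pods is not given in this commit, so `sg_pod_selector` is a hypothetical variable that would have to be defined elsewhere:

```yaml
# Sketch only: sg_pod_selector is a hypothetical variable holding the label
# selector for the smart gateway pods; this commit does not define one.
- name: "Wait for the SG pods to be Ready again"
  ansible.builtin.command:
    cmd: >-
      oc wait --for=condition=Ready pod
      -l {{ sg_pod_selector }}
      --timeout=120s
  register: sg_wait
  # oc wait fails outright when no pod matches the selector yet,
  # so retry until the replacement pods exist and report Ready.
  until: sg_wait.rc == 0
  retries: 6
  delay: 10
  changed_when: false
```

Compared with the fixed pause, this fails the play when the pods never come back instead of silently continuing.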
