diff --git a/.automation b/.automation index 7676aa89f..a7de3abb3 160000 --- a/.automation +++ b/.automation @@ -1 +1 @@ -Subproject commit 7676aa89f0fde7291a846179c8820a8acc5c69ce +Subproject commit a7de3abb3f0bf529e78c4ba9ad1cbe26d356dd3b diff --git a/doc/source/configuration/release-train.rst b/doc/source/configuration/release-train.rst index 0d62fadfd..f77109aff 100644 --- a/doc/source/configuration/release-train.rst +++ b/doc/source/configuration/release-train.rst @@ -1,3 +1,5 @@ +.. _stackhpc_release_train: + ====================== StackHPC Release Train ====================== diff --git a/doc/source/contributor/package-updates.rst b/doc/source/contributor/package-updates.rst index 6e324d6c3..5469745c7 100644 --- a/doc/source/contributor/package-updates.rst +++ b/doc/source/contributor/package-updates.rst @@ -63,18 +63,20 @@ The following steps describe the process to test the new package and container r Creating the multinode environments ----------------------------------- -There is a comprehensive guide to setting up a multinode environment with Terraform, found here: https://github.com/stackhpc/terraform-kayobe-multinode. There are some things to note: +The `Multinode deployment workflow `_ can be used to automatically test changes. + +To manually test the changes, there is a comprehensive guide to set up a Multinode environment with Terraform, found here: https://github.com/stackhpc/terraform-kayobe-multinode. There are some things to note: * OVN is enabled by default, you should override it under ``etc/kayobe/environments/ci-multinode/kolla.yml kolla_enable_ovn: false`` for the OVS multinode environment. -* Remember to set different vxlan_vnis for each. +* Remember to set a different ``vxlan_vni`` for each. -* Before starting any tests, run ``dnf distro-sync`` on each host to ensure you are using the same snapshots as in the release train. 
You can do this using the following commands: +* Before starting any tests, run ``dnf distro-sync -y`` on each host to ensure you are using the same snapshots as in the release train. The ``-y`` option prevents hosts from hanging while waiting for confirmation input. You can do this using the following commands: .. code-block:: console - kayobe seed host command run -b --command "dnf distro-sync" - kayobe overcloud host command run -b --command "dnf distro-sync" + kayobe seed host command run -b --command "dnf distro-sync -y" + kayobe overcloud host command run -b --command "dnf distro-sync -y" * This may have installed a new kernel version. If so, you will need to reboot the overcloud hosts. You can check the installed kernels and the currently running kernel with the following commands. If the latest listed version is not running, you will need to reboot. @@ -85,7 +87,7 @@ There is a comprehensive guide to setting up a multinode environment with Terraf kayobe playbook run --limit seed,overcloud $KAYOBE_CONFIG_PATH/ansible/reboot.yml -* The tempest tests run automatically at the end of deploy-openstack.sh. If you have the time, it is worth fixing any failing tests you can so that there is greater coverage for the package updates. (Also remember to propose these fixes in the relevant repos where applicable.) +* The tempest tests run automatically at the end of the multinode deployment script. If you have the time, it is worth fixing any failing tests you can so that there is greater coverage for the package updates. (Also remember to propose these fixes in the relevant repos where applicable.) Upgrading host packages ----------------------- @@ -102,6 +104,7 @@ For Rocky Linux 9, bump the snapshot versions in /etc/yum/repos.d with: .. code-block:: console + kayobe seed host configure -t dnf kayobe overcloud host configure -t dnf Install new packages: @@ -112,22 +115,32 @@ Install new packages: Perform a rolling reboot of hosts: +..
note:: + In the Multinode environment, the seed-hypervisor cannot access control + plane instances with the OpenStack client. To use the OpenStack client, connect + to the Seed instance via SSH first. For authentication, use scp to copy + ``public-openrc.sh`` to the Seed. + .. code-block:: console - export ANSIBLE_SERIAL=1 - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit controllers - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit compute[0] + # Check your hypervisor hostname + (seed) openstack hypervisor list + + # Reboot controller instances and zeroth compute instance + (seed-hypervisor) export ANSIBLE_SERIAL=1 + (seed-hypervisor) kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit controllers + (seed-hypervisor) kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit compute[0] # Test live migration - openstack server create --image cirros --flavor m1.tiny --network external --hypervisor-hostname antelope-pkg-refresh-ovs-compute-02.novalocal --os-compute-api-version 2.74 server1 - openstack server migrate --live-migration server1 - watch openstack server show server1 + (seed) openstack server create --image cirros --flavor m1.tiny --network external --hypervisor-hostname --os-compute-api-version 2.74 server1 + (seed) openstack server migrate --live-migration server1 + (seed) watch openstack server show server1 - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit compute[1] + (seed-hypervisor) kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit compute[1] # Try and migrate back - openstack server migrate --live-migration server1 - watch openstack server show server1 + (seed) openstack server migrate --live-migration server1 + (seed) watch openstack server show server1 Upgrading containers within a release ------------------------------------- diff --git a/doc/source/operations/upgrading-openstack.rst b/doc/source/operations/upgrading-openstack.rst index
3821280a3..ef387c4ca 100644 --- a/doc/source/operations/upgrading-openstack.rst +++ b/doc/source/operations/upgrading-openstack.rst @@ -449,9 +449,8 @@ To upgrade the Ansible control host: Syncing Release Train artifacts ------------------------------- -New `StackHPC Release Train <../configuration/release-train>` content should be -synced to the local Pulp server. This includes host packages (Deb/RPM) and -container images. +New :ref:`stackhpc_release_train` content should be synced to the local Pulp +server. This includes host packages (Deb/RPM) and container images. .. _sync-rt-package-repos: @@ -968,17 +967,27 @@ would be applied: kayobe overcloud host configure --check --diff When ready to apply the changes, it may be advisable to do so in batches, or at -least start with a small number of hosts.: +least start with a small number of hosts: .. code-block:: console kayobe overcloud host configure --limit -Alternatively, to apply the configuration to all hosts: -.. code-block:: console +.. warning:: + + Take extra care when configuring Ceph hosts. Set the hosts to maintenance + mode before reconfiguring them, and unset when done: + + .. code-block:: console + + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/ceph-enter-maintenance.yml --limit + kayobe overcloud host configure --limit + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/ceph-exit-maintenance.yml --limit - kayobe overcloud host configure + **Always** reconfigure hosts in small batches or one-by-one. Check the Ceph + state after each host configuration. Ensure all warnings and errors are + resolved before moving on. .. 
_building_ironic_deployment_images: diff --git a/etc/kayobe/ansible/deploy-os-capacity-exporter.yml b/etc/kayobe/ansible/deploy-os-capacity-exporter.yml index 41d91bfbd..f0a2f7c9c 100644 --- a/etc/kayobe/ansible/deploy-os-capacity-exporter.yml +++ b/etc/kayobe/ansible/deploy-os-capacity-exporter.yml @@ -15,59 +15,61 @@ tags: os_capacity gather_facts: false tasks: - - name: Create os-capacity directory - ansible.builtin.file: - path: /opt/kayobe/os-capacity/ - state: directory - when: stackhpc_enable_os_capacity - - - name: Read admin-openrc credential file - ansible.builtin.command: - cmd: "cat {{ lookup('ansible.builtin.env', 'KOLLA_CONFIG_PATH') }}/admin-openrc.sh" + - name: Check if admin-openrc.sh exists + ansible.builtin.stat: + path: "{{ lookup('ansible.builtin.env', 'KOLLA_CONFIG_PATH') }}/admin-openrc.sh" delegate_to: localhost - register: credential - when: stackhpc_enable_os_capacity - changed_when: false + register: openrc_file_stat + run_once: true - - name: Set facts for admin credentials - ansible.builtin.set_fact: - stackhpc_os_capacity_auth_url: "{{ credential.stdout_lines | select('match', '.*OS_AUTH_URL*.') | first | split('=') | last | replace(\"'\",'') }}" - stackhpc_os_capacity_project_name: "{{ credential.stdout_lines | select('match', '.*OS_PROJECT_NAME*.') | first | split('=') | last | replace(\"'\",'') }}" - stackhpc_os_capacity_domain_name: "{{ credential.stdout_lines | select('match', '.*OS_PROJECT_DOMAIN_NAME*.') | first | split('=') | last | replace(\"'\",'') }}" - stackhpc_os_capacity_openstack_region_name: "{{ credential.stdout_lines | select('match', '.*OS_REGION_NAME*.') | first | split('=') | last | replace(\"'\",'') }}" - stackhpc_os_capacity_username: "{{ credential.stdout_lines | select('match', '.*OS_USERNAME*.') | first | split('=') | last | replace(\"'\",'') }}" - stackhpc_os_capacity_password: "{{ credential.stdout_lines | select('match', '.*OS_PASSWORD*.') | first | split('=') | last | replace(\"'\",'') }}" - when: 
stackhpc_enable_os_capacity + - block: + - name: Create os-capacity directory + ansible.builtin.file: + path: /opt/kayobe/os-capacity/ + state: directory - - name: Template clouds.yml - ansible.builtin.template: - src: templates/os_capacity-clouds.yml.j2 - dest: /opt/kayobe/os-capacity/clouds.yaml - when: stackhpc_enable_os_capacity - register: clouds_yaml_result + - name: Read admin-openrc credential file + ansible.builtin.command: + cmd: "cat {{ lookup('ansible.builtin.env', 'KOLLA_CONFIG_PATH') }}/admin-openrc.sh" + delegate_to: localhost + register: credential + changed_when: false - - name: Copy CA certificate to OpenStack Capacity nodes - ansible.builtin.copy: - src: "{{ stackhpc_os_capacity_openstack_cacert }}" - dest: /opt/kayobe/os-capacity/cacert.pem - when: - - stackhpc_enable_os_capacity - - stackhpc_os_capacity_openstack_cacert | length > 0 - register: cacert_result + - name: Set facts for admin credentials + ansible.builtin.set_fact: + stackhpc_os_capacity_auth_url: "{{ credential.stdout_lines | select('match', '.*OS_AUTH_URL*.') | first | split('=') | last | replace(\"'\",'') }}" + stackhpc_os_capacity_project_name: "{{ credential.stdout_lines | select('match', '.*OS_PROJECT_NAME*.') | first | split('=') | last | replace(\"'\",'') }}" + stackhpc_os_capacity_domain_name: "{{ credential.stdout_lines | select('match', '.*OS_PROJECT_DOMAIN_NAME*.') | first | split('=') | last | replace(\"'\",'') }}" + stackhpc_os_capacity_openstack_region_name: "{{ credential.stdout_lines | select('match', '.*OS_REGION_NAME*.') | first | split('=') | last | replace(\"'\",'') }}" + stackhpc_os_capacity_username: "{{ credential.stdout_lines | select('match', '.*OS_USERNAME*.') | first | split('=') | last | replace(\"'\",'') }}" + stackhpc_os_capacity_password: "{{ credential.stdout_lines | select('match', '.*OS_PASSWORD*.') | first | split('=') | last | replace(\"'\",'') }}" - - name: Ensure os_capacity container is running - community.docker.docker_container: - name: 
os_capacity - image: ghcr.io/stackhpc/os-capacity:master - env: - OS_CLOUD: openstack - OS_CLIENT_CONFIG_FILE: /etc/openstack/clouds.yaml - mounts: - - type: bind - source: /opt/kayobe/os-capacity/ - target: /etc/openstack/ - network_mode: host - restart: "{{ clouds_yaml_result is changed or cacert_result is changed }}" - restart_policy: unless-stopped - become: true - when: stackhpc_enable_os_capacity + - name: Template clouds.yml + ansible.builtin.template: + src: templates/os_capacity-clouds.yml.j2 + dest: /opt/kayobe/os-capacity/clouds.yaml + register: clouds_yaml_result + + - name: Copy CA certificate to OpenStack Capacity nodes + ansible.builtin.copy: + src: "{{ stackhpc_os_capacity_openstack_cacert }}" + dest: /opt/kayobe/os-capacity/cacert.pem + when: stackhpc_os_capacity_openstack_cacert | length > 0 + register: cacert_result + + - name: Ensure os_capacity container is running + community.docker.docker_container: + name: os_capacity + image: ghcr.io/stackhpc/os-capacity:{{ stackhpc_os_capacity_version }} + env: + OS_CLOUD: openstack + OS_CLIENT_CONFIG_FILE: /etc/openstack/clouds.yaml + mounts: + - type: bind + source: /opt/kayobe/os-capacity/ + target: /etc/openstack/ + network_mode: host + restart: "{{ clouds_yaml_result is changed or cacert_result is changed }}" + restart_policy: unless-stopped + become: true + when: stackhpc_enable_os_capacity and openrc_file_stat.stat.exists diff --git a/etc/kayobe/inventory/group_vars/cis-hardening/cis b/etc/kayobe/inventory/group_vars/cis-hardening/cis index 37d01492b..2c103cb34 100644 --- a/etc/kayobe/inventory/group_vars/cis-hardening/cis +++ b/etc/kayobe/inventory/group_vars/cis-hardening/cis @@ -51,6 +51,9 @@ rhel9cis_rule_6_1_15: false # filesystem. We do not want to change /var/lib/docker permissions. 
rhel9cis_no_world_write_adjust: false +# Prevent hardening from recursively changing permissions on log files +rhel9cis_rule_4_2_3: false + # Configure log rotation to prevent audit logs from filling the disk rhel9cis_auditd: space_left_action: syslog @@ -153,6 +156,9 @@ ubtu22cis_no_owner_adjust: false ubtu22cis_no_world_write_adjust: false ubtu22cis_suid_adjust: false +# Prevent hardening from recursively changing permissions on log files +ubtu22cis_rule_4_2_3: false + # Configure log rotation to prevent audit logs from filling the disk ubtu22cis_auditd: action_mail_acct: root diff --git a/etc/kayobe/ipa.yml b/etc/kayobe/ipa.yml index 0138c6c44..590ebbf33 100644 --- a/etc/kayobe/ipa.yml +++ b/etc/kayobe/ipa.yml @@ -30,7 +30,9 @@ # List of additional Diskimage Builder (DIB) elements to use when building IPA # images. Default is none. -#ipa_build_dib_elements_extra: +ipa_build_dib_elements_extra: + - extra-hardware + - mellanox # List of Diskimage Builder (DIB) elements to use when building IPA images. # Default is combination of ipa_build_dib_elements_default and @@ -117,7 +119,10 @@ #ipa_collectors_default: # List of additional inspection collectors to run. -#ipa_collectors_extra: +ipa_collectors_extra: + - "dmi-decode" + - "extra-hardware" + - "numa-topology" # List of inspection collectors to run. #ipa_collectors: @@ -135,7 +140,11 @@ #ipa_kernel_options_default: # List of additional kernel parameters for Ironic python agent. -#ipa_kernel_options_extra: +ipa_kernel_options_extra: + # Useful until NTP is configured by default + - ipa-insecure=1 + # Avoid disk benchmark failures on some NVMe drives + - nvme_core.multipath=N # List of kernel parameters for Ironic python agent.
#ipa_kernel_options: diff --git a/etc/kayobe/kolla-image-tags.yml b/etc/kayobe/kolla-image-tags.yml index 68c331aba..9919548cb 100644 --- a/etc/kayobe/kolla-image-tags.yml +++ b/etc/kayobe/kolla-image-tags.yml @@ -6,6 +6,9 @@ kolla_image_tags: openstack: rocky-9: 2024.1-rocky-9-20240903T113235 ubuntu-jammy: 2024.1-ubuntu-jammy-20240917T091559 + blazar: + rocky-9: 2024.1-rocky-9-20241125T093138 + ubuntu-jammy: 2024.1-ubuntu-jammy-20241125T093138 heat: rocky-9: 2024.1-rocky-9-20240805T142526 nova: diff --git a/etc/kayobe/kolla.yml b/etc/kayobe/kolla.yml index 00ffa169b..20d5b1ac5 100644 --- a/etc/kayobe/kolla.yml +++ b/etc/kayobe/kolla.yml @@ -150,6 +150,10 @@ kolla_sources: type: git location: https://github.com/stackhpc/octavia.git reference: stackhpc/{{ openstack_release }} + blazar-base: + type: git + location: https://github.com/stackhpc/blazar + reference: stackhpc/master ############################################################################### # Kolla image build configuration. 
diff --git a/etc/kayobe/kolla/config/grafana/dashboards/openstack/openstack.json b/etc/kayobe/kolla/config/grafana/dashboards/openstack/openstack.json index 6841ad19f..4de56d678 100644 --- a/etc/kayobe/kolla/config/grafana/dashboards/openstack/openstack.json +++ b/etc/kayobe/kolla/config/grafana/dashboards/openstack/openstack.json @@ -2126,6 +2126,19 @@ "title": "Glance Images", "type": "timeseries" }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 73 + }, + "id": 11, + "panels": [], + "title": "Logs", + "type": "row" + }, { "datasource": { "type": "prometheus", @@ -2390,6 +2403,360 @@ } ], "type": "table" + }, + { + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": "${os_datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "log": 2, + "type": "symlog" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "links": [ + { + "targetBlank": true, + "title": "Show in Opensearch", + "url": "http{% endraw %}{{ 's' if kolla_enable_tls_internal | bool else '' }}://{{ kolla_internal_vip_address }}{% raw %}:5601/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'${__from:date}',to:'${__to:date}'))&_a=(columns:!(_source),filters:!(),interval:auto,query:(language:lucene,query:'log_level:${loglevel:lucene} AND programname:(\"${__data.fields[\"programname.keyword\"]}\") AND Hostname:${host:lucene}'),sort:!())" + } + ], + "mappings": [], + 
"thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "programname.keyword" + }, + "properties": [ + { + "id": "displayName", + "value": "Program Name" + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 74 + }, + "id": 15, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "alias": "", + "bucketAggs": [ + { + "field": "programname.keyword", + "id": "3", + "settings": { + "min_doc_count": "1", + "order": "desc", + "orderBy": "_count", + "size": "20" + }, + "type": "terms" + }, + { + "field": "@timestamp", + "id": "2", + "settings": { + "interval": "1h", + "min_doc_count": "0", + "trimEdges": "0" + }, + "type": "date_histogram" + } + ], + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": "${os_datasource}" + }, + "format": "table", + "metrics": [ + { + "id": "1", + "type": "count" + } + ], + "query": "log_level:$loglevel AND programname:$program_name AND Hostname:$host", + "queryType": "lucene", + "refId": "A", + "timeField": "@timestamp" + } + ], + "title": "Number of $loglevel per service", + "type": "timeseries" + }, + { + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": "${os_datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "inspect": false + }, + "links": [ + { + "targetBlank": true, + "title": "Show in Opensearch", + "url": "http{% endraw %}{{ 's' if kolla_enable_tls_internal | bool else '' }}://{{ kolla_internal_vip_address }}{% raw 
%}:5601/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'${__from:date}',to:'${__to:date}'))&_a=(columns:!(_source),filters:!(),interval:auto,query:(language:lucene,query:'log_level:${loglevel:lucene} AND programname:(\"${__data.fields[\"programname.keyword\"]}\") AND Hostname:${host:lucene}'),sort:!())" + } + ], + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Count" + }, + "properties": [ + { + "id": "custom.cellOptions", + "value": { + "type": "gauge" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "programname.keyword" + }, + "properties": [ + { + "id": "displayName", + "value": "Program Name" + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 24, + "x": 0, + "y": 84 + }, + "id": 8, + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "alias": "", + "bucketAggs": [ + { + "field": "programname.keyword", + "id": "2", + "settings": { + "min_doc_count": "1", + "order": "desc", + "orderBy": "_count", + "size": "20" + }, + "type": "terms" + } + ], + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": "${os_datasource}" + }, + "format": "table", + "metrics": [ + { + "id": "1", + "type": "count" + } + ], + "query": "log_level:$loglevel AND programname:$program_name AND Hostname:$host", + "queryType": "lucene", + "refId": "A", + "timeField": "@timestamp" + } + ], + "title": "Number of $loglevel per service", + "type": "table" + }, + { + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": "${os_datasource}" + }, + "gridPos": { + "h": 9, + "w": 24, + "x": 0, + "y": 94 + }, + "id": 100, + "links": [ + { + "targetBlank": true, + "title": "View in Opensearch", + "url": "http{% endraw %}{{ 's' 
if kolla_enable_tls_internal | bool else '' }}://{{ kolla_internal_vip_address }}{% raw %}:5601/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'${__from:date}',to:'${__to:date}'))&_a=(columns:!(_source),filters:!(),interval:auto,query:(language:lucene,query:'log_level:${loglevel:lucene} AND programname:\"${program_name:lucene}\" AND Hostname:${host:lucene}'),sort:!())" + } + ], + "maxPerRow": 2, + "options": { + "dedupStrategy": "exact", + "enableLogDetails": true, + "prettifyLogMessage": true, + "showCommonLabels": false, + "showLabels": false, + "showTime": true, + "sortOrder": "Descending", + "wrapLogMessage": false + }, + "pluginVersion": "10.1.2", + "repeat": "program_name", + "repeatDirection": "v", + "targets": [ + { + "alias": "", + "bucketAggs": [ + { + "id": "2", + "settings": { + "min_doc_count": "0", + "order": "desc", + "orderBy": "_term", + "size": "10" + }, + "type": "terms" + } + ], + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": "${os_datasource}" + }, + "format": "logs", + "luceneQueryType": "Metric", + "metrics": [ + { + "id": "1", + "type": "logs" + } + ], + "query": "log_level:$loglevel AND programname:$program_name AND Hostname:$host", + "queryType": "lucene", + "refId": "A", + "timeField": "@timestamp" + } + ], + "title": "Logs - $loglevel - $program_name", + "transformations": [ + { + "id": "organize", + "options": { + "excludeByName": {}, + "includeByName": {}, + "indexByName": { + "@timestamp": 0, + "Hostname": 2, + "Payload": 1, + "_id": 3, + "_index": 4, + "_source": 5, + "_type": 6, + "level": 10, + "log_level": 7, + "payload": 9, + "programname": 8 + }, + "renameByName": {} + } + } + ], + "type": "logs" } ], "refresh": false, @@ -2493,6 +2860,112 @@ "refresh": 2, "skipUrlSync": false, "type": "interval" + }, + { + "current": { + "selected": true, + "text": [ + "ERROR" + ], + "value": [ + "ERROR" + ] + }, + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": 
"${os_datasource}" + }, + "definition": "{\"find\": \"terms\", \"query\": \"Hostname: ${host:lucene}\", \"field\": \"log_level.keyword\", \"size\": 1000}", + "hide": 0, + "includeAll": true, + "label": "Log Level", + "multi": true, + "name": "loglevel", + "options": [], + "query": "{\"find\": \"terms\", \"query\": \"Hostname: ${host:lucene}\", \"field\": \"log_level.keyword\", \"size\": 1000}", + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "grafana-opensearch-datasource", + "uid": "${os_datasource}" + }, + "definition": "{\"find\": \"terms\", \"query\": \"log_level: $loglevel AND Hostname: $host\", \"field\": \"programname.keyword\", \"size\": 1000}", + "hide": 0, + "includeAll": true, + "label": "Program Name", + "multi": true, + "name": "program_name", + "options": [], + "query": "{\"find\": \"terms\", \"query\": \"log_level: $loglevel AND Hostname: $host\", \"field\": \"programname.keyword\", \"size\": 1000}", + "refresh": 2, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "definition": "label_values(node_uname_info,nodename)", + "hide": 0, + "includeAll": true, + "label": "Host", + "multi": true, + "name": "host", + "options": [], + "query": { + "query": "label_values(node_uname_info,nodename)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 2, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "opensearch", + "value": "fdfos0ns7hce8f" + }, + "description": "Opensearch", + "hide": 0, + "includeAll": false, + "multi": false, + "name": "os_datasource", + "options": [], + "query": 
"grafana-opensearch-datasource", + "queryValue": "", + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "type": "datasource" } ] }, diff --git a/etc/kayobe/kolla/config/prometheus/prometheus.rules b/etc/kayobe/kolla/config/prometheus/prometheus.rules index c9803946a..20e1b303a 100644 --- a/etc/kayobe/kolla/config/prometheus/prometheus.rules +++ b/etc/kayobe/kolla/config/prometheus/prometheus.rules @@ -7,7 +7,7 @@ groups: rules: - alert: PrometheusTargetMissing - expr: up == 0 + expr: up{job!="redfish-exporter-seed"} == 0 for: 5m labels: severity: critical @@ -15,6 +15,15 @@ groups: summary: "Prometheus target missing (instance {{ $labels.instance }})" description: "A Prometheus target has disappeared. An exporter might have crashed." + - alert: PrometheusTargetMissing + expr: up{job="redfish-exporter-seed"} == 0 + for: 15m + labels: + severity: critical + annotations: + summary: "Prometheus target missing (instance {{ $labels.instance }})" + description: "A Prometheus target has disappeared. An exporter might have crashed." 
+ - alert: PrometheusAllTargetsMissing expr: count by (job) (up) == 0 for: 1m diff --git a/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/60-redfish.yml b/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/60-redfish.yml index 84e85e04f..6f234e5a0 100644 --- a/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/60-redfish.yml +++ b/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/60-redfish.yml @@ -15,13 +15,11 @@ scrape_configs: replacement: "{{ lookup('vars', admin_oc_net_name ~ '_ips')[groups.seed.0] }}:9610" static_configs: {% for host in groups.get('redfish_exporter_targets', []) %} -{% if hostvars[host]["redfish_exporter_scrape_group"] | default('overcloud') == 'overcloud' %} - targets: - '{{ hostvars[host]["redfish_exporter_target_address"] }}' labels: server: '{{ host }}' env: "{{ kayobe_environment | default('openstack') }}" group: "{{ hostvars[host]['redfish_exporter_scrape_group'] | default('overcloud') }}" -{% endif %} {% endfor %} {% endif %} diff --git a/etc/kayobe/stackhpc-monitoring.yml b/etc/kayobe/stackhpc-monitoring.yml index 3e9fb107e..cfc909511 100644 --- a/etc/kayobe/stackhpc-monitoring.yml +++ b/etc/kayobe/stackhpc-monitoring.yml @@ -34,6 +34,9 @@ alertmanager_packet_errors_threshold: 1 # targets being templated during deployment. stackhpc_enable_os_capacity: true +# OpenStack Capacity exporter version +stackhpc_os_capacity_version: v0.5 + # Path to a CA certificate file to trust in the OpenStack Capacity exporter. stackhpc_os_capacity_openstack_cacert: "" diff --git a/releasenotes/notes/cis-hardening-no-longer-sets-permissions-on-logs-81a48ab8ed2d6b5f.yaml b/releasenotes/notes/cis-hardening-no-longer-sets-permissions-on-logs-81a48ab8ed2d6b5f.yaml new file mode 100644 index 000000000..e50b5b62b --- /dev/null +++ b/releasenotes/notes/cis-hardening-no-longer-sets-permissions-on-logs-81a48ab8ed2d6b5f.yaml @@ -0,0 +1,8 @@ +--- +fixes: + - | + The CIS hardening scripts no longer change permissions of log files by + default. 
It is preferred to configure these permissions at source, i.e. on + whatever is creating the files. The rule also suffered from a time-of-check to + time-of-use race condition. If you want the old behaviour you can change + ``rhel9cis_rule_4_2_3`` and/or ``ubtu22cis_rule_4_2_3`` to ``true``. diff --git a/releasenotes/notes/fix-issue-with-redfish-exporter-scrape-group-b10eaac6ee1e6af3.yaml b/releasenotes/notes/fix-issue-with-redfish-exporter-scrape-group-b10eaac6ee1e6af3.yaml new file mode 100644 index 000000000..1ee5a9a41 --- /dev/null +++ b/releasenotes/notes/fix-issue-with-redfish-exporter-scrape-group-b10eaac6ee1e6af3.yaml @@ -0,0 +1,6 @@ +--- +fixes: + - | + Fixes an issue where setting ``redfish_exporter_scrape_group`` to a value + other than ``overcloud`` would exclude those nodes from the redfish + exporter scrapes. diff --git a/releasenotes/notes/ipa-inspection-settings-133fe91b1d855fa0.yaml b/releasenotes/notes/ipa-inspection-settings-133fe91b1d855fa0.yaml new file mode 100644 index 000000000..cfb761290 --- /dev/null +++ b/releasenotes/notes/ipa-inspection-settings-133fe91b1d855fa0.yaml @@ -0,0 +1,5 @@ +--- +features: + - | + Configures the Ironic Python Agent with useful settings for inspection, + such as the ``extra-hardware`` and ``mellanox`` elements. diff --git a/releasenotes/notes/logs-in-openstack-dashboard-6e345ff7f16c0658.yaml b/releasenotes/notes/logs-in-openstack-dashboard-6e345ff7f16c0658.yaml new file mode 100644 index 000000000..0176ac636 --- /dev/null +++ b/releasenotes/notes/logs-in-openstack-dashboard-6e345ff7f16c0658.yaml @@ -0,0 +1,5 @@ +--- +features: + - | + The OpenStack Dashboard in Grafana now includes logs from OpenStack + services.
diff --git a/releasenotes/notes/reduces-sensitivity-of-redfish-target-alerts-a3d77a3f0c3dac8a.yaml b/releasenotes/notes/reduces-sensitivity-of-redfish-target-alerts-a3d77a3f0c3dac8a.yaml new file mode 100644 index 000000000..0ba59ea7a --- /dev/null +++ b/releasenotes/notes/reduces-sensitivity-of-redfish-target-alerts-a3d77a3f0c3dac8a.yaml @@ -0,0 +1,6 @@ +--- +fixes: + - | + Changes the duration for which redfish exporter must continually fail + scrapes before triggering an alert to 15 minutes. This should hopefully + reduce some alert spam. diff --git a/releasenotes/notes/update-blazar-image-d176c27d55716469.yaml b/releasenotes/notes/update-blazar-image-d176c27d55716469.yaml new file mode 100644 index 000000000..7e53b3543 --- /dev/null +++ b/releasenotes/notes/update-blazar-image-d176c27d55716469.yaml @@ -0,0 +1,5 @@ +--- +features: + - | + Use the StackHPC fork for building Blazar images with customizations to support + flavor-based reservation.