From 08efecb487dc974ca6a6f0b5df093c1ea7eec3c1 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 19 Jun 2023 22:50:15 -0700 Subject: [PATCH 01/13] Initial draft for Voltage monitor --- doc/pmon/pmon-thermalctld-vmon.md | 120 ++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 doc/pmon/pmon-thermalctld-vmon.md diff --git a/doc/pmon/pmon-thermalctld-vmon.md b/doc/pmon/pmon-thermalctld-vmon.md new file mode 100644 index 0000000000..ad97508933 --- /dev/null +++ b/doc/pmon/pmon-thermalctld-vmon.md @@ -0,0 +1,120 @@ +# SONiC PMON Voltage Monitoring Enhancement # + + +### Revision 1.0 + +## Table of Content +1. [Scope](#Scope) +2. [Definitions](#Definitions/Abbreviations) +3. [Overview](#Overview) +3. [Requirements](#Requirements) +4. [High-Level Design](#High-Level Design) +5. [CLI](#CLI/YANG model Enhancements) +7. [Test](#Testing Requirements/Design) + + + +### Scope + +This document covers the support for monitoring voltage sensor devices in SONiC. + +### Definitions/Abbreviations + +PMON - Platform Monitor container in SONiC. + +PSU - Power Supply Unit + +Voltage Sensor - Sensor device which can report a voltage measurement in the system. + +### Overview + +Modern hardware systems have a number of sensor and control devices. Voltage and current sensor devices can measure and in some cases control the voltages on the boards. Many systems have a number of such devices to distribute power across the different parts of the system including motherboard, daughterboards etc. These devices should be monitored to alert the operator about any failures that might affect the operation of the system. This document provides an overview of how the voltage sensors can be modeled and monitored in SONiC. 
+ + +### Requirements + +This HLD covers + +* Discovery of voltage sensor devices in the system +* Monitoring the voltage sensor devices periodically and update that data in Redis DB +* Raise/Clear alarms if the voltage sensor devices indicate readings which are unexpected and clear them if they return to a good state. + +This HLD does not cover + +* An automated recovery action that system might take as a result of a fault reported by the voltage sensor device. A network management system may process the alarms and take recvoery action as it sees fit. + +### High-Level Design + +The proposal for monitoring voltage sensor devices is to enhance the PMON ThermalCtld Daemon functionality. ThermalCtld monitors the temperature sensors and uses that information to control fan speed in the system. For this purpose ThermalCtld discovers the temperature sensor devices and periodically polls them to collect their information. This mechanism can be extended to monitoring the voltage sensor devices as well. + +ThermalCtld will discover the voltage sensor devices in the system on bootup and update their information in StateDB. Subsequently it will periodically poll the voltage sensor devices and publish the readings in RedisDB. If the sensor device readings cross the minor/major/critical thresholds, syslogs will be generated to indicate to the operator about the alarm condition. If the sensor reports normal data in a subsequent poll, another syslog will be generated to indicate that the fault condition is cleared. + +CLI is provided to display the voltage sensor devices, their current measurements and threshold values. + +Platform APIs will provide + +* List of voltage sensors devices in the system +* Way to read the voltage sensor devices + +The following SONiC repositories will have changes + +####sonic-platform-daemons + +Thermaltcld script will retrieve voltage sensors data from the platform on coming up and poll for refreshing the data periodically. 
+ +####sonic-platform-common + +Chassis Base class will be enhanced with prototype methods for retrieving number of voltage sensors and voltage sensor objects. + +Module base class will also be enhanced with similar methods for retrieving voltage sensors present on the modules. + +A new base class - VsensorBase - is introduced for voltage sensor objects. The class will have methods to retrieve threshold information, sensor value and min/max recorded values from the sensor. + +####sonic-utilities + +CLI is introduced to retrieve and display voltage sensor data from State DB. + + +### CLI/YANG model Enhancements + +Following CLI is introduced to display the Voltage Sensor devices. + + root@sonic:/home/cisco# show platform voltage + Sensor Voltage(mV) High TH Low TH Crit High TH Crit Low TH Warning Timestamp + ---------------- ------------- --------- -------- -------------- ------------- --------- ----------------- + VP0P75_CORE_NPU0 750 852 684 872 664 False 20230204 11:35:21 + VP0P75_CORE_NPU1 750 852 684 872 664 False 20230204 11:35:21 + VP0P75_CORE_NPU2 750 852 684 872 664 False 20230204 11:35:22 + ... + + +#### Configuration and Management + +At this point, there is no configuration requirement for this feature. If the platform does not specify any voltage sensors in the thermal_zone.yaml file, there will be no voltage sensor information reported. + +It is advised to monitor the voltage sensor alarms and use that to debug and identify any issues in the system. In the event, a voltage sensor crosses a high or low threshold, syslogs will be raised indicating the alarm. + + Feb 4 09:11:24.669278 sonic WARNING pmon#thermalctld: High voltage warning: VP0P75_CORE_NPU3 current voltage 750C, high threshold 720C + +The alarm condition will also be visible in the CLI ouput. 
+ + e.g + root@sonic:/home/cisco# show platform voltage + Sensor Voltage(mV) High TH Low TH Crit High TH Crit Low TH Warning Timestamp + ---------------- ------------- --------- -------- -------------- ------------- --------- ----------------- + VP0P75_CORE_NPU3 750 720 684 720 664 True 20230204 11:35:22 + + +#### Warmboot and Fastboot Design Impact + +Warmboot and Fastboot should not be impacted by this feature. On PMON container restart, the voltage monitoring should restart the same way as on boot. + +### Testing Requirements/Design + +Unit test cases cover the CLI and voltage monitoring aspect. + +Voltage threshold crossing can be simulated by adjusting the thresholds in thermal_zone.yaml file. This can be used to check syslog and alarm indications. + +#### Unit Test cases + +TBD \ No newline at end of file From edd84d3f3d1e87fc426994fb5a5d8422ed5fcc45 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 19 Jun 2023 22:58:37 -0700 Subject: [PATCH 02/13] Fixed a link. --- doc/pmon/pmon-thermalctld-vmon.md | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/pmon/pmon-thermalctld-vmon.md b/doc/pmon/pmon-thermalctld-vmon.md index ad97508933..d31e9ab374 100644 --- a/doc/pmon/pmon-thermalctld-vmon.md +++ b/doc/pmon/pmon-thermalctld-vmon.md @@ -43,6 +43,7 @@ This HLD does not cover * An automated recovery action that system might take as a result of a fault reported by the voltage sensor device. A network management system may process the alarms and take recvoery action as it sees fit. + ### High-Level Design The proposal for monitoring voltage sensor devices is to enhance the PMON ThermalCtld Daemon functionality. ThermalCtld monitors the temperature sensors and uses that information to control fan speed in the system. For this purpose ThermalCtld discovers the temperature sensor devices and periodically polls them to collect their information. This mechanism can be extended to monitoring the voltage sensor devices as well. 
From 83014e44cdcd50d5d2e891a96bcc174c8d8fcc3a Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 19 Jun 2023 23:22:56 -0700 Subject: [PATCH 03/13] Fix more links --- doc/pmon/pmon-thermalctld-vmon.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/pmon/pmon-thermalctld-vmon.md b/doc/pmon/pmon-thermalctld-vmon.md index d31e9ab374..e9ca39ea8a 100644 --- a/doc/pmon/pmon-thermalctld-vmon.md +++ b/doc/pmon/pmon-thermalctld-vmon.md @@ -8,9 +8,9 @@ 2. [Definitions](#Definitions/Abbreviations) 3. [Overview](#Overview) 3. [Requirements](#Requirements) -4. [High-Level Design](#High-Level Design) -5. [CLI](#CLI/YANG model Enhancements) -7. [Test](#Testing Requirements/Design) +4. [High Level Design](#High-Level-Design) +5. [CLI](#CLI/YANG-model-Enhancements) +7. [Test](#Testing-Requirements/Design) @@ -44,7 +44,7 @@ This HLD does not cover * An automated recovery action that system might take as a result of a fault reported by the voltage sensor device. A network management system may process the alarms and take recvoery action as it sees fit. -### High-Level Design +### High Level Design The proposal for monitoring voltage sensor devices is to enhance the PMON ThermalCtld Daemon functionality. ThermalCtld monitors the temperature sensors and uses that information to control fan speed in the system. For this purpose ThermalCtld discovers the temperature sensor devices and periodically polls them to collect their information. This mechanism can be extended to monitoring the voltage sensor devices as well. 
From 3a586ce5ee214f1db887416d68dd9cc4fb4e4540 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 19 Jun 2023 23:32:40 -0700 Subject: [PATCH 04/13] Fix formatting --- doc/pmon/pmon-thermalctld-vmon.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/pmon/pmon-thermalctld-vmon.md b/doc/pmon/pmon-thermalctld-vmon.md index e9ca39ea8a..6956c7faf2 100644 --- a/doc/pmon/pmon-thermalctld-vmon.md +++ b/doc/pmon/pmon-thermalctld-vmon.md @@ -9,8 +9,8 @@ 3. [Overview](#Overview) 3. [Requirements](#Requirements) 4. [High Level Design](#High-Level-Design) -5. [CLI](#CLI/YANG-model-Enhancements) -7. [Test](#Testing-Requirements/Design) +5. [CLI](#CLI-Enhancements) +7. [Test](#Testing-Considerations) @@ -59,11 +59,11 @@ Platform APIs will provide The following SONiC repositories will have changes -####sonic-platform-daemons +#### sonic-platform-daemons Thermaltcld script will retrieve voltage sensors data from the platform on coming up and poll for refreshing the data periodically. -####sonic-platform-common +#### sonic-platform-common Chassis Base class will be enhanced with prototype methods for retrieving number of voltage sensors and voltage sensor objects. @@ -71,12 +71,12 @@ Module base class will also be enhanced with similar methods for retrieving volt A new base class - VsensorBase - is introduced for voltage sensor objects. The class will have methods to retrieve threshold information, sensor value and min/max recorded values from the sensor. -####sonic-utilities +#### sonic-utilities CLI is introduced to retrieve and display voltage sensor data from State DB. -### CLI/YANG model Enhancements +### CLI Enhancements Following CLI is introduced to display the Voltage Sensor devices. @@ -110,7 +110,7 @@ The alarm condition will also be visible in the CLI ouput. Warmboot and Fastboot should not be impacted by this feature. On PMON container restart, the voltage monitoring should restart the same way as on boot. 
-### Testing Requirements/Design +### Testing Considerations Unit test cases cover the CLI and voltage monitoring aspect. From 781898784d915ebaf284b5b01074d47d12761580 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Tue, 27 Jun 2023 12:29:24 -0700 Subject: [PATCH 05/13] Added SensorMon Changes --- doc/pmon/pmon-thermalctld-vmon.md | 70 +++++++++++++++++++++---------- 1 file changed, 47 insertions(+), 23 deletions(-) diff --git a/doc/pmon/pmon-thermalctld-vmon.md b/doc/pmon/pmon-thermalctld-vmon.md index 6956c7faf2..7ae9112aae 100644 --- a/doc/pmon/pmon-thermalctld-vmon.md +++ b/doc/pmon/pmon-thermalctld-vmon.md @@ -1,4 +1,4 @@ -# SONiC PMON Voltage Monitoring Enhancement # +# SONiC PMON Sensor Monitoring Enhancement # ### Revision 1.0 @@ -16,7 +16,7 @@ ### Scope -This document covers the support for monitoring voltage sensor devices in SONiC. +This document covers the support for monitoring voltage and current sensor devices in SONiC. ### Definitions/Abbreviations @@ -26,18 +26,26 @@ PSU - Power Supply Unit Voltage Sensor - Sensor device which can report a voltage measurement in the system. +Current Sensor - Sensor device which can report current measurement in the system. + +Altitude Sensor - Sensor device which can report the altitude of the system. + + ### Overview -Modern hardware systems have a number of sensor and control devices. Voltage and current sensor devices can measure and in some cases control the voltages on the boards. Many systems have a number of such devices to distribute power across the different parts of the system including motherboard, daughterboards etc. These devices should be monitored to alert the operator about any failures that might affect the operation of the system. This document provides an overview of how the voltage sensors can be modeled and monitored in SONiC. +Modern hardware systems have many different types of sensors and control devices. 
Voltage sensor devices can measure and in some cases control the voltages on the boards. Current sensor devices can measure current. It is also possible to have other types of sensors such as Altitude Sensors etc. These devices can report measurements from different parts of the system which are useful for monitoring system health. For example, voltage controller devices distribute power across different parts of the system such as motherboard, daughterboards etc. and can report voltage measurements from there. Often these devices can report under-voltage/over-voltage faults which should be monitored to alert the operator about any power related failures in the system. This document provides an overview for monitoring voltage and current sensors in SONiC. The solution proposed in this document can be enhanced for other types of sensors as well. + +Note that temperature sensor devices are managed via SONiC ThermalCtlD daemon today. At this pointIt is possible to move the temperature sensors monitoring to the proposed model here while keeping the Fan control algorithm in ThermalCtlD. However that is not discussed in this document and might be taken up in future work. ### Requirements This HLD covers -* Discovery of voltage sensor devices in the system -* Monitoring the voltage sensor devices periodically and update that data in Redis DB -* Raise/Clear alarms if the voltage sensor devices indicate readings which are unexpected and clear them if they return to a good state. +* Discovery of voltage and current sensor devices in the system +* Monitoring the sensor devices periodically and update that data in Redis DB +* Raise/Clear alarms if the sensor devices indicate readings which are unexpected and clear them if they return to a good state. 
+* A framework for adding new sensor types in future This HLD does not cover @@ -46,39 +54,45 @@ This HLD does not cover ### High Level Design -The proposal for monitoring voltage sensor devices is to enhance the PMON ThermalCtld Daemon functionality. ThermalCtld monitors the temperature sensors and uses that information to control fan speed in the system. For this purpose ThermalCtld discovers the temperature sensor devices and periodically polls them to collect their information. This mechanism can be extended to monitoring the voltage sensor devices as well. +The proposal for monitoring sensor devices is to create a new SensorMon daemon. SensorMon will use API provided by platform to discover the sensor devices. It will periodically poll the devices to refresh the sensor information and update the readings in StateDB. -ThermalCtld will discover the voltage sensor devices in the system on bootup and update their information in StateDB. Subsequently it will periodically poll the voltage sensor devices and publish the readings in RedisDB. If the sensor device readings cross the minor/major/critical thresholds, syslogs will be generated to indicate to the operator about the alarm condition. If the sensor reports normal data in a subsequent poll, another syslog will be generated to indicate that the fault condition is cleared. +If the sensor device readings cross the minor/major/critical thresholds, syslogs will be generated to indicate to the operator about the alarm condition. If the sensor reports normal data in a subsequent poll, another syslog will be generated to indicate that the fault condition is cleared. -CLI is provided to display the voltage sensor devices, their current measurements and threshold values. +CLIs are provided to display the sensor devices, their current measurements, threshold values and if they are reporting an alarm. 
Platform APIs will provide -* List of voltage sensors devices in the system -* Way to read the voltage sensor devices +* List of sensor devices of a specific type in the system +* Way to read the sensor information The following SONiC repositories will have changes #### sonic-platform-daemons -Thermaltcld script will retrieve voltage sensors data from the platform on coming up and poll for refreshing the data periodically. +SensorMon will retrieve a list of sensors of different sensor types from the platform during initialization. Subsequently it will poll the sensor readings on a periodic basis and update the data in StateDb. + #### sonic-platform-common -Chassis Base class will be enhanced with prototype methods for retrieving number of voltage sensors and voltage sensor objects. +Chassis Base class will be enhanced with prototype methods for retrieving number of sensors and sensor objects of a specific type. + +Module base class will also be enhanced with similar methods for retrieving sensors present on the modules. -Module base class will also be enhanced with similar methods for retrieving voltage sensors present on the modules. +New base classes will be introduced for new sensor types. -A new base class - VsensorBase - is introduced for voltage sensor objects. The class will have methods to retrieve threshold information, sensor value and min/max recorded values from the sensor. +VsensorBase is introduced for voltage sensor objects. +IsensorBase is introduced for current sensor objects. + +The classes will have methods to retrieve threshold information, sensor value and min/max recorded values from the sensor. #### sonic-utilities -CLI is introduced to retrieve and display voltage sensor data from State DB. +CLIs are introduced to retrieve and display sensor data from State DB for different sensor types. CLIs are described in the next section. ### CLI Enhancements -Following CLI is introduced to display the Voltage Sensor devices. 
+Following CLI is introduced to display the Voltage and Current Sensor devices. root@sonic:/home/cisco# show platform voltage Sensor Voltage(mV) High TH Low TH Crit High TH Crit Low TH Warning Timestamp @@ -88,14 +102,25 @@ Following CLI is introduced to display the Voltage Sensor devices. VP0P75_CORE_NPU2 750 852 684 872 664 False 20230204 11:35:22 ... + root@sonic:/home/cisco# show platform current + Sensor Current(mA) High TH Low TH Crit High TH Crit Low TH Warning Timestamp + -------------- ------------- --------- -------- -------------- ------------- --------- ----------------- + POL_CORE_N0_I0 25000 30000 18000 28000 15000 False 20230212 11:18:28 + POL_CORE_N0_I1 21562 30000 18000 28000 15000 False 20230212 11:18:28 + POL_CORE_N0_I2 22250 30000 18000 28000 15000 False 20230212 11:18:28 + + + #### Configuration and Management -At this point, there is no configuration requirement for this feature. If the platform does not specify any voltage sensors in the thermal_zone.yaml file, there will be no voltage sensor information reported. +At this point, there is no configuration requirement for this feature. + +If the SensorMon daemon is not desired to be run in the system, an entry can be added to pmon_daemon_control.json to exclude it from running in the system. -It is advised to monitor the voltage sensor alarms and use that to debug and identify any issues in the system. In the event, a voltage sensor crosses a high or low threshold, syslogs will be raised indicating the alarm. +It is advised to monitor the sensor alarms and use that to debug and identify any issues in the system. In the event, a sensor crosses a high or low threshold, syslogs will be raised indicating the alarm. 
- Feb 4 09:11:24.669278 sonic WARNING pmon#thermalctld: High voltage warning: VP0P75_CORE_NPU3 current voltage 750C, high threshold 720C + Feb 4 09:11:24.669278 sonic WARNING pmon#sensormon: High voltage warning: VP0P75_CORE_NPU3 current voltage 750C, high threshold 720C The alarm condition will also be visible in the CLI ouput. @@ -108,13 +133,12 @@ The alarm condition will also be visible in the CLI ouput. #### Warmboot and Fastboot Design Impact -Warmboot and Fastboot should not be impacted by this feature. On PMON container restart, the voltage monitoring should restart the same way as on boot. +Warmboot and Fastboot should not be impacted by this feature. On PMON container restart, the sensor monitoring should restart the same way as on boot. ### Testing Considerations -Unit test cases cover the CLI and voltage monitoring aspect. +Unit test cases cover the CLI and sensor monitoring aspect. -Voltage threshold crossing can be simulated by adjusting the thresholds in thermal_zone.yaml file. This can be used to check syslog and alarm indications. #### Unit Test cases From 0f98855596acbbee3f29d4ea1434b2e686592704 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Tue, 27 Jun 2023 12:30:55 -0700 Subject: [PATCH 06/13] Changed daemon name --- doc/pmon/{pmon-thermalctld-vmon.md => pmon-sensormon.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename doc/pmon/{pmon-thermalctld-vmon.md => pmon-sensormon.md} (100%) diff --git a/doc/pmon/pmon-thermalctld-vmon.md b/doc/pmon/pmon-sensormon.md similarity index 100% rename from doc/pmon/pmon-thermalctld-vmon.md rename to doc/pmon/pmon-sensormon.md From 2991096bf753d22f771183426e6cef26fcf09c62 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Tue, 27 Jun 2023 22:02:51 -0700 Subject: [PATCH 07/13] Added internal review comments. 
--- doc/pmon/pmon-sensormon.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md index 7ae9112aae..dd21d43ca0 100644 --- a/doc/pmon/pmon-sensormon.md +++ b/doc/pmon/pmon-sensormon.md @@ -35,7 +35,7 @@ Altitude Sensor - Sensor device which can report the altitude of the system. Modern hardware systems have many different types of sensors and control devices. Voltage sensor devices can measure and in some cases control the voltages on the boards. Current sensor devices can measure current. It is also possible to have other types of sensors such as Altitude Sensors etc. These devices can report measurements from different parts of the system which are useful for monitoring system health. For example, voltage controller devices distribute power across different parts of the system such as motherboard, daughterboards etc. and can report voltage measurements from there. Often these devices can report under-voltage/over-voltage faults which should be monitored to alert the operator about any power related failures in the system. This document provides an overview for monitoring voltage and current sensors in SONiC. The solution proposed in this document can be enhanced for other types of sensors as well. -Note that temperature sensor devices are managed via SONiC ThermalCtlD daemon today. At this pointIt is possible to move the temperature sensors monitoring to the proposed model here while keeping the Fan control algorithm in ThermalCtlD. However that is not discussed in this document and might be taken up in future work. +Note that temperature sensor devices are managed via SONiC ThermalCtlD daemon today. At this point there is no change proposed for ThermalCtlD. This proposed design can be used for voltage, current and other types of sensors. ### Requirements @@ -58,7 +58,7 @@ The proposal for monitoring sensor devices is to create a new SensorMon daemon. 
If the sensor device readings cross the minor/major/critical thresholds, syslogs will be generated to indicate to the operator about the alarm condition. If the sensor reports normal data in a subsequent poll, another syslog will be generated to indicate that the fault condition is cleared. -CLIs are provided to display the sensor devices, their current measurements, threshold values and if they are reporting an alarm. +CLIs are provided to display the sensor devices, their measurements, threshold values and if they are reporting an alarm. Platform APIs will provide @@ -69,7 +69,7 @@ The following SONiC repositories will have changes #### sonic-platform-daemons -SensorMon will retrieve a list of sensors of different sensor types from the platform during initialization. Subsequently it will poll the sensor readings on a periodic basis and update the data in StateDb. +SensorMon will be a new daemon that will run in PMON container. It will retrieve a list of sensors of different sensor types from the platform during initialization. Subsequently, it will poll the sensor devices on a periodic basis and update their measurments in StateDb. SemsorMon will also raise syslogs on alarm conditions. #### sonic-platform-common From 44c46e62e22fe77cabc14fddb525c7102492705a Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 17 Jul 2023 16:35:38 -0700 Subject: [PATCH 08/13] Addressed review comment. --- doc/pmon/pmon-sensormon.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md index dd21d43ca0..7f73ce353d 100644 --- a/doc/pmon/pmon-sensormon.md +++ b/doc/pmon/pmon-sensormon.md @@ -37,6 +37,14 @@ Modern hardware systems have many different types of sensors and control devices Note that temperature sensor devices are managed via SONiC ThermalCtlD daemon today. At this point there is no change proposed for ThermalCtlD. This proposed design can be used for voltage, current and other types of sensors. 
+Linux does provide some support for voltage and current sensor monitoring using the lmsensors/hwmon infrastructure. However, there are a few limitations with that: + +- Devices not supported by hwmon are not covered +- Simple devices which do not have built-in monitoring functions do not generate any alarms +- Platform specific thresholds for monitoring are not available + +The solution proposed in this document tries to address these limitations by extending the coverage to a larger set of devices and providing platform specific thresholds for sensor monitoring. + ### Requirements From 8a971fdd17b88bd4d72bb90eb15f13b93fd6753d Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Fri, 8 Sep 2023 12:21:39 -0700 Subject: [PATCH 09/13] Review comments --- doc/pmon/pmon-sensormon.md | 27 ++++++++++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md index 7f73ce353d..a3ed003a46 100644 --- a/doc/pmon/pmon-sensormon.md +++ b/doc/pmon/pmon-sensormon.md @@ -53,6 +53,8 @@ This HLD covers * Discovery of voltage and current sensor devices in the system * Monitoring the sensor devices periodically and update that data in Redis DB * Raise/Clear alarms if the sensor devices indicate readings which are unexpected and clear them if they return to a good state. +* Report sensor alarm conditions in system health +* Enable such sensors in Entity MIB * A framework for adding new sensor types in future This HLD does not cover @@ -97,6 +99,13 @@ The classes will have methods to retrieve threshold information, sensor value an CLIs are introduced to retrieve and display sensor data from State DB for different sensor types. CLIs are described in the next section. +#### sonic-buildimage + +The CLI "show system-health" should report sensor fault conditions. Hardware health check script will need enhancement to retrieve sensor data from StateDB. 
+ +#### sonic-snmpagent + +Voltage and Current sensors should be available in Entity MIB. Entity MIB implementation will need an enhancement to retrieve voltage and current sensors from the state DB. ### CLI Enhancements @@ -128,16 +137,28 @@ If the SensorMon daemon is not desired to be run in the system, an entry can be It is advised to monitor the sensor alarms and use that to debug and identify any issues in the system. In the event, a sensor crosses a high or low threshold, syslogs will be raised indicating the alarm. - Feb 4 09:11:24.669278 sonic WARNING pmon#sensormon: High voltage warning: VP0P75_CORE_NPU3 current voltage 750C, high threshold 720C + Jul 27 08:26:32.561330 sonic WARNING pmon#sensormond: High voltage warning: VP0P75_CORE_NPU2 current voltage 880mV, high threshold 856mV -The alarm condition will also be visible in the CLI ouput. +The alarm condition will be visible in the CLI ouputs for sensor data and system health. e.g root@sonic:/home/cisco# show platform voltage Sensor Voltage(mV) High TH Low TH Crit High TH Crit Low TH Warning Timestamp ---------------- ------------- --------- -------- -------------- ------------- --------- ----------------- - VP0P75_CORE_NPU3 750 720 684 720 664 True 20230204 11:35:22 + VP0P75_CORE_NPU2 880 720 684 720 664 True 20230204 11:35:22 + root@sonic:/home/cisco# show system-health detail + System status summary + ... + Hardware: + Status: Not OK + Reasons: Voltage sensor VP0P75_CORE_NPU2 measurement 880 mV out of range (679,856) + ... + VP0P75_CORE_NPU2 Not OK voltage + ... 
+ + + #### Warmboot and Fastboot Design Impact From 2a0353a72ad3e2fb1aee8fd5a2104c4326901291 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 11 Sep 2023 09:52:18 -0700 Subject: [PATCH 10/13] Addressed review comment --- doc/pmon/pmon-sensormon.md | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md index a3ed003a46..63aa0ebe14 100644 --- a/doc/pmon/pmon-sensormon.md +++ b/doc/pmon/pmon-sensormon.md @@ -166,9 +166,16 @@ Warmboot and Fastboot should not be impacted by this feature. On PMON container ### Testing Considerations -Unit test cases cover the CLI and sensor monitoring aspect. +Unit test cases cover the CLI and sensor monitoring aspects. All SONiC common repos will have unit tests and meet code coverage requirements for the respective repos. In addition SONiC management tests will cover the feature on the target. +### Feature availability -#### Unit Test cases +The core implementation for the daemon process is available at this time along with the HLD. This includes changes in the following repos -TBD \ No newline at end of file +* Sonic-platform-daemons +* sonic-platform-common +* sonic-utilities +* sonic-buildimage + + +SNMP, System Health and SONiC management test cases will be available in the next phase of development. From 8420d6ff09023081fc7fa468b68a23445d00a9c1 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Fri, 27 Oct 2023 12:28:21 -0700 Subject: [PATCH 11/13] Addressed review comments --- doc/pmon/pmon-sensormon.md | 66 ++++++++++++++++++++++++++++++-------- 1 file changed, 52 insertions(+), 14 deletions(-) diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md index 63aa0ebe14..240154f196 100644 --- a/doc/pmon/pmon-sensormon.md +++ b/doc/pmon/pmon-sensormon.md @@ -79,9 +79,43 @@ The following SONiC repositories will have changes #### sonic-platform-daemons -SensorMon will be a new daemon that will run in PMON container. 
It will retrieve a list of sensors of different sensor types from the platform during initialization. Subsequently, it will poll the sensor devices on a periodic basis and update their measurments in StateDb. SemsorMon will also raise syslogs on alarm conditions. - +SensorMon will be a new daemon that will run in PMON container. It will retrieve a list of sensors of different sensor types from the platform during initialization. Subsequently, it will poll the sensor devices on a periodic basis and update their measurements in StateDb. SensorMon will also raise syslogs on alarm conditions. + +Following is the DB schema for voltage and current sensor data. + +##### Voltage Sensor StateDb Schema + + ; Defines information for a voltage sensor + key = VOLTAGE_INFO|sensor_name ; Voltage sensor name + ; field = value + voltage = float ; Voltage measurement + unit = string ; Unit for the measurement + high_threshold = float ; High threshold + low_threshold = float ; Low threshold + critical_high_threshold = float ; Critical high threshold + critical_low_threshold = float ; Critical low threshold + warning_status = boolean ; True if sensor value is out of range + timestamp = string ; Last update time + maximum_voltage = float ; Maximum recorded measurement + minimum_voltage = float ; Minimum recorded measurement + +##### Current Sensor StateDb Schema + + ; Defines information for a current sensor + key = CURRENT_INFO|sensor_name ; Current sensor name + ; field = value + current = float ; Current measurement + unit = string ; Unit for the measurement + high_threshold = float ; High threshold + low_threshold = float ; Low threshold + critical_high_threshold = float ; Critical high threshold + critical_low_threshold = float ; Critical low threshold + warning_status = boolean ; True if sensor value is out of range + timestamp = string ; Last update time + maximum_current = float ; Maximum recorded measurement + minimum_current = float ; Minimum recorded measurement + #### sonic-platform-common Chassis Base 
class will be enhanced with prototype methods for retrieving number of sensors and sensor objects of a specific type. @@ -112,19 +146,20 @@ Voltage and Current sensors should be available in Entity MIB. Entity MIB implem Following CLI is introduced to display the Voltage and Current Sensor devices. root@sonic:/home/cisco# show platform voltage - Sensor Voltage(mV) High TH Low TH Crit High TH Crit Low TH Warning Timestamp + Sensor Voltage High TH Low TH Crit High TH Crit Low TH Warning Timestamp ---------------- ------------- --------- -------- -------------- ------------- --------- ----------------- - VP0P75_CORE_NPU0 750 852 684 872 664 False 20230204 11:35:21 - VP0P75_CORE_NPU1 750 852 684 872 664 False 20230204 11:35:21 - VP0P75_CORE_NPU2 750 852 684 872 664 False 20230204 11:35:22 + VP0P75_CORE_NPU0 750 mV 852 684 872 664 False 20230204 11:35:21 + VP0P75_CORE_NPU1 750 mV 852 684 872 664 False 20230204 11:35:21 + VP0P75_CORE_NPU2 750 mV 852 684 872 664 False 20230204 11:35:22 + ... root@sonic:/home/cisco# show platform current - Sensor Current(mA) High TH Low TH Crit High TH Crit Low TH Warning Timestamp + Sensor Current High TH Low TH Crit High TH Crit Low TH Warning Timestamp -------------- ------------- --------- -------- -------------- ------------- --------- ----------------- - POL_CORE_N0_I0 25000 30000 18000 28000 15000 False 20230212 11:18:28 - POL_CORE_N0_I1 21562 30000 18000 28000 15000 False 20230212 11:18:28 - POL_CORE_N0_I2 22250 30000 18000 28000 15000 False 20230212 11:18:28 + POL_CORE_N0_I0 25000 mA 30000 18000 28000 15000 False 20230212 11:18:28 + POL_CORE_N0_I1 21562 mA 30000 18000 28000 15000 False 20230212 11:18:28 + POL_CORE_N0_I2 22250 mA 30000 18000 28000 15000 False 20230212 11:18:28 @@ -143,9 +178,9 @@ The alarm condition will be visible in the CLI ouputs for sensor data and system e.g root@sonic:/home/cisco# show platform voltage - Sensor Voltage(mV) High TH Low TH Crit High TH Crit Low TH Warning Timestamp + Sensor Voltage High TH Low 
TH Crit High TH Crit Low TH Warning Timestamp ---------------- ------------- --------- -------- -------------- ------------- --------- ----------------- - VP0P75_CORE_NPU2 880 720 684 720 664 True 20230204 11:35:22 + VP0P75_CORE_NPU2 880 mV 720 684 720 664 True 20230204 11:35:22 root@sonic:/home/cisco# show system-health detail System status summary @@ -158,7 +193,10 @@ The alarm condition will be visible in the CLI ouputs for sensor data and system ... +##### PDDF Support +SONiC PDDF provides a data driven framework to access platform HW devices. PDDF allows for sensor access information to be read from platform specific json files. PDDF support can be added for voltage and current sensors which can be retrieved by Sensormon. + #### Warmboot and Fastboot Design Impact @@ -176,6 +214,6 @@ The core implementation for the daemon process is available at this time along w * sonic-platform-common * sonic-utilities * sonic-buildimage +* sonic-snmpagent - -SNMP, System Health and SONiC management test cases will be available in the next phase of development. +SONiC management test cases and PDDF support will be available in the next phase of development. From a3390e0f70ba6a203ad01387a0ac459152708b07 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 11 Dec 2023 21:40:01 -0800 Subject: [PATCH 12/13] Added support for file based sensor config. --- doc/pmon/pmon-sensormon.md | 39 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md index 240154f196..020739d39e 100644 --- a/doc/pmon/pmon-sensormon.md +++ b/doc/pmon/pmon-sensormon.md @@ -192,7 +192,46 @@ The alarm condition will be visible in the CLI ouputs for sensor data and system VP0P75_CORE_NPU2 Not OK voltage ... + +#####Platform Sensors Configuration + +Sensormond will use the platform APIs for retrieving platform sensor information. 
However, for platforms with only file-system/sysfs based drivers, a simple implementation is provided wherein the platform can specify the sensor information for the board and any submodules (such as fabric cards) in a data file and Sensormond can use that for finding sensors and monitoring them. + +The file system/Sysfs based platform sensor information can be provided using a yaml file. The yaml file shall have the following format. + + + sensors.yaml + voltage_sensors: + - name : + sensor: + high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + + current_sensors: + - name : + sensor: + high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + + : + voltage_sensors: + - name: + sensor: +  high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + + current_sensors: + - name: + sensor: +  high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + + ##### PDDF Support SONiC PDDF provides a data driven framework to access platform HW devices. PDDF allows for sensor access information to be read from platform specific json files. PDDF support can be added for voltage and current sensors which can be retrieved by Sensormon. From 3ed3cca2866e3647b8af9871b33dc53b5dbae1a6 Mon Sep 17 00:00:00 2001 From: Mridul Bajpai Date: Mon, 11 Dec 2023 22:01:23 -0800 Subject: [PATCH 13/13] Fixed formatting --- doc/pmon/pmon-sensormon.md | 58 ++++++++++++++++++++------------------ 1 file changed, 30 insertions(+), 28 deletions(-) diff --git a/doc/pmon/pmon-sensormon.md b/doc/pmon/pmon-sensormon.md index 020739d39e..31ec5c8be5 100644 --- a/doc/pmon/pmon-sensormon.md +++ b/doc/pmon/pmon-sensormon.md @@ -193,7 +193,7 @@ The alarm condition will be visible in the CLI ouputs for sensor data and system ... -#####Platform Sensors Configuration +##### Platform Sensors Configuration Sensormond will use the platform APIs for retrieving platform sensor information. 
However, for platforms with only file-system/sysfs based drivers, a simple implementation is provided wherein the platform can specify the sensor information for the board and any submodules (such as fabric cards) in a data file and Sensormond can use that for finding sensors and monitoring them. @@ -201,35 +201,37 @@ The file system/Sysfs based platform sensor information can be provided using a sensors.yaml - - voltage_sensors: - - name : - sensor: - high_thresholds: [ , , ] - low_thresholds: [ , , ] - ... - + + voltage_sensors: + - name : + sensor: + high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + current_sensors: - - name : - sensor: - high_thresholds: [ , , ] - low_thresholds: [ , , ] - ... - + - name : + sensor: + high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + : - voltage_sensors: - - name: - sensor: -  high_thresholds: [ , , ] - low_thresholds: [ , , ] - ... - - current_sensors: - - name: - sensor: -  high_thresholds: [ , , ] - low_thresholds: [ , , ] - ... + voltage_sensors: + - name: + sensor: + high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + + current_sensors: + - name: + sensor: + high_thresholds: [ , , ] + low_thresholds: [ , , ] + ... + + ##### PDDF Support