Skip to content

Commit

Permalink
[gateway] Report CPU Tctl as a dimensionless metric (#6656)
Browse files Browse the repository at this point in the history
Currently, the AMD SP3 host CPU's T<sub>ctl</sub> value is reported to
Oximeter as an instance of the `hardware_component:temperature` metric
with `sensor = "CPU"`. This metric's unit is in degrees Celsius.

This is incorrect. T<sub>ctl</sub> is not a physical measurement from a
temperature sensor, but an internal parameter of the CPU's thermal
control loop in synthetic dimensionless units that range from 0-100. For
details, refer to [this comment][1] and oxidecomputer/stlouis#5.

This branch adds a new `hardware_component:amd_cpu_tctl` metric. Unlike
the `hardware_component:temperature` metric, this metric has no unit, as
it is a dimensionless value. The SP sensor metrics task in MGS has been
changed to special-case temperature measurements where the sensor name
is "CPU" and the device kind is "sbtsi" so that they are reported using
the `hardware_componentamd_:cpu_tctl` metric rather than the
`hardware_component:temperature` metric.

In the future, I think a more ideologically correct solution to this
would be to add a variant to the
`gateway_messages::measurement::MeasurementKind` enum to represent
dimensionless measurements (or, perhaps, specifically for
T<sub>ctl</sub>?), and change Hubris to report the SB-TSI
T<sub>ctl</sub> using that instead. However, this looks like it would
probably be a somewhat more complex change in Hubris, as the thermal
control loop would still need to consider that measurement. This change
introduces the new metric type and fixes the problem of us reporting
incorrectly-labeled metrics, so I think it's worth handling it this way
and then going back and making the Hubris change later.

Fixes #6634

[1]: https://github.com/illumos/illumos-gate/blob/fabe1c1dce7b8e0bda70c23798d91f63c2a5a2c5/usr/src/uts/intel/io/amdzen/smntemp.c#L21-L57
  • Loading branch information
hawkw authored Sep 25, 2024
1 parent 68583ad commit dbbb77b
Show file tree
Hide file tree
Showing 7 changed files with 150 additions and 4 deletions.
2 changes: 2 additions & 0 deletions dev-tools/omdb/tests/successes.out
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,7 @@ SP DETAILS: type "Sled" slot 0
dev-7 FAKE U.2 Sharkfin A hot swap controller max5970 Present None
dev-8 FAKE U.2 A NVMe Basic Management Command nvme_bmc Present None
dev-39 FAKE T6 temperature sensor tmp451 Present None
dev-46 CPU temperature sensor sbtsi Present None
dev-53 FAKE Fan controller max31790 Present None

CABOOSES: none found
Expand Down Expand Up @@ -183,6 +184,7 @@ SP DETAILS: type "Sled" slot 1
dev-7 FAKE U.2 Sharkfin A hot swap controller max5970 Present None
dev-8 FAKE U.2 A NVMe Basic Management Command nvme_bmc Present None
dev-39 FAKE T6 temperature sensor tmp451 Present None
dev-46 CPU temperature sensor sbtsi Present None
dev-53 FAKE Fan controller max31790 Present None

CABOOSES: none found
Expand Down
22 changes: 22 additions & 0 deletions gateway-test-utils/configs/sp_sim_config.test.toml
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,17 @@ presence = "Present"
sensors = [
{ name = "t6", kind = "Temperature", last_data.value = 70.625, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-46"
device = "sbtsi"
description = "CPU temperature sensor"
capabilities = 2
presence = "Present"
sensors = [
{ name = "CPU", kind = "Temperature", last_data.value = 64.5, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-53"
device = "max31790"
Expand Down Expand Up @@ -222,6 +233,17 @@ presence = "Present"
sensors = [
{ name = "t6", kind = "Temperature", last_data.value = 70.625, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-46"
device = "sbtsi"
description = "CPU temperature sensor"
capabilities = 2
presence = "Present"
sensors = [
{ name = "CPU", kind = "Temperature", last_data.value = 62.6, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-53"
device = "max31790"
Expand Down
40 changes: 40 additions & 0 deletions gateway/src/metrics.rs
Original file line number Diff line number Diff line change
Expand Up @@ -766,6 +766,24 @@ impl SpPoller {
};
let sensor: Cow<'static, str> = Cow::Owned(m.name);

// Temperature measurements which have the sensor name "CPU"
// and are from the SB-TSI device kind are CPU Tctl values.
// These are neither Fahrenheit nor Celsius, but a secret
// third thing, which is not actually a "temperature
// measurement" in the sense that you probably imagined when
// you saw the words "temperature" and "measurement".
// Instead, it's a dimensionless value in the range from
// 0-100 that's calculated by the CPU's thermal control
// loop.
//
// Therefore, we report it as a different metric from the
// `hardware_component:temperature` metric, which is
// in degrees Celsius.
//
// See this comment for details on what Tctl values mean:
// https://github.com/illumos/illumos-gate/blob/6cf3cc9d1e40f89e90135a48f74f03f879fce639/usr/src/uts/intel/io/amdzen/smntemp.c#L21-L57
let is_tctl =
sensor == "CPU" && target.component_kind == "sbtsi";
// First, if there's a measurement error, increment the
// error count metric. We will synthesize a missing sample
// for the sensor's metric as well, after we produce the
Expand All @@ -776,6 +794,9 @@ impl SpPoller {
// cloning it in *case* there's an error.
if let Err(error) = m.value {
let kind = match m.kind {
MeasurementKind::Temperature if is_tctl => {
"amd_cpu_tctl"
}
MeasurementKind::Temperature => "temperature",
MeasurementKind::Current => "current",
MeasurementKind::Voltage => "voltage",
Expand Down Expand Up @@ -827,12 +848,31 @@ impl SpPoller {
// something to produce a datum from both the `Ok` and
// `Error` cases...
let sample = match (m.value, m.kind) {
// The CPU's Tctl value gets reported as a separate
// metric, as it is dimensionless.
(Ok(datum), MeasurementKind::Temperature)
if is_tctl =>
{
Sample::new(
target,
&metric::AmdCpuTctl { sensor, datum },
)
}
// Other measurements with the "temperature" measurement
// kind are physical temperatures that actually exist in
// reality (and are always in Celsius).
(Ok(datum), MeasurementKind::Temperature) => {
Sample::new(
target,
&metric::Temperature { sensor, datum },
)
}
(Err(_), MeasurementKind::Temperature) if is_tctl => {
Sample::new_missing(
target,
&metric::AmdCpuTctl { sensor, datum: 0.0 },
)
}
(Err(_), MeasurementKind::Temperature) => {
Sample::new_missing(
target,
Expand Down
18 changes: 18 additions & 0 deletions gateway/tests/integration_tests/component_list.rs
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,15 @@ async fn component_list() {
.bits(),
presence: SpComponentPresence::Present,
},
SpComponentInfo {
component: "dev-46".to_string(),
device: "sbtsi".to_string(),
serial_number: None,
description: "CPU temperature sensor".to_string(),
capabilities: DeviceCapabilities::HAS_MEASUREMENT_CHANNELS
.bits(),
presence: SpComponentPresence::Present,
},
SpComponentInfo {
component: "dev-53".to_string(),
device: "max31790".to_string(),
Expand Down Expand Up @@ -204,6 +213,15 @@ async fn component_list() {
.bits(),
presence: SpComponentPresence::Present,
},
SpComponentInfo {
component: "dev-46".to_string(),
device: "sbtsi".to_string(),
serial_number: None,
description: "CPU temperature sensor".to_string(),
capabilities: DeviceCapabilities::HAS_MEASUREMENT_CHANNELS
.bits(),
presence: SpComponentPresence::Present,
},
SpComponentInfo {
component: "dev-53".to_string(),
device: "max31790".to_string(),
Expand Down
19 changes: 18 additions & 1 deletion nexus/tests/integration_tests/metrics.rs
Original file line number Diff line number Diff line change
Expand Up @@ -581,6 +581,7 @@ async fn test_mgs_metrics(
let mut input_voltage_sensors = HashMap::new();
let mut input_current_sensors = HashMap::new();
let mut fan_speed_sensors = HashMap::new();
let mut cpu_tctl_sensors = HashMap::new();
for sp in all_sp_configs {
let mut temp = 0;
let mut current = 0;
Expand All @@ -589,11 +590,24 @@ async fn test_mgs_metrics(
let mut input_current = 0;
let mut power = 0;
let mut speed = 0;
let mut cpu_tctl = 0;
for component in &sp.components {
for sensor in &component.sensors {
use gateway_messages::measurement::MeasurementKind as Kind;
match sensor.def.kind {
Kind::Temperature => temp += 1,
Kind::Temperature => {
// Currently, Tctl measurements are reported as a
// "temperature" measurement, but are tracked by a
// different metric, as they are not actually a
// measurement of physical degrees Celsius.
if component.device == "sbtsi"
&& sensor.def.name == "CPU"
{
cpu_tctl += 1
} else {
temp += 1;
}
}
Kind::Current => current += 1,
Kind::Voltage => voltage += 1,
Kind::InputVoltage => input_voltage += 1,
Expand All @@ -610,6 +624,7 @@ async fn test_mgs_metrics(
input_current_sensors.insert(sp.serial_number.clone(), input_current);
fan_speed_sensors.insert(sp.serial_number.clone(), speed);
power_sensors.insert(sp.serial_number.clone(), power);
cpu_tctl_sensors.insert(sp.serial_number.clone(), cpu_tctl);
}

async fn check_all_timeseries_present(
Expand Down Expand Up @@ -745,6 +760,8 @@ async fn test_mgs_metrics(
.await;
check_all_timeseries_present(&cptestctx, "fan_speed", fan_speed_sensors)
.await;
check_all_timeseries_present(&cptestctx, "amd_cpu_tctl", cpu_tctl_sensors)
.await;

// Because the `ControlPlaneTestContext` isn't managing the MGS we made for
// this test, we are responsible for removing its logs.
Expand Down
32 changes: 29 additions & 3 deletions oximeter/oximeter/schema/hardware-component.toml
Original file line number Diff line number Diff line change
Expand Up @@ -92,12 +92,20 @@ description = """
Which kind of sensor could not be read due to a sensor error.
This will be one of 'temperature', 'current', 'power', 'voltage', \
'input_current', 'input_voltage', or 'fan_speed' (the same names as \
the metrics emitted by these sensors when they are read successfully)."""
'input_current', 'input_voltage', 'fan_speed', or 'amd_cpu_tctl' (the same \
names as the metrics emitted by these sensors when they are read \
successfully)."""

[[metrics]]
name = "temperature"
description = "A temperature reading from a hardware component."
description = """
A physical temperature reading from a hardware component.
Note that AMD host CPUs report a control temperature (Tctl) value, which is \
*not* a temperature measurement in degrees Celsius, but a dimensionless \
value that represents how close the CPU is to thermal throttling. This value \
is recorded in a separate `hardware_component:amd_cpu_tctl` metric.\
"""
units = "degrees_celsius"
datum_type = "f32"
versions = [
Expand Down Expand Up @@ -181,3 +189,21 @@ datum_type = "cumulative_u64"
versions = [
{ added_in = 1, fields = ["error"] }
]

[[metrics]]
name = "amd_cpu_tctl"
description = """
A CPU Tctl (control temperature) reading from an AMD CPU.
Note that, unlike other temperature metrics, this is not a measurement of \
temperature in degrees Celsius or Fahrenheit, but is instead a dimensionless \
quantity used by the processor's thermal control loop. This value has two \
notable thresholds:
1. At a value of 95, the CPU will begin internal thermal throttling.
2. At a value of 100, the CPU will shut down."""
units = "none"
datum_type = "f32"
versions = [
{ added_in = 1, fields = ["sensor"]}
]
21 changes: 21 additions & 0 deletions sp-sim/examples/config.toml
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,16 @@ sensors = [
{ name = "Southwest", kind = "Temperature", last_data.value = 41.7890625, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-46"
device = "sbtsi"
description = "CPU temperature sensor"
capabilities = 2
presence = "Present"
sensors = [
{ name = "CPU", kind = "Temperature", last_data.value = 64.5, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet]]
multicast_addr = "ff15:0:1de::2"
bind_addrs = ["[::]:33320", "[::]:33321"]
Expand All @@ -59,6 +69,17 @@ sensors = [
{ name = "Southwest", kind = "Temperature", last_data.value = 41.7890625, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-46"
device = "sbtsi"
description = "CPU temperature sensor"
capabilities = 2
presence = "Present"
sensors = [
{ name = "CPU", kind = "Temperature", last_data.value = 63.1, last_data.timestamp = 1234 },
]



[log]
# Show log messages of this level and more severe
Expand Down

0 comments on commit dbbb77b

Please sign in to comment.