Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFC003 fan off test case #635

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from
Open

CFC003 fan off test case #635

wants to merge 15 commits into from

Conversation

philipandag
Copy link
Contributor

No description provided.

@philipandag philipandag marked this pull request as draft December 17, 2024 09:44
@philipandag philipandag requested a review from miczyg1 January 9, 2025 10:58
@philipandag philipandag force-pushed the cfc-fan-off branch 4 times, most recently from bc433f9 to 024d98f Compare January 13, 2025 14:58
@philipandag philipandag marked this pull request as ready for review January 13, 2025 14:59
dasharo-compatibility/coreboot-fan-control.robot Outdated Show resolved Hide resolved
dasharo-compatibility/coreboot-fan-control.robot Outdated Show resolved Hide resolved
dasharo-compatibility/coreboot-fan-control.robot Outdated Show resolved Hide resolved
dasharo-compatibility/coreboot-fan-control.robot Outdated Show resolved Hide resolved
lib/sensors.robot Show resolved Hide resolved
platform-configs/raptor-cs_talos2.robot Outdated Show resolved Hide resolved
@philipandag philipandag force-pushed the cfc-fan-off branch 3 times, most recently from 767f887 to 805bbff Compare February 5, 2025 11:58
@miczyg1
Copy link
Contributor

miczyg1 commented Feb 10, 2025

@philipandag we have all discussion resolved now. I would say it is ready to go in. But I would like to see the logs fro the tests run from the latest commit from this PR.

@philipandag
Copy link
Contributor Author

custom-fan-curve.robot_log.zip
The logs from the tests after the rebases and review changes - performed on 43a0eac.

Copy link
Contributor

@miczyg1 miczyg1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see the test failed because the fan speed was decreasing slower than the temperature. Temperature can get very low quickly on the sensor, but the fan often needs couple of seconds to slow down.

The test may be failing all the time because of that. I wonder if it is really necessary to measure the fall time of fan speed based o temperature. If anything we may only check if the fan slows down to the expected temperature after 30 seconds for example. Right now I see the stress tests lasts 3m, and measurements are made with 0.05 minute intervals (3 seconds, 60 iterations). That is more than the stress test length, because there are delays caused by execution of test instructions.


IF ${lm_sensors_used} == ${TRUE}
Detect Or Install Package lm-sensors
Execute Command In Terminal sudo sensors-detect --auto
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detecting sensors must happen after modules are loaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

philipandag and others added 9 commits February 14, 2025 13:04
Signed-off-by: Filip Gołaś <[email protected]>

lib/sensors.robot: Aesthetical refactor

Signed-off-by: Filip Gołaś <[email protected]>

Make use of lib/sensors platform config vars

Signed-off-by: Filip Gołaś <[email protected]>

lib/sensors/robot: Use triple quotes for string comparisons

Signed-off-by: Filip Gołaś <[email protected]>

lib/sensors: import sensor config variables

Signed-off-by: Filip Gołaś <[email protected]>

lib/sensors.robot: Use hwmon path from sensors config file directly

Signed-off-by: Filip Gołaś <[email protected]>
Signed-off-by: Filip Gołaś <[email protected]>

cpu-fan-speed-measure.robot: Use lib/sensors

Signed-off-by: Filip Gołaś <[email protected]>

custom-fan-curve.robot: Use lib/sensors

Signed-off-by: Filip Gołaś <[email protected]>
Signed-off-by: Filip Gołaś <[email protected]>

platform-configs: Move sensor configs to yaml files

Signed-off-by: Filip Gołaś <[email protected]>

lib/sensors: Use new yaml sensor configs

Signed-off-by: Filip Gołaś <[email protected]>

sensors configs: Remove unnecessary lists

Signed-off-by: Filip Gołaś <[email protected]>

platform-configs/sensors curves: Add tolerances

Signed-off-by: Filip Gołaś <[email protected]>

platform-configs/include/sensors: Set default RPMS to always valid

Signed-off-by: Filip Gołaś <[email protected]>

configs sensors rename vp66xx -> vpxxxx

Signed-off-by: Filip Gołaś <[email protected]>

vpxxxx-fan-curve-config: More accurate RPM curves

Signed-off-by: Filip Gołaś <[email protected]>
To work using PWM or RPM depending on measurement of which is available

Signed-off-by: Filip Gołaś <[email protected]>

custom-fan-curve.robot: Use tolerances & continue if exceeded

Temperatures out of tolerances might still be useful or even
acceptable as a pass with manual verification

Signed-off-by: Filip Gołaś <[email protected]>

custom-fan-curve.robot: Move common test loop to local KW

Signed-off-by: Filip Gołaś <[email protected]>

custom-fan-curve.robot: More logs

Signed-off-by: Filip Gołaś <[email protected]>

custom-fan-curve.robot: Fail only on multiple invalid values in row

Signed-off-by: Filip Gołaś <[email protected]>
Weren't tested on any other device

Signed-off-by: Filip Gołaś <[email protected]>
The results can still be verified manually and used to adjust the curves

Signed-off-by: Filip Gołaś <[email protected]>
@philipandag
Copy link
Contributor Author

philipandag commented Feb 14, 2025

@miczyg1

Temperature can get very low quickly on the sensor, but the fan often needs couple of seconds to slow down.

At first I thought it is caused by the fan just being a physical object with a mass, but then the RPM would lag by a couple seconds at most. The tests show that the fan stays at ~1600 RPM for 15 seconds. It is definitely something caused by the software controlling it, not the fan itself.

The test may be failing all the time because of that.

If the code controlling the fans introduces some additional lag when slowing down, then wouldn't that just mean that the fan curves are wrong? That the fan behavior is more complex than we assume it is in the test suite?
Are you sure that after some time the fans will eventually slow down? Wouldn't we want to test how much time it takes for them to stop if it is controlled by firmware?

@@ -33,7 +34,8 @@ CFN001.001 CPU temperature and fan speed can be read (Debian)
Boot System Or From Connected Disk ubuntu
Login To Linux
Switch To Root User
${rpm} ${temperature}= Get CPU Temperature And CPU Fan Speed
Copy link
Contributor Author

@philipandag philipandag Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miczyg1
We have decided that the slowing down of the fans should not be included in the test.

In that case I would make the stress-ng timeout very long and kill the process once the measurements are done.
It will be more reliable than guessing how by much executing the test instructions will affect the time span across which the measurements are performed.

It seemed to work for a short 1m test. I'll rerun the whole suite to make sure everything works.
https://github.com/Dasharo/open-source-firmware-valida
custom-fan-curve.robot_log.zip
https://github.com/Dasharo/open-source-firmware-validation/compare/91df194%5E..91df194

Copy link
Contributor Author

@philipandag philipandag Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not measuring the fan when it slows down would make the tests useless. It will only measure how the fans behave when the CPU is at it's highest temperature, while the point is to cover the whole temperature range.

Too bad I believed that the tests were making sense before, because they were not.
The duration and measurement interval are given in minutes, while even one minute is very long for such test.
The tests are waiting for 5 seconds after starting the stress, which effectively skips the temperature rising.
Originally the test duration was around 30 minutes, both the fan speed and CPU temperature must have been at their peaks for the whole time.

I don't know what was the original intent behind those tests, but trying to add a new test case was a bad idea, as they definitely had a completely different purpose than what we would like to achieve here.

To make the PASSes actually usefull we should try to either increase the temperature of the CPU slowly, for example by increasing the number of workers as time passes, or make the measurements as quickly and often and possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the fans are in fact not slowing down below ~1600 RPM when on performance curve.
I've left it be for a couple minutes and the RPMs don't fall

root@3mdeb:/home/ubuntu# sensors | grep -E "fan|C"
Adapter: PCI adapter
Adapter: PCI adapter
Composite:    +27.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +27.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +28.9°C  (low  = -273.1°C, high = +65261.8°C)
fan1:        1679 RPM  (min =   28 RPM)
fan2:        1704 RPM  (min =   10 RPM)
temp1:        +29.0°C  (low  =  +0.0°C, high = +127.0°C)  sensor = Intel PECI
temp2:       -128.0°C  (low  = +89.0°C, high = -126.0°C)
temp3:       -128.0°C  (low  = -47.0°C, high = +30.0°C)
Package id 0:  +26.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +21.0°C  (high = +100.0°C, crit = +100.0°C)
Core 4:        +24.0°C  (high = +100.0°C, crit = +100.0°C)
Core 8:        +22.0°C  (high = +100.0°C, crit = +100.0°C)
Core 9:        +22.0°C  (high = +100.0°C, crit = +100.0°C)
Core 10:       +22.0°C  (high = +100.0°C, crit = +100.0°C)
Core 11:       +22.0°C  (high = +100.0°C, crit = +100.0°C)
Core 12:       +23.0°C  (high = +100.0°C, crit = +100.0°C)
Core 13:       +23.0°C  (high = +100.0°C, crit = +100.0°C)
Core 14:       +23.0°C  (high = +100.0°C, crit = +100.0°C)
Core 15:       +23.0°C  (high = +100.0°C, crit = +100.0°C)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time it takes for the CPU to jump from ~25C to ~80C is less than 0.1s. We can not test this feature without some kind of gradation of the load.

Unless we want to test the curve when the cpu cools down. It should be fine provided the fans will actually slow down on the performance curve. There is either a bug in the code controlling the fans or we cannot physically achieve the CPU temperatures required to test the performance curve.

If no sensor is able to read the actual CPU temperature used to control the fans, then where the code controlling the fans gets the temperature from?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants