Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(bpf): exclude swapper process bpf_cpu_time #1830

Merged
merged 1 commit into from
Nov 4, 2024

Conversation

vimalk78
Copy link
Collaborator

when cpu is idle, swapper process is scheduled, and kepler adds its bpf_cpu_time to kernel processes

adding a configuration to exclude swapper process

with-swapper
with-swapper

without-swapper
without-swapper

@vimalk78 vimalk78 requested a review from sthaha October 29, 2024 17:43
Copy link
Contributor

github-actions bot commented Oct 29, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request, "fix(bpf): exclude swapper process bpf_cpu_time", introduces a configuration option to exclude the swapper process's bpf_cpu_time from kernel processes when the CPU is idle.

Key Modifications:

  • Added ExcludeSwapperProcess boolean field to KeplerConfig struct
  • Modified getKeplerConfig() function to include the new field
  • Added ExcludeSwapperProcess() retrieval function
  • Set default value of defaultExcludeSwapperProcess to true

Impact: These changes affect the internal implementation of the code, but do not alter the external interface or behavior.

Observations:

  • The addition of a configuration option provides flexibility for users to customize the behavior of the bpf_cpu_time collection.
  • The default value of true for defaultExcludeSwapperProcess suggests that excluding the swapper process is the recommended behavior.

Suggestions for Improvement:

  • Consider adding a comment or documentation to explain the reasoning behind excluding the swapper process and the implications of enabling/disabling this feature.
  • It would be helpful to include tests to verify the correct behavior of the ExcludeSwapperProcess feature.

@rootfs
Copy link
Contributor

rootfs commented Oct 30, 2024

Instead of disabling idle time accounting, I am thinking of using the idle time to calculate the idle power as the next step.
cc @marceloamaral @sunya-ch

@@ -89,6 +89,11 @@ func UpdateProcessBPFMetrics(bpfExporter bpf.Exporter, processStats map[uint64]*
for _, ct := range processesData {
comm := C.GoString((*C.char)(unsafe.Pointer(&ct.Comm)))

if ct.Pid == 0 && config.ExcludeSwapperProcess() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about add a new metrics called idle cpu time?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a good idea and aligns with user / system / idle time in https://man7.org/linux/man-pages/man5/proc_stat.5.html

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vimalk78 with this patch are you able to get sum(rate(kepler_process_bpf_cpu_time_ms_total[20s]))) == sum(rate(node_cpu_seconds_total{mode=~"user|system"}[20s]))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change, under load condition
image

@rootfs
Copy link
Contributor

rootfs commented Nov 4, 2024

I am merging this now and will come up with a new idle time metrics PR.

@rootfs rootfs enabled auto-merge (squash) November 4, 2024 18:25
@rootfs rootfs disabled auto-merge November 4, 2024 18:26
@rootfs rootfs merged commit 75b9533 into sustainable-computing-io:main Nov 4, 2024
20 checks passed
@rootfs
Copy link
Contributor

rootfs commented Nov 5, 2024

@vimalk78 @sthaha @KaiyiLiu1234 @sunya-ch

I tested this idle time exclusion in metal CI. There is improvement on SGD trainer but not much on exponential or logarithmic trainer.

Please find the new config parameter in kepler metal and VM systemd.

In the trained model validation report, those in Nov 01 are without this change, and those in Nov 04 are with this new config param. The SGD trainer MAPE sees improvement from 51% to 21%

@sthaha
Copy link
Collaborator

sthaha commented Nov 5, 2024

We should see improvement in all models in theory. @sunya-ch is there a way to plot or export all metrics used for training the model ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants