Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable resource monitoring in cgroup v2 machines #3316

Merged
merged 6 commits into from
Feb 19, 2025

Conversation

nagworld9
Copy link
Contributor

@nagworld9 nagworld9 commented Feb 6, 2025

Description

This PR just enable the resource monitoring in v2(> ubuntu 22) machines.

Couple of distinctions between V1 and V2

  1. By default, CPU throttle counter not present in v2, it's only available when we use quotas. So, now throttle time tracked if quota set.
  2. Cgroup.procs file is not present in the slice cgroup, only in leaf cgroup in v2, as opposed to v1. This file used to determine if cgroup is active or not. In the case of extension, we track metrics at the slice level but processes in scope cgroup. Now we check in scope too.

Issue #


PR information

  • Ensure development PR is based on the develop branch.
  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@nagworld9 nagworld9 changed the title Enable resource monitoring in v2 Enable resource monitoring in cgroup v2 machines Feb 6, 2025
@@ -389,73 +502,8 @@ def log_root_paths(self):
else:
log_cgroup_info("The {0} controller is mounted at {1}".format(controller, mount_point))

def start_extension_command(self, extension_name, command, cmd_name, timeout, shell, cwd, env, stdout, stderr,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to abstract class

@@ -407,41 +423,6 @@ def _create_all_files(files_to_create):
CGroupConfigurator._Impl._cleanup_unit_file(unit_file)
return

@staticmethod
def _get_current_cpu_quota(unit_name):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to util class

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

posted some minor comments/questions

@@ -250,12 +294,81 @@ def log_root_paths(self):
"""
raise NotImplementedError()

def start_extension_command(self, extension_name, command, cmd_name, timeout, shell, cwd, env, stdout, stderr,
error_code=ExtensionErrorCodes.PluginUnknownFailure):
def get_enabled_controllers_at_root(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise NotImplementedError()?

@@ -250,12 +294,81 @@ def log_root_paths(self):
"""
raise NotImplementedError()

def start_extension_command(self, extension_name, command, cmd_name, timeout, shell, cwd, env, stdout, stderr,
error_code=ExtensionErrorCodes.PluginUnknownFailure):
def get_enabled_controllers_at_root(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to get_controllers_enabled_at_root?

@@ -351,6 +351,22 @@ def _reset_agent_cgroup_setup(self):
except Exception as err:
logger.warn("Error while resetting the quotas: {0}".format(err))

def _is_enforcement_supported(self, controller):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not check whether enforcement is supported, but rather whether we are enforcing limits using that controller, right? if yes, can we rename the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does two things

  1. If Agent supports enforcement (like feature support)
  2. If above is true, it will check controller is mounted/enabled to enforce the limits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is 1 checked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ex: CPU enforcement not supported in v2 yet but supported in v1

            if controller == "cpu":
                if self.using_cgroup_v2():
                    return False
                elif "cpu,cpuacct" in self._cgroups_api.get_controllers_enabled_at_root():
                    return True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, that is my point. this is not checking for "supported" (e.g. CPU Quotas are supported on V2); it's checking whether we are using the feature or not. name is misleading

Copy link
Contributor Author

@nagworld9 nagworld9 Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to _can_use_controller_for_enforcement

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

finally:
with self._systemd_run_commands_lock:
self._systemd_run_commands.remove(process.pid)
def get_enabled_controllers_at_root(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this to abstract class too? And Maybe we should add a comment in the abstract class that this only returns the agent supported controllers enabled at root (cpu and/or memory)

@@ -351,6 +351,22 @@ def _reset_agent_cgroup_setup(self):
except Exception as err:
logger.warn("Error while resetting the quotas: {0}".format(err))

def _is_enforcement_supported(self, controller):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving this to the controller implementations?
CpuControllerV1 -> return True
CpuControllerV2 -> return False
_MemoryController -> return False

Then during cgroup initialization, we can just check controller.is_enforcement_supported()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, but for extensions we setup the limits before we start the extension, and we will not have controllers objects to get this information

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about making them static in the controller implementation and checking via cgroupapi?
If you think that overcomplicates then I'm ok with leaving the check here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i followed, sorry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each controller implementation has a static can_enforce() check:
CpuControllerV1 -> return True
CpuControllerV2 -> return False
_MemoryController -> return False

Cgroupapi implementations have a can_enforce_cpu() check:
CgroupApiV1 -> return CgroupV1.CPU_CONTROLLER in self._cgroup_mountpoints and CpuControllerV1.can_enforce()
CgroupApiV2 -> return CgropuV2.CPU_CONTROLLER in self._controllers_enabled_at_root and CpuControllerV2.can_enforce()

Then cgroupconfigurator just calls self._cgroupapi.can_enforce_cpu()

^ This allows us to avoid these if/else conditions, but I'm also ok with leaving as is if this overcomplicates

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can even simply without static can_enforce(), not sure if we need this

Cgroupapi implementations have a can_enforce_cpu() check:
CgroupApiV1 -> return CgroupV1.CPU_CONTROLLER in self._cgroup_mountpoints
CgroupApiV2 -> return False

can_enforce_memory() -> return False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

@@ -164,7 +164,7 @@ def _found_cgroup_procs(file):
return True

# In v2, the cgroup.procs file is present in the scope cgroup for extensions
for cgroup_file in glob.iglob(os.path.join(self.path, "*/cgroup.procs")):
for cgroup_file in glob.iglob(os.path.join(self.path, "*.scope/cgroup.procs")):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to match more specifically the scopes we create?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we create scopes for extension cmds we run. It will be like

install_xxx.scope
enable_xxx.scope
update_xxx.scope

more specific would be matching all cmds that we support. Even though it's limited list, I don't know if we want to do that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"*" would match any of those 3 examples, and any other scope, if there is ever one.

my ask is if we can make it match only the scope that we know we created

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, never mind, sorry, your original change was ok. we want to continue tracking as long as there are any processes, no matter who created them, so * is OK

Copy link

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 81.35593% with 22 lines in your changes missing coverage. Please review.

Project coverage is 72.69%. Comparing base (3aebcdd) to head (f08f6d5).
Report is 362 commits behind head on develop.

Files with missing lines Patch % Lines
azurelinuxagent/ga/cgroupapi.py 76.11% 12 Missing and 4 partials ⚠️
azurelinuxagent/ga/cgroupcontroller.py 82.35% 2 Missing and 1 partial ⚠️
azurelinuxagent/ga/cgroupconfigurator.py 77.77% 1 Missing and 1 partial ⚠️
azurelinuxagent/ga/cpucontroller.py 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3316      +/-   ##
===========================================
+ Coverage    71.97%   72.69%   +0.71%     
===========================================
  Files          103      114      +11     
  Lines        15692    17200    +1508     
  Branches      2486     2304     -182     
===========================================
+ Hits         11295    12504    +1209     
- Misses        3881     4139     +258     
- Partials       516      557      +41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

narrieta
narrieta previously approved these changes Feb 14, 2025
@@ -176,6 +172,43 @@ def test_initialize_should_not_enable_cgroups_when_the_agent_is_not_in_the_syste
self.assertEqual(len(tracked), 0, "No cgroups should be tracked. Tracked: {0}".format(tracked))
self.assertFalse(os.path.exists(agent_drop_in_file_cpu_quota), "{0} should not have been created".format(agent_drop_in_file_cpu_quota))

def test_initialize_should_not_enable_cgroups_v2(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_initialize_should_not_enable_cgroups_v2(self):
def test_initialize_should_enable_cgroups_v2(self):

should enable cgroups v2?

@nagworld9 nagworld9 merged commit 1080ad0 into Azure:develop Feb 19, 2025
11 checks passed
@nagworld9 nagworld9 deleted the v2-monitor branch February 19, 2025 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants