
Change condor base cgroup path #309

Conversation

Member

@sanjaysrikakulam commented Sep 4, 2024

I hijacked a stuck VM from the old cloud, played around with it for a while, and tested various ways to resolve our issue with Galaxy not recording Cgroup stats.

I raised this issue with the HTCondor community; here is the email thread for reference:

Hey Matthias,

Thank you for sharing! I thought of something similar to your script as a "quick fix" to resolve the problem temporarily.

Clarification:

The "cgroup.subtree_control" under "/sys/fs/cgroup/" and "/sys/fs/cgroup/system.slice" are created correctly.

Our BASE_CGROUP = system.slice/condor.service

Basically:

/sys/fs/cgroup/
    ├── cgroup.controllers 
    ├── cgroup.subtree_control
    ├── system.slice/
        ├── cgroup.controllers
        ├── cgroup.subtree_control
        ├── condor.service/
            ├── cgroup.controllers
            ├── cgroup.subtree_control (empty)
            └── <HTCondor jobs/subgroups>/
                ├── cgroup.controllers (empty)
                └── cgroup.subtree_control (empty)

I hope this adds more clarity to my question. I am not sure why HTCondor is not inheriting the parent "cgroup.subtree_control" correctly from "system.slice"; this is probably why the job/subgroup-specific dirs are not getting configured properly. I will set up a test instance and see whether the "quick fix" works for me. I hope someone has a fix for our problem.
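
For reference, a minimal way to inspect this inheritance chain on a worker is to walk the three levels and compare the two files (a sketch; the paths follow the layout above):

for d in /sys/fs/cgroup /sys/fs/cgroup/system.slice /sys/fs/cgroup/system.slice/condor.service; do
    echo "== $d"
    echo "controllers:     $(cat $d/cgroup.controllers)"
    echo "subtree_control: $(cat $d/cgroup.subtree_control)"
done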

On 8/15/2024 5:18 PM, Matthias Schnepf wrote:
> Hi,
>
> I'm not sure why the "cgroup.subtree_control" file at point 6 of your list is empty, or what manages it (condor or systemd, I think).
> We have a similar problem where the cgroup controllers do not get set correctly.
> I hope someone else has an idea to fix your/our problem with the empty "cgroup.subtree_control" file.
>
> But here an idea of our "quick fix" we currently use.
> We use the development version of condor (23.7.2) and RHEL8.
> Our condor settings for cgroup v2 are:
>
> BASE_CGROUP = htcondor
> CGROUP_MEMORY_LIMIT_POLICY = custom
> CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory
> CGROUP_LOW_MEMORY_LIMIT = 0.75 * Target.RequestMemory
>
> The job cgroups are created in /sys/fs/cgroup/htcondor. We set the cgroup.subtree_control file via a cronjob at boot time.
>
>
> #!/bin/bash
>
> echo +cpu +cpuset +memory +pids >> /sys/fs/cgroup/cgroup.subtree_control
> export cgroup_name="/sys/fs/cgroup/htcondor"
> if [ ! -d ${cgroup_name} ]; then
>     mkdir ${cgroup_name}
> fi
> echo +cpu +cpuset +memory +pids >> /sys/fs/cgroup/htcondor/cgroup.subtree_control
>
> With that, the CPU, memory, and pids controllers are set for the htcondor cgroup and its jobs/subgroups, and condor sets the correct memory limits and CPU weights and monitors the memory.
>
> Best regards,
>
> Matthias
>
>
> On 8/15/24 4:47 PM, Sanjay Kumar Srikakulam wrote:
>> Hi,
>>
>> We run an HTCondor cluster and recently noticed we are missing the Cgroups accounting. Our setup,
>>
>> HTCondor:
>>
>> $CondorVersion: 23.0.6 2024-03-14 BuildID: 720565 PackageID: 23.0.6-1 $
>> $CondorPlatform: x86_64_AlmaLinux9 $
>>
>> 1. We are using Rocky 9 on workers
>> 2. CgroupV2 is mounted on the workers
>> 3. The CgroupV2 "cgroup.controllers" file has the list: "cpuset cpu io memory hugetlb pids rdma misc"
>> 4. HTCondor is configured to use CGroups:
>>
>> BASE_CGROUP = system.slice/condor.service
>> CGROUP_MEMORY_LIMIT_POLICY = hard
>> RESERVED_MEMORY = 2048
>>
>> 5. I can see the "condor.service" directory under "/sys/fs/cgroup/system.slice"
>> 6. HTCondor inherits the parent controllers properly: I see the "cgroup.controllers" file, and it has the same list of controllers as the parent (above). However, the "cgroup.subtree_control" file is empty (the parent has the list of controller names, so this is not getting created or inherited properly)
>> 7. As per the HTCondor doc (https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#cgroup-based-process-tracking), once BASE_CGROUP is defined, every condor job gets a dedicated dir in the BASE_CGROUP path for cgroup accounting. When jobs are submitted, I see the subdirectories "condor_var_lib_condor_execute_slot1_7@hostname". However, the "cgroup.controllers" file is empty in these sub-directories and is somehow not inheriting from the parent. Similarly, the "cgroup.subtree_control" file is also empty.
>>
>> 8. We also added the "CREATE_CGROUP_WITHOUT_ROOT = True" to our HTCondor config and restarted the condor services without luck.
>> 9. Also, from the starter log: "StarterLog.slot1_1:08/15/24 14:21:09 (pid:3758318) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/condor.service/cgroup.subtree_control: Device or resource busy", HTCondor seems to be hitting the "no internal processes" rule (https://unix.stackexchange.com/questions/680167/ebusy-when-trying-to-add-process-to-cgroup-v2; https://manpath.be/f35/7/cgroups#L557).
>>
>> Any help on resolving this is much appreciated! 

I tried several ways to update Condor's Cgroup config to make it inherit the root Cgroup controllers and subtree_control, but nothing helped.
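
For context, the EBUSY from point 9 in the email above is consistent with the cgroup v2 "no internal processes" rule: a non-root cgroup that still has member processes cannot have domain controllers enabled in its cgroup.subtree_control. A minimal sketch of how this shows up on an affected worker (using the condor.service path from the thread):

# The condor daemons themselves are members of this cgroup ...
cat /sys/fs/cgroup/system.slice/condor.service/cgroup.procs

# ... so enabling controllers for its children is rejected by the kernel
echo "+cpu +memory +pids" > /sys/fs/cgroup/system.slice/condor.service/cgroup.subtree_control \
    || echo "write failed: Device or resource busy"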

The simple solution is to change the BASE_CGROUP path to htcondor, so the job cgroups live under /sys/fs/cgroup/htcondor instead of under system.slice/condor.service. The systemd-controlled cgroups under /sys/fs/cgroup/system.slice are not easy to change or tweak. With the BASE_CGROUP placed directly under the Cgroup root (/sys/fs/cgroup), the htcondor child cgroup inherits the controllers and subtree_control configuration from its parent:

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ cat htcondor/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ cat htcondor/cgroup.subtree_control
cpu io memory pids
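
For completeness, the EP configuration change itself is only the BASE_CGROUP line; the other cgroup-related settings stay as they were (a sketch; the config.d filename is illustrative):

# /etc/condor/config.d/99-cgroup.conf (illustrative path)
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = hard
RESERVED_MEMORY = 2048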

I also ran a test job as Galaxy and submitted the job below to this test machine.

Universe = vanilla
Executable = test_job_cgroup.sh
Log = test_job_cgroup.log
Output = test_job_cgroup.out
Error = test_job_cgroup.err
Request_cpus = 1
requirements = (Machine == "vgcnbwc-worker-c120m225-test-0000.novalocal")
Queue
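
The job was submitted in the usual way (assuming the submit description above is saved as test_job_cgroup.sub; the filename is illustrative):

condor_submit test_job_cgroup.sub   # queue the job
condor_q                            # confirm it matched the test worker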

test_job_cgroup.sh script (the Cgroup snippet is taken from Galaxy; it is what Galaxy adds to every job script):

#!/bin/bash
echo "Hello World"
echo $(hostname)

sleep 60

for ((i=1; i<=10000000; i++)); do
    :
done

# Cgroup stuff added by Galaxy to each job script
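# First block: cgroup v1 layout (no cgroup.controllers file at the cgroup root)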
if [ -e "/proc/$$/cgroup" -a -d "/sys/fs/cgroup" -a ! -f "/sys/fs/cgroup/cgroup.controllers" ]; then
    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '($2=="cpuacct,cpu") || ($2=="cpu,cpuacct") {print $3}');

    if [ ! -e "/sys/fs/cgroup/cpu$cgroup_path/cpuacct.usage" ]; then
        cgroup_path="";
    fi;

    for f in /sys/fs/cgroup/{cpu\,cpuacct,cpuacct\,cpu}$cgroup_path/{cpu,cpuacct}.*; do
        if [ -f "$f" ]; then
            echo "__$(basename $f)__" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics;
            cat "$f" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics 2>/dev/null;
        fi;
    done;

    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '$2=="memory"{print $3}');

    if [ ! -e "/sys/fs/cgroup/memory$cgroup_path/memory.max_usage_in_bytes" ]; then
        cgroup_path="";
    fi;

    for f in /sys/fs/cgroup/memory$cgroup_path/memory.*; do
        echo "__$(basename $f)__" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics;
        cat "$f" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics 2>/dev/null;
    done;
fi

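# Second block: cgroup v2 unified hierarchy (cgroup.controllers present at the cgroup root)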
if [ -e "/proc/$$/cgroup" -a -f "/sys/fs/cgroup/cgroup.controllers" ]; then
    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '($1=="0") {print $3}');

    echo "$cgroup_path"
    ls -la /sys/fs/cgroup/${cgroup_path}/
    for f in /sys/fs/cgroup/${cgroup_path}/{cpu,memory}.*; do
        echo "__$(basename $f)__" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics;
        cat "$f" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics 2>/dev/null;
    done;
fi

sleep 10

Upon checking the job-specific Cgroup on the test host, we can see that the child Cgroup is being created, and it successfully inherits the controllers from the parent.

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ ll htcondor/
total 0
-r--r--r--. 1 root root 0 Sep  4 11:56 cgroup.controllers
-r--r--r--. 1 root root 0 Sep  4 11:56 cgroup.events
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.freeze
--w-------. 1 root root 0 Sep  4 11:56 cgroup.kill
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.max.depth
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.procs
-r--r--r--. 1 root root 0 Sep  4 11:56 cgroup.stat
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.threads
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.type
drwxr-xr-x. 2 root root 0 Sep  4 11:56 condor_var_lib_condor_execute_slot1_1@vgcnbwc-worker-c120m225-test-0000.novalocal
-rw-r--r--. 1 root root 0 Sep  4 11:56 cpu.idle
.....

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ cat htcondor/condor_var_lib_condor_execute_slot1_1@vgcnbwc-worker-c120m225-test-0000.novalocal/cgroup.controllers
cpu io memory pids

Here is the Cgroups output from the test job

__instrument_cgroup__metrics

__cpu.idle__
0
__cpu.max__
max 100000
__cpu.max.burst__
0
__cpu.stat__
usage_usec 31291908
user_usec 25429593
system_usec 5862315
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0
__cpu.weight__
100
__cpu.weight.nice__
0
__memory.current__
2408448
__memory.events__
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0
__memory.events.local__
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0
__memory.high__
max
__memory.low__
0
__memory.max__
134217728
__memory.min__
0
__memory.numa_stat__
anon N0=241664
file N0=45056
kernel_stack N0=16384
pagetables N0=36864
sec_pagetables N0=0
shmem N0=0
file_mapped N0=0
file_dirty N0=0
file_writeback N0=0
swapcached N0=0
anon_thp N0=0
file_thp N0=0
shmem_thp N0=0
inactive_anon N0=221184
active_anon N0=4096
inactive_file N0=40960
active_file N0=4096
unevictable N0=0
slab_reclaimable N0=34936
slab_unreclaimable N0=129288
workingset_refault_anon N0=0
workingset_refault_file N0=0
workingset_activate_anon N0=0
workingset_activate_file N0=0
workingset_restore_anon N0=0
workingset_restore_file N0=0
workingset_nodereclaim N0=0
__memory.oom.group__
1
__memory.peak__
3530752
__memory.reclaim__
__memory.stat__
anon 237568
file 45056
kernel 761856
kernel_stack 16384
pagetables 32768
sec_pagetables 0
percpu 0
sock 0
vmalloc 0
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 217088
active_anon 4096
inactive_file 40960
active_file 4096
unevictable 0
slab_reclaimable 34936
slab_unreclaimable 136120
slab 171056
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 0
pgsteal 0
pgscan_kswapd 0
pgscan_direct 0
pgsteal_kswapd 0
pgsteal_direct 0
pgfault 5852
pgmajfault 1
pgrefill 0
pgactivate 1
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 0
thp_collapse_alloc 0
__memory.swap.current__
0
__memory.swap.events__
high 0
max 0
fail 0
__memory.swap.high__
max
__memory.swap.max__
max
__memory.zswap.current__
0
__memory.zswap.max__
max

NOTE: we need to redeploy all workers so that they properly report Cgroup stats to Galaxy; we have been missing Cgroup stats in the Galaxy DB table job_metric_numeric for almost a year.

The above issue was briefly discussed here.

@bgruening
Member

I cannot say that I understand everything, but it all sounds clear. +1 from my side.

Please talk to Manuel; he will also redeploy new images to the cloud hosts in the next few days, so we can piggyback on that and deploy new VM images as well.

Thanks @sanjaysrikakulam

@sanjaysrikakulam
Member Author

> I cannot say that I understand everything, but it all sounds clear. +1 from my side.
>
> Please talk to Manuel; he will also redeploy new images to the cloud hosts in the next few days, so we can piggyback on that and deploy new VM images as well.
>
> Thanks @sanjaysrikakulam

Yup, that's my plan as well.

Contributor

@mira-miracoli left a comment


Thank you! I think BASE_CGROUP=htcondor should be a safe option; it is also the default value according to the documentation.

@sanjaysrikakulam merged commit 25198b7 into usegalaxy-eu:main Sep 5, 2024
5 checks passed