[tpv_db_optimizer] Automatically update resource requests based on u… #64

Open
nuwang wants to merge 2 commits into main

Conversation

@nuwang (Member) commented Jun 24, 2024

…segalaxy historical data.

Needs manual vetting.

This repository contains the code for generating automated updates: https://github.com/nuwang/tpv-db-optimizer
Once we refine the details, we can transfer the repo over to the galaxyproject.

These scripts rely on the existence of some materialized views in the usegalaxy.* databases, which are defined here:
https://github.com/nuwang/tpv-db-optimizer/blob/0cc2ba687b709120295476fd7da1f1500a551bdd/views.py#L2

The script is therefore currently run against a consolidated database that was manually assembled. It would probably be better if we could run it directly against the usegalaxy.* databases, assuming the materialized views exist and are periodically refreshed.

These views could potentially be used for gxy.io/KUI as well.
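For concreteness, a rough sketch of the kind of per-job materialized view this could be (illustrative only: the real definitions are in the views.py link above; the view name here is made up, and it assumes Galaxy's job and job_metric_numeric tables and the cgroups v1 memory.max_usage_in_bytes metric):

-- Hypothetical sketch, not the actual definition from views.py:
-- one row per finished job with its peak cgroup memory in GB.
CREATE MATERIALIZED VIEW job_peak_memory AS
SELECT
    j.id                      AS job_id,
    j.tool_id,
    j.create_time,
    m.metric_value / 1024.0^3 AS peak_mem_gb  -- bytes -> GB
FROM job j
JOIN job_metric_numeric m ON m.job_id = j.id
WHERE j.state = 'ok'
  AND m.plugin = 'cgroup'
  AND m.metric_name = 'memory.max_usage_in_bytes';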

nuwang changed the title from [tpv_auto_optimize] Automatically update resource requests based on u… to [tpv_db_optimizer] Automatically update resource requests based on u… on Jun 29, 2024
nuwang requested a review from bgruening on July 28, 2024 12:11
@sanjaysrikakulam (Collaborator):

At first glance, I felt:

  1. I think the default mem should be at least 3.8G: both the max and the adjusted_mem should be above the default mem (something like 3.8 or 4 GB). Allocating less mem than that might cause unforeseen errors and issues, and having a 1 or 2G buffer is good in all cases in my opinion. Most of the changes seem to set mem to less than 1 or 2G.
  2. The mem changes made to some of the existing tools (ones already in the TPV shared DB) look ok, as most are close to what was already defined. However, there are a few exceptions where the new settings seem very low (3 to 4 times lower), and I am a bit suspicious of those. For example, in signalp3, the change is from 10G to 0.67G (several others are in the same range). Similarly, we must double-check the schicexplorer_* tools. I think by setting the minimum mem to 3.8G, we can give them a try and increase when necessary (a possible floor-plus-buffer calculation is sketched below).
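To make the buffer/floor idea concrete, a minimal sketch of how an adjusted value could be derived (purely illustrative: the 1.2 multiplier and 3.8 GB floor are example numbers, not agreed values, and job_peak_memory refers to the hypothetical per-job view sketched under the PR description above):

-- Illustrative only: per-tool raw maximum, then a multiplicative buffer plus a minimum floor.
SELECT
    tool_id,
    MAX(peak_mem_gb)                                           AS raw_max_gb,
    GREATEST(3.8, ROUND((MAX(peak_mem_gb) * 1.2)::numeric, 2)) AS adjusted_mem_gb
FROM job_peak_memory
GROUP BY tool_id;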

@bgruening (Member) left a comment:

Thanks a lot @nuwang ... I added a few comments and flagged a few tools that I find suspicious. Not sure how we should proceed with those.

testtoolshed.g2.bx.psu.edu/repos/simon-gladman/phyloseq_filter/biom_filter/.*:
- mem: 60.8
+ mem: 58.16
Member:

@nuwang are those the RAW values? Or did you already add some wiggle room?

I would propose rounding it up a bit and maybe adding a comment such as "# raw recom: 58.16", or something similar.

@nuwang (Member Author) Aug 8, 2024:

@@ -157,13 +268,13 @@ tools:
mem: 60
toolshed.g2.bx.psu.edu/repos/bgruening/hicexplorer_chicqualitycontrol/hicexplorer_chicqualitycontrol/.*:
cores: 20
- mem: 60
+ mem: 9.01
Member:

This is interesting ... I'm a bit sceptical about that, but maybe we should try.

@@ -271,24 +400,46 @@ tools:
cores: 16
mem: 36
toolshed.g2.bx.psu.edu/repos/bgruening/rdock_sort_filter/rdock_sort_filter/.*:
- mem: 90
+ mem: 4.67
Member:

I would assume we had a reason for this. Do we take the filesize into account? I'm surprised, but I guess we just try again.

toolshed.g2.bx.psu.edu/repos/bgruening/rxdock_sort_filter/rxdock_sort_filter/.*:
- mem: 90
+ mem: 4.67
Member:

See above.

mem: 2.51
toolshed.g2.bx.psu.edu/repos/devteam/cummerbund_to_tabular/cummerbund_to_cuffdiff/.*:
mem: 0.48
toolshed.g2.bx.psu.edu/repos/devteam/data_manager_bowtie2_index_builder/bowtie2_index_builder_data_manager/.*:
Member:

I would take the max for data_managers, or create rules.

@@ -780,7 +1248,7 @@ tools:
mem: 8
toolshed.g2.bx.psu.edu/repos/galaxyp/openms_idfileconverter/IDFileConverter/.*:
cores: 4
- mem: 8
+ mem: 0.64
Member:

Flagging this one ... looks abnormal.

@@ -898,7 +1372,7 @@ tools:
cores: 4
mem: 8
toolshed.g2.bx.psu.edu/repos/galaxyp/openms_openswathworkflow/OpenSwathWorkflow/.*:
- mem: 156
+ mem: 14.01
Member:

Ouch, I guess this one needs a rule?

@@ -2342,29 +3255,43 @@ tools:
mem: 20
toolshed.g2.bx.psu.edu/repos/iuc/vardict_java/vardict_java/.*:
cores: 2
- mem: 128
+ mem: 36.5
Member:

We had a reason for this ... if only I could remember it. A rule is probably needed here.

@@ -2406,22 +3383,171 @@ tools:
toolshed.g2.bx.psu.edu/repos/peterjc/seq_select_by_id/seq_select_by_id/.*:
mem: 8
toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/signalp3/.*:
- mem: 10
+ mem: 0.67
Member:

flag

@@ -2441,35 +3569,63 @@ tools:
cores: 10
mem: 20
toolshed.g2.bx.psu.edu/repos/rnateam/paralyzer/paralyzer/.*:
- mem: 8
+ mem: 0.59
Member:

flag

@nuwang (Member Author) commented Aug 8, 2024

Thanks @sanjaysrikakulam and @bgruening for reviewing; this feedback is super useful for identifying cases to zoom into. Currently, the memory usage is calculated based on the max recorded cgroup memory usage across the past year. Only one year was used because older cgroup metrics are quite unreliable: some of them appear to capture memory for the entire node, yielding incorrect results.

The ones that have dramatically lower max values than currently recorded do need further investigation, but in my previous spot checks they were mostly correct, at least as far as the cgroup metrics were concerned. I will check the ones you have flagged in greater detail and report back.

I think there's still a fair bit more work to do to make sure we are recording correct metrics across the entire federation for CPU, memory and, if possible, perhaps even GPU. We should also merge this outstanding PR soon so we can more easily track actual TPV mem allocations: https://github.com/galaxyproject/tpv-shared-database/pull/63/files
If we can get these issues fixed, I think we should be able to enhance the scripts for CPU optimization as well, and rerun, say, mid to late next year with even more data.
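For illustration, the per-tool maximum described above could be pulled with a query along these lines (a sketch under the same assumptions as the hypothetical job_peak_memory view sketched earlier; the job count is included only as a rough signal for rarely-run tools):

-- Illustrative only: peak cgroup memory per tool over the past year,
-- with a job count to spot tools that simply were not run much.
SELECT
    tool_id,
    COUNT(*)         AS jobs_last_year,
    MAX(peak_mem_gb) AS max_mem_gb
FROM job_peak_memory
WHERE create_time >= NOW() - INTERVAL '1 year'
GROUP BY tool_id
ORDER BY max_mem_gb DESC;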

@nuwang (Member Author) commented Aug 8, 2024

> At first glance, I felt:

> 1. I think the default mem should be at least 3.8G: both the max and the adjusted_mem should be above the default mem (something like 3.8 or 4 GB). Allocating less mem than that might cause unforeseen errors and issues, and having a 1 or 2G buffer is good in all cases in my opinion. Most of the changes seem to set mem to less than 1 or 2G.

I agree with your point: we don't want to cause mass failures and a huge admin headache. Perhaps we can take a more incremental approach, experimentally lower just a few tools, and see how well we do. The problem with a minimum allocation of, say, 4GB is that frequently used tools that don't actually require much memory would result in substantial wastage in aggregate.

> 2. The mem changes made to some of the existing tools (ones already in the TPV shared DB) look ok, as most are close to what was already defined. However, there are a few exceptions where the new settings seem very low (3 to 4 times lower), and I am a bit suspicious of those. For example, in signalp3, the change is from 10G to 0.67G (several others are in the same range). Similarly, we must double-check the schicexplorer_* tools. I think by setting the minimum mem to 3.8G, we can give them a try and increase when necessary.

Agree, those are really weird. One possibility that comes to mind is that the captured metrics are somehow wrong. Another is that the tool just wasn't used much during the past year, and hence whatever it was used for didn't need much memory. I'll investigate the ones you've both flagged and see what the older data says across the federation.

@sanjaysrikakulam (Collaborator):

> At first glance, I felt:

> 1. I think the default mem should be at least 3.8G: both the max and the adjusted_mem should be above the default mem (something like 3.8 or 4 GB). Allocating less mem than that might cause unforeseen errors and issues, and having a 1 or 2G buffer is good in all cases in my opinion. Most of the changes seem to set mem to less than 1 or 2G.

> I agree with your point: we don't want to cause mass failures and a huge admin headache. Perhaps we can take a more incremental approach, experimentally lower just a few tools, and see how well we do. The problem with a minimum allocation of, say, 4GB is that frequently used tools that don't actually require much memory would result in substantial wastage in aggregate.

An incremental approach would be a nice way to move forward. I agree about the aggregate wastage that may arise from setting a minimum allocation of, say, 4GB. Maybe we can start with a minimum allocation of 2.5 GB and then increase if and when necessary, case by case.

> 2. The mem changes made to some of the existing tools (ones already in the TPV shared DB) look ok, as most are close to what was already defined. However, there are a few exceptions where the new settings seem very low (3 to 4 times lower), and I am a bit suspicious of those. For example, in signalp3, the change is from 10G to 0.67G (several others are in the same range). Similarly, we must double-check the schicexplorer_* tools. I think by setting the minimum mem to 3.8G, we can give them a try and increase when necessary.

> Agree, those are really weird. One possibility that comes to mind is that the captured metrics are somehow wrong. Another is that the tool just wasn't used much during the past year, and hence whatever it was used for didn't need much memory. I'll investigate the ones you've both flagged and see what the older data says across the federation.

Yup, I had the same thought regarding the metrics. I was unaware that the calculations were limited to last year's data. It could be because of either of these.

Thank you!

@sanjaysrikakulam (Collaborator) commented Aug 13, 2024

@nuwang What do you think of the following idea?

  1. Address all the flagged entries
  2. Merge this into a new branch to test it in production. EU can deploy it live and see what happens, because we feel that any number we set as a "minimum" (2.5GB or 4GB) will still be arbitrary, and rather than adding more delays to this PR, we can try it out for a week. I also think that most of these "new" entries probably used the default mem of 3.8G defined in the tool_defaults (I compared a few new entries to what we have in our TPV and see that we have not defined any explicit mem for these, though for some we do have explicit rules; at the same time, there are so many new entries being added in this PR that we do not have in our TPV, meaning they are using the tool_defaults).

@nuwang (Member Author) commented Aug 13, 2024

@sanjaysrikakulam That sounds good to me. It'll take me a bit of time to address No. 1, because of some other pressing commitments, but I'll ping you when it's done. I like the plan you've outlined in No. 2. One thing - we should record the date we go live so that we can track actual vs expected gains. I'm hoping we can collectively author a paper on this.

@bgruening (Member):

@nuwang another thing that I wanted to mention but always forget: when we last looked at this and figured out that our cgroup values were off, we integrated https://usegalaxy.eu/?tool_id=stress_ng. With this tool you can request memory and CPU exactly. Our idea was to run it with different params and compare what cgroups and Galaxy report.

Maybe that is also useful for you.
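As a sketch of that comparison (illustrative only: the metric names are taken from the job_metric_numeric output shown further down, the tool_id pattern is a guess, and under cgroups v2 the memory metric may simply be missing, which is exactly the kind of mismatch such a query would surface):

-- Illustrative only: memory Galaxy allocated vs. the peak the cgroup recorded, per stress_ng job.
SELECT
    j.id AS job_id,
    MAX(m.metric_value) FILTER (WHERE m.plugin = 'core'
                                  AND m.metric_name = 'galaxy_memory_mb') AS galaxy_memory_mb,
    MAX(m.metric_value) FILTER (WHERE m.plugin = 'cgroup'
                                  AND m.metric_name = 'memory.max_usage_in_bytes')
        / 1024.0^2                                                        AS cgroup_peak_mb
FROM job j
JOIN job_metric_numeric m ON m.job_id = j.id
WHERE j.tool_id LIKE '%stress_ng%'  -- hypothetical pattern for the stress_ng tool id
GROUP BY j.id;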

@sanjaysrikakulam (Collaborator):

> @sanjaysrikakulam That sounds good to me. It'll take me a bit of time to address No. 1, because of some other pressing commitments, but I'll ping you when it's done.

Sure! Let me know.

> I like the plan you've outlined in No. 2. One thing - we should record the date we go live so that we can track actual vs expected gains. I'm hoping we can collectively author a paper on this.

Certainly! We need to track and monitor. A paper will be great, I look forward to it!

@sanjaysrikakulam (Collaborator) commented Aug 14, 2024

> @nuwang another thing that I wanted to mention but always forget: when we last looked at this and figured out that our cgroup values were off, we integrated https://usegalaxy.eu/?tool_id=stress_ng. With this tool you can request memory and CPU exactly. Our idea was to run it with different params and compare what cgroups and Galaxy report.

> Maybe that is also useful for you.

I tried stress_ng; for whatever reason, job_metric_numeric is not recording the mem usage:

Galaxy command_line: stress-ng --timeout '10s' --cpu '4' --cpu-ops '4' --fork '4' --hdd '4' --vm '4' '' '' '' '' '' 2>&1 | tee "/data/jwd02f/main/072/458/72458662/outputs/dataset_f777687d-8a13-4567-acb3-6669c63cc40e.dat

galaxy=> select * from job_metric_numeric where job_id=72458662;
    id     |  job_id  | plugin |     metric_name      |    metric_value
-----------+----------+--------+----------------------+--------------------
 872402152 | 72458662 | core   | galaxy_slots         |          1.0000000
 872402153 | 72458662 | core   | galaxy_memory_mb     |       3891.0000000
 872402154 | 72458662 | core   | start_epoch          | 1723616153.0000000
 872402155 | 72458662 | core   | end_epoch            | 1723616205.0000000
 872402156 | 72458662 | core   | runtime_seconds      |         52.0000000
 872402157 | 72458662 | cgroup | cpu.stat.usage_usec  |   65951046.0000000
 872402158 | 72458662 | cgroup | cpu.stat.user_usec   |   41643975.0000000
 872402159 | 72458662 | cgroup | cpu.stat.system_usec |   24307071.0000000

I am not sure why this could be. Also, I do not see the Job Metrics section in the Galaxy UI. I also think the TPV conf needs to be updated for this tool to allow dynamic allocation of cores and mem based on user input, because it currently uses the TPV tool defaults (1 core and 3.8G mem).

-- Edit --
I ran the job for a longer duration (20 minutes); in that case I can see the Job Metrics section in the Galaxy UI for that job. However, the cgroup memory is still not getting recorded. Could it be a bug? I do see the cgroup snippet in the galaxy_<job_id>.sh script.

Cgroup snippet from the Galaxy job script:

if [ -e "/proc/$$/cgroup" -a -d "/sys/fs/cgroup" -a ! -f "/sys/fs/cgroup/cgroup.controllers" ]; then
    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '($2=="cpuacct,cpu") || ($2=="cpu,cpuacct") {print $3}');
    
    if [ ! -e "/sys/fs/cgroup/cpu$cgroup_path/cpuacct.usage" ]; then
        cgroup_path="";
    fi;
    
    for f in /sys/fs/cgroup/{cpu\,cpuacct,cpuacct\,cpu}$cgroup_path/{cpu,cpuacct}.*; do
        if [ -f "$f" ]; then
            echo "__$(basename $f)__" >> /data/jwd05e/main/072/002/72002437/__instrument_cgroup__metrics;
            cat "$f" >> /data/jwd05e/main/072/002/72002437/__instrument_cgroup__metrics 2>/dev/null;
        fi;
    done;
    
    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '$2=="memory"{print $3}');
    
    if [ ! -e "/sys/fs/cgroup/memory$cgroup_path/memory.max_usage_in_bytes" ]; then
        cgroup_path="";
    fi;
    
    for f in /sys/fs/cgroup/memory$cgroup_path/memory.*; do
        echo "__$(basename $f)__" >> /data/jwd05e/main/072/002/72002437/__instrument_cgroup__metrics;
        cat "$f" >> /data/jwd05e/main/072/002/72002437/__instrument_cgroup__metrics 2>/dev/null;
    done;
fi

if [ -e "/proc/$$/cgroup" -a -f "/sys/fs/cgroup/cgroup.controllers" ]; then
    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '($1=="0") {print $3}');
    
    for f in /sys/fs/cgroup/${cgroup_path}/{cpu,memory}.*; do
        echo "__$(basename $f)__" >> /data/jwd05e/main/072/002/72002437/__instrument_cgroup__metrics;
        cat "$f" >> /data/jwd05e/main/072/002/72002437/__instrument_cgroup__metrics 2>/dev/null;
    done;
fi

We have cgroups v2 in use on the workers. Since we do not store the JWDs of successful jobs on the EU instance, I will test this on the ESG Galaxy instance to see whether the __instrument_cgroup__metrics file gets created. If not, I will create an issue upstream, and we can continue the discussion there.

@nuwang (Member Author) commented Aug 14, 2024

Thanks both for pointing to this. For Slurm, it's also necessary to configure cgroup settings so that a new cgroup is created before executing the job: usegalaxy-au/infrastructure#1374. I imagine something similar is necessary for HTCondor? You're still using that, right?
