Increase memory for cat_bins #1317

Merged

Conversation

paulzierep
Contributor

Some of our cat_bins jobs were failing, most probably due to memory issues. I was looking for a while for a logic that allows defining how to improve the rule for this (and other) tools. That's what I came up with:

A) Query the tool-memory-per-inputs, but include the job state. I will try to add this option to gxadmin!

WITH job_cte AS (
    SELECT
        j.id,
        j.tool_id,
        j.state -- Include job state in the CTE
    FROM
        job j
    WHERE
        j.tool_id LIKE 'toolshed.g2.bx.psu.edu/repos/iuc/cat_bins/cat_bins/5.2.3+galaxy0'
        -- Removed the state filter to include all states
),
mem_cte AS (
    SELECT
        j.id,
        jmn.metric_value AS memory_used
    FROM
        job_cte j
    JOIN
        job_metric_numeric jmn ON j.id = jmn.job_id
    WHERE
        jmn.plugin = 'cgroup'
        AND
        jmn.metric_name = 'memory.memsw.max_usage_in_bytes'
),
data_cte AS (
    SELECT
        j.id,
        COUNT(jtid.id) AS input_count,
        SUM(d.total_size) AS total_input_size,
        AVG(d.total_size) AS mean_input_size,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY d.total_size) AS median_input_size
    FROM
        job_cte j
    JOIN
        job_to_input_dataset jtid ON j.id = jtid.job_id
    JOIN
        history_dataset_association hda ON jtid.dataset_id = hda.id
    JOIN
        dataset d ON hda.dataset_id = d.id
    GROUP BY
        j.id
)
SELECT
    j.id,
    j.tool_id,
    j.state, -- Include job state in the output
    d.input_count,
    (d.total_input_size / 1024 / 1024)::bigint AS total_input_size_mb,
    (d.mean_input_size / 1024 / 1024)::bigint AS mean_input_size_mb,
    (d.median_input_size / 1024 / 1024)::bigint AS median_input_size_mb,
    (m.memory_used / 1024 / 1024)::bigint AS memory_used_mb,
    (m.memory_used / NULLIF(d.total_input_size, 0))::bigint AS memory_used_per_input_mb,
    (m.memory_used / NULLIF(d.mean_input_size, 0))::bigint AS memory_mean_input_ratio,
    (m.memory_used / NULLIF(d.median_input_size, 0))::bigint AS memory_median_input_ratio
FROM
    job_cte j
JOIN
    mem_cte m ON j.id = m.id
JOIN
    data_cte d ON j.id = d.id
ORDER BY
    j.id DESC;
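
For the plot and numbers in section B below, here is a minimal sketch of how the query output could be pulled into pandas. The connection string, read-only role, and the cat_bins_memory.sql file name are placeholder assumptions for illustration, not part of this PR.

# Sketch: load the query above (saved as cat_bins_memory.sql) into a DataFrame.
# The DSN below is a placeholder for read-only access to the Galaxy database.
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://galaxy_ro@dbhost/galaxy")  # placeholder DSN
query = Path("cat_bins_memory.sql").read_text()  # the SQL shown above

df = pd.read_sql(query, engine)
print(df.groupby("state")["memory_used_mb"].describe())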

B) Plot memory vs total input for all states:

[Figure: memory used vs. total input size for all job states]

10% quantile threshold for the error states: 31.0 MB
Fraction of error-state jobs below the threshold: 0.125
Fraction of ok-state jobs below the threshold: 0.723

It is clear that many of the error-state jobs have larger input sizes.

[Figure: same data with the new input-size threshold applied]

With the new rule, more than 70% of the ok-state jobs would still have run with the default 24 GB of memory, but for almost 90% of the failed jobs the memory would have been increased, giving them a better chance to succeed. Surely some of them could still fail for other reasons.
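
For reference, a sketch of how the threshold and the fractions quoted above could be reproduced from the DataFrame loaded in the sketch further up. The column names follow the aliases in the SQL query, and the link to input_size >= 0.03 assumes TPV expresses input_size in GB.

# Sketch: reproduce the reported threshold and fractions (column names follow
# the aliases in the SQL query; df comes from the loading sketch above).
error_jobs = df[df["state"] == "error"]
ok_jobs = df[df["state"] == "ok"]

# 10% quantile of the total input size of failed jobs (~31 MB, i.e. ~0.03 GB,
# presumably where the input_size >= 0.03 threshold in the rule comes from).
threshold_mb = error_jobs["total_input_size_mb"].quantile(0.10)
print(f"10% quantile threshold for the error states: {threshold_mb:.1f} MB")

for label, jobs in (("error", error_jobs), ("ok", ok_jobs)):
    below = (jobs["total_input_size_mb"] < threshold_mb).mean()
    print(f"Fraction of {label}-state jobs below the threshold: {below:.3f}")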

If that makes sense, I will increase the memory for some more tools based on the same logic.

One question I have is that I cannot observe jobs with higher memory for all failed jobs, so I guess the rule that increases memory for failed jobs is not in place after all, or am I missing something?

@@ -340,6 +340,12 @@ tools:
   toolshed.g2.bx.psu.edu/repos/iuc/fgsea/fgsea/.*:
     # any container should work
     inherits: basic_docker_tool
+  toolshed.g2.bx.psu.edu/repos/iuc/cat_bins/cat_bins/.*:
+    rules:
+      - if: input_size >= 0.03
Member

Can you provide an ID for each rule, please?

Contributor Author

Sorry, but what is a rule ID? I made this based on:

- if: input_size >= 0.01

@bgruening
Member

> One question I have is that I cannot observe jobs with higher memory for all failed jobs, so I guess the rule that increases memory for failed jobs is not in place after all, or am I missing something?

That is a good question.
I think what is happening is that Galaxy writes the requested job requirements to disk and submits the first job. The job runs and crashes. Condor, NOT Galaxy, resubmits the job with higher memory. And because Condor is doing this, Galaxy does not know about it and does not report it back. Does that make sense?

@paulzierep
Contributor Author

> One question I have is that I cannot observe jobs with higher memory for all failed jobs, so I guess the rule that increases memory for failed jobs is not in place after all, or am I missing something?

> That is a good question. I think what is happening is that Galaxy writes the requested job requirements to disk and submits the first job. The job runs and crashes. Condor, NOT Galaxy, resubmits the job with higher memory. And because Condor is doing this, Galaxy does not know about it and does not report it back. Does that make sense?

This would be an explanation, but would this mean that such a job would be stored in the Galaxy DB with e.g. 24 GB of memory, but in reality it used 48 GB? That would make defining good rules really difficult. On the other hand, it makes me question why there are jobs that got more than 24 GB of memory at all.

@bgruening
Member

It depends on what you are looking for. If you look at the allocated memory, I think the allocation from the first run is reported. But we should look at the cgroup-reported value of the actual consumption. That second value is currently broken, though; it's a bug that @sanjaysrikakulam is trying to fix soon.
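
For what it's worth, a hedged sketch of how the two values could be compared per job: it assumes the destination exports GALAXY_MEMORY_MB so that the 'core' job-metrics plugin records the requested allocation as galaxy_memory_mb, and it reuses the SQLAlchemy engine from the loading sketch above.

# Sketch (assumptions: the 'core' job-metrics plugin stores the requested
# allocation as galaxy_memory_mb; 'engine' is the SQLAlchemy engine from the
# loading sketch above). Compares requested vs. cgroup-measured peak memory.
import pandas as pd

COMPARE_SQL = """
SELECT
    j.id,
    j.state,
    alloc.metric_value AS requested_memory_mb,
    (used.metric_value / 1024 / 1024)::bigint AS cgroup_max_usage_mb
FROM job j
JOIN job_metric_numeric alloc
  ON alloc.job_id = j.id
 AND alloc.plugin = 'core'
 AND alloc.metric_name = 'galaxy_memory_mb'
JOIN job_metric_numeric used
  ON used.job_id = j.id
 AND used.plugin = 'cgroup'
 AND used.metric_name = 'memory.memsw.max_usage_in_bytes'
WHERE j.tool_id LIKE 'toolshed.g2.bx.psu.edu/repos/iuc/cat_bins/cat_bins/5.2.3+galaxy0'
"""

compare_df = pd.read_sql(COMPARE_SQL, engine)
# Jobs whose measured peak exceeds the recorded request would hint at a
# Condor-side resubmission (or at the cgroup metric being unreliable).
print(compare_df[compare_df["cgroup_max_usage_mb"] > compare_df["requested_memory_mb"]])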

@paulzierep
Contributor Author

> It depends on what you are looking for. If you look at the allocated memory, I think the allocation from the first run is reported. But we should look at the cgroup-reported value of the actual consumption. That second value is currently broken, though; it's a bug that @sanjaysrikakulam is trying to fix soon.

I guess it probably does not matter much anyway. Whether the rule was applied or not, they needed more memory to succeed, which they have now.

@paulzierep
Contributor Author

Am I free to merge this?

@sanjaysrikakulam
Member

It will be deployed over the weekend.

@sanjaysrikakulam sanjaysrikakulam merged commit 9efd5fb into usegalaxy-eu:master Sep 20, 2024
4 checks passed