Increase memory for cat_bins #1317

Merged

Conversation

paulzierep
Contributor

Some of our cat_bins jobs were failing, most probably due to memory issues. I was looking for a while for a logic that allows defining how to improve the rule for this (and other) tools. That's what I came up with:

A) Query the tool-memory-per-inputs, but include the job state. I will try to add this option to gxadmin!

WITH job_cte AS (
    SELECT
        j.id,
        j.tool_id,
        j.state -- Include job state in the CTE
    FROM
        job j
    WHERE
        j.tool_id LIKE 'toolshed.g2.bx.psu.edu/repos/iuc/cat_bins/cat_bins/5.2.3+galaxy0'
        -- Removed the state filter to include all states
),
mem_cte AS (
    SELECT
        j.id,
        jmn.metric_value AS memory_used
    FROM
        job_cte j
    JOIN
        job_metric_numeric jmn ON j.id = jmn.job_id
    WHERE
        jmn.plugin = 'cgroup'
        AND
        jmn.metric_name = 'memory.memsw.max_usage_in_bytes'
),
data_cte AS (
    SELECT
        j.id,
        COUNT(jtid.id) AS input_count,
        SUM(d.total_size) AS total_input_size,
        AVG(d.total_size) AS mean_input_size,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY d.total_size) AS median_input_size
    FROM
        job_cte j
    JOIN
        job_to_input_dataset jtid ON j.id = jtid.job_id
    JOIN
        history_dataset_association hda ON jtid.dataset_id = hda.id
    JOIN
        dataset d ON hda.dataset_id = d.id
    GROUP BY
        j.id
)
SELECT
    j.id,
    j.tool_id,
    j.state, -- Include job state in the output
    d.input_count,
    (d.total_input_size / 1024 / 1024)::bigint AS total_input_size_mb,
    (d.mean_input_size / 1024 / 1024)::bigint AS mean_input_size_mb,
    (d.median_input_size / 1024 / 1024)::bigint AS median_input_size_mb,
    (m.memory_used / 1024 / 1024)::bigint AS memory_used_mb,
    (m.memory_used / NULLIF(d.total_input_size, 0))::bigint AS memory_used_per_input_mb,
    (m.memory_used / NULLIF(d.mean_input_size, 0))::bigint AS memory_mean_input_ratio,
    (m.memory_used / NULLIF(d.median_input_size, 0))::bigint AS memory_median_input_ratio
FROM
    job_cte j
JOIN
    mem_cte m ON j.id = m.id
JOIN
    data_cte d ON j.id = d.id
ORDER BY
    j.id DESC;
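
For the plot and numbers in section B below, here is a minimal sketch of how the query output could be pulled into pandas. The connection string, read-only role, and the cat_bins_memory.sql file name are placeholder assumptions for illustration, not part of this PR.

# Sketch: load the query above (saved as cat_bins_memory.sql) into a DataFrame.
# The DSN below is a placeholder for read-only access to the Galaxy database.
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://galaxy_ro@dbhost/galaxy")  # placeholder DSN
query = Path("cat_bins_memory.sql").read_text()  # the SQL shown above

df = pd.read_sql(query, engine)
print(df.groupby("state")["memory_used_mb"].describe())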

B) Plot memory vs total input for all states:

[Figure: memory used vs. total input size for all job states]

10% quantile threshold for the error states: 31.0 MB
Fraction of error-state jobs below the threshold: 0.125
Fraction of ok-state jobs below the threshold: 0.723

It is clear that many of the error-state jobs have larger input sizes.

[Figure: same data with the new input-size threshold applied]

With the new rule, more than 70% of the ok-state jobs would still have run with the default 24 GB of memory, but for almost 90% of the failed jobs the memory would have been increased, giving them a better chance to succeed. Surely some of them could still fail for other reasons.
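
For reference, a sketch of how the threshold and the fractions quoted above could be reproduced from the DataFrame loaded in the sketch further up. The column names follow the aliases in the SQL query, and the link to input_size >= 0.03 assumes TPV expresses input_size in GB.

# Sketch: reproduce the reported threshold and fractions (column names follow
# the aliases in the SQL query; df comes from the loading sketch above).
error_jobs = df[df["state"] == "error"]
ok_jobs = df[df["state"] == "ok"]

# 10% quantile of the total input size of failed jobs (~31 MB, i.e. ~0.03 GB,
# presumably where the input_size >= 0.03 threshold in the rule comes from).
threshold_mb = error_jobs["total_input_size_mb"].quantile(0.10)
print(f"10% quantile threshold for the error states: {threshold_mb:.1f} MB")

for label, jobs in (("error", error_jobs), ("ok", ok_jobs)):
    below = (jobs["total_input_size_mb"] < threshold_mb).mean()
    print(f"Fraction of {label}-state jobs below the threshold: {below:.3f}")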

If that makes sense, I will increase the memory for some more tools based on the same logic.

One question I have is that I cannot observe jobs with higher memory for all failed jobs, so I guess the rule that increases memory for failed jobs is not in place after all, or am I missing something?

@@ -340,6 +340,12 @@ tools:
   toolshed.g2.bx.psu.edu/repos/iuc/fgsea/fgsea/.*:
     # any container should work
     inherits: basic_docker_tool
+  toolshed.g2.bx.psu.edu/repos/iuc/cat_bins/cat_bins/.*:
+    rules:
+      - if: input_size >= 0.03
Member

Can you provide an ID for each rule, please?

Contributor Author

Sorry, but what is a rule ID? I made this based on:

- if: input_size >= 0.01

@bgruening
Member

> One question I have is that I cannot observe jobs with higher memory for all failed jobs, so I guess the rule that increases memory for failed jobs is not in place after all, or am I missing something?

That is a good question.
I think what is happening is that Galaxy writes the requested job requirements to disk and submits the first job. The job runs and crashes. Condor, NOT Galaxy, resubmits the job with higher memory. And because Condor is doing this, Galaxy does not know about it and does not report it back. Does that make sense?

@paulzierep
Contributor Author

> One question I have is that I cannot observe jobs with higher memory for all failed jobs, so I guess the rule that increases memory for failed jobs is not in place after all, or am I missing something?

> That is a good question. I think what is happening is that Galaxy writes the requested job requirements to disk and submits the first job. The job runs and crashes. Condor, NOT Galaxy, resubmits the job with higher memory. And because Condor is doing this, Galaxy does not know about it and does not report it back. Does that make sense?

This would be an explanation, but would this mean that such a job would be stored in the Galaxy DB with e.g. 24 GB of memory, but in reality it used 48 GB? That would make defining good rules really difficult. On the other hand, it makes me question why there are jobs that got more than 24 GB of memory at all.

@bgruening
Member

It depends on what you are looking for. If you look at the allocated memory, I think the allocation from the first run is reported. But we should look at the cgroup-reported value of the actual consumption. That second value is currently broken, though; it's a bug that @sanjaysrikakulam is trying to fix soon.
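
For what it's worth, a hedged sketch of how the two values could be compared per job: it assumes the destination exports GALAXY_MEMORY_MB so that the 'core' job-metrics plugin records the requested allocation as galaxy_memory_mb, and it reuses the SQLAlchemy engine from the loading sketch above.

# Sketch (assumptions: the 'core' job-metrics plugin stores the requested
# allocation as galaxy_memory_mb; 'engine' is the SQLAlchemy engine from the
# loading sketch above). Compares requested vs. cgroup-measured peak memory.
import pandas as pd

COMPARE_SQL = """
SELECT
    j.id,
    j.state,
    alloc.metric_value AS requested_memory_mb,
    (used.metric_value / 1024 / 1024)::bigint AS cgroup_max_usage_mb
FROM job j
JOIN job_metric_numeric alloc
  ON alloc.job_id = j.id
 AND alloc.plugin = 'core'
 AND alloc.metric_name = 'galaxy_memory_mb'
JOIN job_metric_numeric used
  ON used.job_id = j.id
 AND used.plugin = 'cgroup'
 AND used.metric_name = 'memory.memsw.max_usage_in_bytes'
WHERE j.tool_id LIKE 'toolshed.g2.bx.psu.edu/repos/iuc/cat_bins/cat_bins/5.2.3+galaxy0'
"""

compare_df = pd.read_sql(COMPARE_SQL, engine)
# Jobs whose measured peak exceeds the recorded request would hint at a
# Condor-side resubmission (or at the cgroup metric being unreliable).
print(compare_df[compare_df["cgroup_max_usage_mb"] > compare_df["requested_memory_mb"]])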

@paulzierep
Contributor Author

> It depends on what you are looking for. If you look at the allocated memory, I think the allocation from the first run is reported. But we should look at the cgroup-reported value of the actual consumption. That second value is currently broken, though; it's a bug that @sanjaysrikakulam is trying to fix soon.

I guess it probably does not matter much anyway. Whether the rule was applied or not, they needed more memory to succeed, which they have now.

@paulzierep
Contributor Author

Am I free to merge this?

@sanjaysrikakulam
Member

It will be deployed over the weekend.

@sanjaysrikakulam sanjaysrikakulam merged commit 9efd5fb into usegalaxy-eu:master Sep 20, 2024
4 checks passed