
Add mem variables #4692

Merged · 4 commits merged into ESMCI:master from add_mem_variables · Oct 14, 2024
Conversation

jedwards4b (Contributor):
PBS on Derecho requires specifying the memory requirement per node; this PR provides that capability. MEM_PER_TASK and MAX_MEM_PER_NODE are defined in CMEPS and supported here. This is easily extendable to other systems if needed.

Test suite: scripts_regression_tests, ERP_Ln9_P24x3.f45_f45_mg37.QPWmaC6.derecho_intel.cam-outfrq9s_mee_fluxes
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes
User interface changes?:

Update gh-pages html (Y/N)?:
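The arithmetic implied by the description can be sketched as follows. Only the variable names MEM_PER_TASK and MAX_MEM_PER_NODE come from the PR; the helper name, the GB units, and all numbers below are illustrative assumptions, not the PR's actual code:

```python
def pbs_mem_per_node(tasks_per_node, mem_per_task, max_mem_per_node):
    """Memory to request per node: each task needs mem_per_task GB,
    capped at the node's physical limit max_mem_per_node GB."""
    return min(tasks_per_node * mem_per_task, max_mem_per_node)

# Illustrative numbers only: 24 tasks at 2 GB each fits under a 235 GB cap
print(pbs_mem_per_node(24, 2, 235))   # -> 48
# 128 tasks at 2 GB each would exceed the cap, so request the whole node
print(pbs_mem_per_node(128, 2, 235))  # -> 235
```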

@jedwards4b jedwards4b self-assigned this Oct 9, 2024
```diff
@@ -224,11 +224,33 @@ def get_job_overrides(self, job, case):
         if thread_count:
             overrides["thread_count"] = thread_count
         else:
-            total_tasks = case.get_value("TOTALPES") * int(case.thread_count)
+            total_tasks = case.get_value("TOTALPES")
```
Contributor:

This change scares me a lot. Looking a couple of lines below, I see total_tasks being multiplied by thread_count. It makes no sense how that ever worked, because thread_count would have been multiplied in twice. I would approve of the change if that were the only use of total_tasks, but it isn't.

jedwards4b (author):

I removed it because the behavior I saw was that total_tasks was totalpes * thread_count * thread_count (thread_count applied twice). With this removed, total_tasks = totalpes * thread_count, which I think is the intent.
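The double multiplication described here can be traced with illustrative numbers (24 PEs, 3 threads per task; the values are hypothetical, chosen only to make the bug visible):

```python
TOTALPES = 24      # hypothetical PE count
thread_count = 3   # hypothetical threads per task

# Before the fix: thread_count was folded in at assignment...
total_tasks_old = TOTALPES * thread_count
# ...and then multiplied in again further down in get_job_overrides
total_tasks_old *= thread_count   # thread_count applied twice

# After the fix: start from the raw PE count, so the later
# multiplication applies thread_count exactly once
total_tasks_new = TOTALPES
total_tasks_new *= thread_count   # totalpes * thread_count, as intended

print(total_tasks_old, total_tasks_new)  # -> 216 72
```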

Collaborator:

Ah gotcha so this was just a bug then. So task_count from a jobs override is equivalent to TOTALPES?

jedwards4b (author):

Yes - that's correct.

Contributor:

@jedwards4b, I do see that the multiply below on line 229 is definitely wrong if we keep the original code on line 227. Are you saying case.get_value("TOTALPES") already takes threads into account?

Collaborator:

If task_count and TOTALPES are equivalent then this makes more sense.

```python
            total_tasks = case.get_value("TOTALPES")
            thread_count = case.thread_count

        total_tasks *= thread_count

        if int(total_tasks) < case.get_value("MAX_TASKS_PER_NODE"):
            overrides["max_tasks_per_node"] = int(total_tasks)
```

```diff
@@ -224,11 +224,33 @@ def get_job_overrides(self, job, case):
         if thread_count:
             overrides["thread_count"] = thread_count
         else:
-            total_tasks = case.get_value("TOTALPES") * int(case.thread_count)
+            total_tasks = case.get_value("TOTALPES")
+            thread_count = case.thread_count
+        if int(total_tasks) * int(thread_count) < case.get_value("MAX_TASKS_PER_NODE"):
```
Contributor:

Maybe change this to:

```python
if int(total_tasks) < case.get_value("MAX_TASKS_PER_NODE"):
```

jedwards4b (author):

I think that's correct.

```diff
+        try:
+            mem_per_task = case.get_value("MEM_PER_TASK")
+            max_mem_per_node = case.get_value("MAX_MEM_PER_NODE")
+            mem_per_node = total_tasks
```
Contributor:

And this to:

```python
mem_per_node = case.get_value("TOTALPES")
```

jedwards4b (author):

Can't do that, because total_tasks may come from an override; for example, the case.st_archive script overrides total_tasks and changes it to 1.

Contributor:

How about `mem_per_node = total_tasks / thread_count`?
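A sketch of the formula suggested in this thread, assuming total_tasks already includes threads and memory values are in GB. The function name and all numbers are illustrative, not the PR's code; only total_tasks, thread_count, MEM_PER_TASK, and MAX_MEM_PER_NODE correspond to names in the discussion:

```python
def mem_per_node(total_tasks, thread_count, mem_per_task, max_mem_per_node):
    """Per-node memory request: divide threads back out of total_tasks
    to recover the task count, multiply by the per-task requirement,
    and cap at the node's physical limit (all memory values in GB)."""
    tasks = total_tasks // thread_count
    return min(tasks * mem_per_task, max_mem_per_node)

# 72 thread-slots / 3 threads = 24 tasks; 24 * 4 GB = 96 GB requested
print(mem_per_node(72, 3, 4, 235))   # -> 96
# With a larger per-task need the cap kicks in: 24 * 16 = 384 > 235
print(mem_per_node(72, 3, 16, 235))  # -> 235
```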

jgfouca (Contributor) left a review:

much better, thanks!

@jedwards4b jedwards4b merged commit caeb01a into ESMCI:master Oct 14, 2024
7 checks passed
@jedwards4b jedwards4b deleted the add_mem_variables branch October 14, 2024 16:08