Submitting bootstrap run in batches can leave lots of unused compute #695

Open
kylebaron opened this issue May 31, 2024 · 2 comments
Labels
bootstrap Bootstrap development

Comments

@kylebaron
Contributor

kylebaron commented May 31, 2024

I think this is submitting in batches of 100; sometimes we have 30 runs going, sometimes 100. But there are 192 worker cores available, and I'm not sure how we ended up with so many workers.

This set seems to have recruited lots of unneeded compute

30ish runs active

internal:~/project.mrg/xy/current/models/pk/106-boot$ qstat -f |grep amd; qstat |grep -c Run
[email protected] BIP   0/0/16         8.05     lx-amd64
[email protected] BIP   0/4/16         8.19     lx-amd64
[email protected] BIP   0/3/16         8.05     lx-amd64
[email protected] BIP   0/3/16         7.84     lx-amd64
[email protected] BIP   0/2/16         8.14     lx-amd64
[email protected] BIP   0/3/16         8.07     lx-amd64
[email protected] BIP   0/1/16         7.68     lx-amd64
[email protected] BIP   0/3/16         7.61     lx-amd64
[email protected] BIP   0/4/16         8.02     lx-amd64
[email protected] BIP   0/4/16         7.82     lx-amd64
[email protected] BIP   0/2/16         8.00     lx-amd64
[email protected] BIP   0/5/16         7.78     lx-amd64
34 # <--- number of total runs going
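
To put a number on the idle compute, here is a minimal sketch that sums the `resv/used/tot.` column of the snapshot above. It assumes every queue row carries exactly one `resv/used/tot` triple; `slot_usage` is a hypothetical helper written for this illustration, not part of bbr or Grid Engine.

```python
import re

# Rows copied from the `qstat -f | grep amd` snapshot above (queue names as shown).
SNAPSHOT = """\
[email protected] BIP   0/0/16         8.05     lx-amd64
[email protected] BIP   0/4/16         8.19     lx-amd64
[email protected] BIP   0/3/16         8.05     lx-amd64
[email protected] BIP   0/3/16         7.84     lx-amd64
[email protected] BIP   0/2/16         8.14     lx-amd64
[email protected] BIP   0/3/16         8.07     lx-amd64
[email protected] BIP   0/1/16         7.68     lx-amd64
[email protected] BIP   0/3/16         7.61     lx-amd64
[email protected] BIP   0/4/16         8.02     lx-amd64
[email protected] BIP   0/4/16         7.82     lx-amd64
[email protected] BIP   0/2/16         8.00     lx-amd64
[email protected] BIP   0/5/16         7.78     lx-amd64
"""

# Matches the resv/used/tot. column, e.g. "0/4/16".
SLOTS = re.compile(r"\b(\d+)/(\d+)/(\d+)\b")


def slot_usage(qstat_text):
    """Return (used, total) slots summed over every queue row (hypothetical helper)."""
    used = total = 0
    for line in qstat_text.splitlines():
        match = SLOTS.search(line)
        if match:
            _resv, row_used, row_total = map(int, match.groups())
            used += row_used
            total += row_total
    return used, total


used, total = slot_usage(SNAPSHOT)
print(f"{used}/{total} slots in use ({used / total:.0%})")
```

For this snapshot it reports 34 of 192 slots in use, which lines up with the 34 runs counted by `grep -c Run` and the 192 worker cores mentioned above.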

100 runs active

$ qstat -f |grep amd; qstat |grep -c Run
[email protected] BIP   0/8/16         7.59     lx-amd64
[email protected] BIP   0/4/16         7.67     lx-amd64
[email protected] BIP   0/5/16         7.50     lx-amd64
[email protected] BIP   0/8/16         7.56     lx-amd64
[email protected] BIP   0/7/16         7.32     lx-amd64
[email protected] BIP   0/5/16         7.59     lx-amd64
[email protected] BIP   0/7/16         7.28     lx-amd64
[email protected] BIP   0/6/16         7.44     lx-amd64
[email protected] BIP   0/5/16         7.41     lx-amd64
[email protected] BIP   0/5/16         7.43     lx-amd64
[email protected] BIP   0/8/16         7.98     lx-amd64
[email protected] BIP   0/4/16         7.56     lx-amd64
100 # <--- number of total runs going

This set was appropriately scaled, but new jobs can't get scheduled until every run of the previous batch finishes

That's not terrible for this example, which is easy and runs fast, but it won't work for a much more complicated model where run times vary widely and some runs take very long to finish

internal:~$ qstat -f |grep amd; qstat |grep -c Run
[email protected] BIP   0/0/16         5.35     lx-amd64
[email protected] BIP   0/0/16         5.45     lx-amd64
[email protected] BIP   0/0/16         5.44     lx-amd64
[email protected] BIP   0/0/16         6.06     lx-amd64
[email protected] BIP   0/0/16         5.71     lx-amd64
[email protected] BIP   0/1/16         5.60     lx-amd64
[email protected] BIP   0/1/16         5.29     lx-amd64
2
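
To illustrate why skewed run times hurt here, below is a toy comparison of two submission strategies. It is a sketch under stated assumptions, not bbr's actual implementation: the batch size equals the number of available slots, scheduling overhead is ignored, and the run times are made up. `batched_makespan` and `rolling_makespan` are hypothetical names.

```python
import heapq


def batched_makespan(run_times, batch_size):
    """Total wall time when each batch waits for the slowest run of the previous batch."""
    total = 0.0
    for start in range(0, len(run_times), batch_size):
        total += max(run_times[start:start + batch_size])
    return total


def rolling_makespan(run_times, slots):
    """Total wall time when the next run starts as soon as any slot frees up."""
    # Min-heap of the times at which each slot becomes free.
    free_at = [0.0] * slots
    for run_time in run_times:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + run_time)
    return max(free_at)


# Made-up run times: one slow run (10) among fast ones (1) in each batch of 4.
run_times = [10, 1, 1, 1, 10, 1, 1, 1]

print("batched:", batched_makespan(run_times, batch_size=4))
print("rolling:", rolling_makespan(run_times, slots=4))
```

With these illustrative numbers the batched strategy takes 20 time units and the rolling one takes 11, because in the batched case three of the four slots sit idle while each slow run finishes.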

The run ended up with additional compute; I'm not sure why. This isn't an issue for bbr to solve, but I wanted to document that this was happening.

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.46     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.79     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.62     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         5.24     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.46     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.57     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.86     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.58     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         5.94     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.65     lx-amd64
---------------------------------------------------------------------------------
[email protected] BIP   0/0/16         6.00     lx-amd64
@seth127
Collaborator

seth127 commented May 31, 2024

Thanks for capturing this, @kylebaron. Do you think we should move this to an internal Metworx ticket, or is it worth looking into whether the way bbr is submitting models is playing into this?

@seth127 added the bootstrap (Bootstrap development) label May 31, 2024
@kylebaron
Contributor Author

I think this part is relevant to the way bbr is doing it. Run times can be really skewed, so I think this batching strategy will run into problems sooner rather than later:

This set was appropriately scaled, but new jobs can't get scheduled until every run of the previous batch finishes

That's not terrible for this example, which is easy and runs fast, but it won't work for a much more complicated model where run times vary widely and some runs take very long to finish

internal:~$ qstat -f |grep amd; qstat |grep -c Run
[email protected] BIP   0/0/16         5.35     lx-amd64
[email protected] BIP   0/0/16         5.45     lx-amd64
[email protected] BIP   0/0/16         5.44     lx-amd64
[email protected] BIP   0/0/16         6.06     lx-amd64
[email protected] BIP   0/0/16         5.71     lx-amd64
[email protected] BIP   0/1/16         5.60     lx-amd64
[email protected] BIP   0/1/16         5.29     lx-amd64
2
