Job metric query forgets mem_bw
#69
Comments
Which metric data backend do you use?
I use cc-metric-store.

```json
"metrics": [
  {
    "calc": "1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time",
    "name": "mem_bw",
    "publish": true,
    "unit": "MB/s",
    "scope": "socket"
  }
]
```

Last time I checked, I didn't get any related errors; now I'm getting the following. For the ...
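For orientation only (this is not the poster's actual configuration): the block above looks like a collector-side derived-metric definition, while cc-metric-store additionally lists each metric it should accept in its own config.json, where an aggregation function controls how finer-grained series (e.g. per-socket) can be folded up to node level. A minimal, hypothetical sketch — the field names and values are my assumption from memory, not quoted from this issue:

```json
{
  "metrics": {
    "mem_bw": {
      "frequency": 60,
      "aggregation": "sum"
    }
  }
}
```

Whether this is relevant to the problem at hand is speculation; it is only meant to show where socket-to-node aggregation is configured on the store side.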
Hi again @fodinabor! I am investigating this issue at the moment, and I think I am onto something. Could you please run your job-query as ...

The good news so far is that the issue does not seem to be connected to your cluster-/metric-configuration, as that would prevent the systems view from successfully requesting and displaying the data (at least that's my current insight).
Hi @spacehamster87, thanks for investigating!
Thanks for the feedback! I should've found your gist post myself, actually ... it was worth a shot. The underlying query of the systems/status view and the direct ... Namely, ...

I'll dig some more.
After more digging, reproducing the error/case, and more logging in cc-metric-store, I think I have pinpointed the problem:
This also happens when the archiving starts and requests the latest data to write, in which ...

To verify this, please try the following: ...

The fact that ...
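The reply below confirms that switching the metric to node scope works for new jobs. Purely as an illustration of that change — assuming mem_bw is declared in cc-backend's cluster.json under a metricConfig list, which is my assumption rather than something quoted from this issue — the edited entry might look roughly like this (unit, scope, and timestep taken from the jobMetrics response shown further down):

```json
{
  "name": "mem_bw",
  "unit": "GB/s",
  "scope": "node",
  "timestep": 60
}
```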
Jup, setting it to the node level works for new jobs.

Since job 1685 is archived (and it did not archive ...):

```json
{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found, failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}
```

With cluster.json's ...:

```json
{"data":{"jobMetrics":[{"name":"mem_bw","metric":{"unit":"GB/s","scope":"node","timestep":60,"series":[{"hostname":"thor","statistics":{"min":0.60,"avg":0.71,"max":1.50},"data":[0.80,0.60,1.00,0.60,0.60,0.60,0.60,0.60,0.90,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,1.10,0.60,0.60,0.60,0.60,0.60,0.60,0.70,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.90,0.70,0.70,0.70,0.90,0.70,0.70,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.10,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.80,0.80,0.70,1.00,0.70,0.70,0.70,0.80,0.60,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.20,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.80,0.70,0.70,1.00,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.50,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,1.00,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,null,null,null,0.70,0.70,0.70,0.70,0.70,0.70,0.67,0.69,0.66,0.66,0.67,0.68,0.85,0.66,0.67,0.68,0.67,0.66,0.66,0.68,0.65]}],"statisticsSeries":null}}]},"error":null}
```

Might that be related to us not setting the ...? We don't set that since we (currently) do not pin jobs to threads. Alternatively, we of course could just set that to a list of all hwthreads, getting node granularity after all... 🤷🏼

Edit:

```json
{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}
```
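Regarding the "list of all hwthreads" idea above: in cc-backend's job metadata, the per-host resources entry can carry an explicit hwthread list. A rough sketch, assuming the usual resources shape (hostname plus a hwthreads array — an assumption on my part; "thor" with four threads is just a shortened example):

```json
"resources": [
  {
    "hostname": "thor",
    "hwthreads": [0, 1, 2, 3]
  }
]
```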
Hi again! With this new information, the issue seems to be connected to your config after all: the topology configuration in ...

So for now, I see the following options to solve this issue: ...
As for your edit: as soon as a smaller scope than ...
Some comments:

re 2.: we have the topology in our cluster.json, but we don't set ...

We're still using ...
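Since the topology section of cluster.json keeps coming up: mapping socket- or core-scoped data relies on it knowing which hwthreads belong to which socket, memory domain, and core. A made-up miniature example of such a topology block — key names are as I remember them from cc-backend's schema and should be treated as an assumption, and depending on the cc-backend version this may live per cluster or per subCluster:

```json
"topology": {
  "node": [0, 1, 2, 3],
  "socket": [[0, 1], [2, 3]],
  "memoryDomain": [[0, 1], [2, 3]],
  "core": [[0], [1], [2], [3]]
}
```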
Hi @fodinabor, sorry that this issue has been stalled for some time now! As you've seen, we've been working hard to reach a solid release state. With the recent 1.0.0 release and today's minor 1.1.0 update, I therefore wanted to ask whether the issue still persists, or whether you have found a solution on your side in the meantime.
Hi @spacehamster87, ...
I have a weird issue where cc-backend neither provides nor archives `mem_bw`.

Consider the following query (used on the job list view and the single job view, afaict):

```
/api/jobs/metrics/1685?metric=flops_any&metric=mem_bw&metric=cpu_load&metric=cpu_user&metric=mem_used&metric=clock&metric=cpu_power&metric=acc_utilization&metric=acc_mem_used&metric=acc_power&metric=disk_free&metric=net_bytes_in&metric=net_bytes_out&metric=nfs4_total&metric=nfs3_total&scope=node&scope=core
```

It returns the dump over here, where `mem_bw` is obviously missing.

On the systems view, `mem_bw` is indeed shown, though. `/query` is called with ... The returned JSON can be seen here.

Note: in an archived job from that machine, `mem_bw` is also missing, see here.

My cc-metric-store config contains: ...

cluster.json: ...