Job metric query forgets mem_bw #69

Open
fodinabor opened this issue Nov 24, 2022 · 11 comments

@fodinabor
Contributor

I have a weird issue where cc-backend neither provides nor archives mem_bw.

Consider the following query (used on the job list view and the single job view, afaict):
/api/jobs/metrics/1685?metric=flops_any&metric=mem_bw&metric=cpu_load&metric=cpu_user&metric=mem_used&metric=clock&metric=cpu_power&metric=acc_utilization&metric=acc_mem_used&metric=acc_power&metric=disk_free&metric=net_bytes_in&metric=net_bytes_out&metric=nfs4_total&metric=nfs3_total&scope=node&scope=core

It returns the dump over here, from which mem_bw is obviously missing.

On the systems view, mem_bw is indeed shown, though.
/query is called with

{
  "query": "query ($cluster: String!, $nodes: [String!], $from: Time!, $to: Time!) {\n  nodeMetrics(cluster: $cluster, nodes: $nodes, from: $from, to: $to) {\n    host\n    subCluster\n    metrics {\n      name\n      metric {\n        timestep\n        scope\n        series {\n          statistics {\n            min\n            avg\n            max\n          }\n          data\n        }\n      }\n    }\n  }\n}\n",
  "variables": {
    "cluster": "test",
    "nodes": [
      "thera"
    ],
    "from": "2022-11-24T07:59:17.196Z",
    "to": "2022-11-24T08:29:17.196Z"
  }
}

The returned JSON can be seen here.

Note that in an archived job from that machine, mem_bw is also missing, see here.

My cc-metric-store config contains:

"mem_bw":           { "frequency": 60, "aggregation": "sum" },

cluster.json:

        {
            "name": "mem_bw",
            "scope": "socket",
            "unit": "GB/s",
            "timestep": 60,
            "aggregation": "sum",
            "peak": 350,
            "normal": 100,
            "caution": 50,
            "alert": 10
        },
@moebiusband73
Member

Which metric data backend do you use?
The only idea I have is that mem_bw is already missing there, but you said it is shown in the systems view.
I can only speculate; maybe the socket scope is missing?
Is there anything in the log?

@fodinabor
Contributor Author

I use cc-metric-store.
The collector is configured with, e.g., the following, which I'd say should provide the socket scope?

"metrics": [
                    {
                        "calc": "1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time",
                        "name": "mem_bw",
                        "publish": true,
                        "unit": "MB/s",
                        "scope": "socket"
                    }
                ]

Last time I checked, I didn't get any related errors; now I'm getting the following. For /api/jobs/metrics/1636 it complains about the missing data, while further down, the systems-view query is again happy (except for the missing cpu_power, which I indeed do not currently collect on that node).

Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /monitoring/job/1636 (200, 1.25kb, 1ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /global.css (200, 0.48kb, 0ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /uPlot.min.css (200, 0.76kb, 0ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/job.css (200, 0.32kb, 0ms)
Nov 25 12:08:33 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/job.js (200, 95.86kb, 156ms)
Nov 25 12:08:34 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 2.64kb, 1ms)
Nov 25 12:08:34 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [ERROR]   partial error: cc-metric-store: failed to fetch 'mem_bw' from host 'thera': metric or host not found, failed to fetch 'mem_bw' from host 'thera': metric or host not found, failed to fetch 'm>
Nov 25 12:08:35 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [INFO]    GET /api/jobs/metrics/1636?metric=flops_any&metric=mem_bw&metric=cpu_load&metric=cpu_user&metric=mem_used&metric=clock&metric=cpu_power&metric=acc_utilization&metric=acc_mem_used&metric=acc_>
Nov 25 12:08:50 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /img/logo.png (200, 15.67kb, 1ms)
Nov 25 12:08:51 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /favicon.png (200, 10.46kb, 0ms)
Nov 25 12:08:57 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [INFO]    map[analysis_view_histogramMetrics:[flops_any mem_bw acc_utilization] analysis_view_scatterPlotMetrics:[[flops_any mem_bw] [flops_any cpu_load] [cpu_load mem_bw]] job_view_nodestats_selected>
Nov 25 12:08:57 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /monitoring/node/test/thera (200, 1.26kb, 1ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /global.css (200, 0.48kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /uPlot.min.css (200, 0.76kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/node.css (200, 0.11kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/node.js (200, 70.80kb, 163ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /img/logo.png (200, 15.67kb, 1ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 2.39kb, 4ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [ERROR]   partial error: cc-metric-store: fetching cpu_power for node thera failed: metric or host not found
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 1.11kb, 5ms)

@spacehamster87
Contributor

spacehamster87 commented Jan 13, 2023

Hi again @fodinabor! I am investigating this issue at the moment, and I think I am onto something.

Could you please run your job-query as /api/jobs/metrics/1685?, i.e. without any query parameters, and then check for the mem_bw field?

The good news so far is that the issue does not seem to be connected to your cluster/metric configuration, as that would prevent the systems view from successfully requesting/displaying the data (at least that's my current insight).

@fodinabor
Contributor Author

Hi @spacehamster87, thanks for investigating!
If I run that query without any parameters, I don't get mem_bw either, see the gist.

@spacehamster87
Contributor

Thanks for the feedback! I should have found your gist post myself, actually ... it was worth a shot.

The underlying queries of the systems/status view and the direct jobs/metrics/{id} API both use GraphQL afaict, but with slightly different methods in the backend, which might be the reason for the two different results.

Namely LoadNodeData() @ metricdata/metricdata.go:211 for systems/status and LoadData() @ metricdata/metricdata.go:78 for the jobs/metrics/{id}-API.

I'll dig some more.

@spacehamster87
Contributor

spacehamster87 commented Jan 16, 2023

After more digging, reproducing the error/case, and more logging in cc-metric-store, I think I have pinpointed the problem:

  • If querying in the job view, the smallest granularity defined in the cluster.json is requested; in mem_bw's case: socket. But if the requested granularity cannot be provided by cc-metric-store, it returns the aforementioned error instead.
  • The systems view always and only requests the always-available node granularity for each hostname, and thus also returns data for mem_bw.

This also happens when archiving starts and requests the latest data to write, in which case mem_bw will not return data either.

To verify this, please try the following:

  1. Set mem_bw granularity in the cluster.json to node, restart cc-backend, then check the query result.
  2. With mem_bw granularity set to socket in the cluster.json, query the API with /api/jobs/metrics/1685?metric=mem_bw&scope=socket (a command-line sketch follows below).
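
A minimal sketch of check 2. as a command line, assuming the REST API is reachable at https://your-cc-backend:8080 and protected with a JWT bearer token in $CC_JWT (both are placeholders; only the path and query parameters come from the step above):

# Hypothetical host and token; only the endpoint path and parameters are from check 2. above.
curl -H "Authorization: Bearer $CC_JWT" \
  "https://your-cc-backend:8080/api/jobs/metrics/1685?metric=mem_bw&scope=socket"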

The fact that cc-metric-store returns an error and no usable data if the requested granularity does not match should definitely be handled via a new issue in the respective repo.

@fodinabor
Contributor Author

fodinabor commented Jan 17, 2023

Yup, setting it to the node level works for new jobs.

Since job 1685 is archived (and it did not archive mem_bw), the query just returns empty.
For another (currently running) job, with cluster.json's mem_bw scope set to socket, /api/jobs/metrics/8129?metric=mem_bw&scope=socket returns:

{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found, failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}

With cluster.json's mem_bw scope set to node the same query returns:

{"data":{"jobMetrics":[{"name":"mem_bw","metric":{"unit":"GB/s","scope":"node","timestep":60,"series":[{"hostname":"thor","statistics":{"min":0.60,"avg":0.71,"max":1.50},"data":[0.80,0.60,1.00,0.60,0.60,0.60,0.60,0.60,0.90,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,1.10,0.60,0.60,0.60,0.60,0.60,0.60,0.70,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.90,0.70,0.70,0.70,0.90,0.70,0.70,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.10,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.80,0.80,0.70,1.00,0.70,0.70,0.70,0.80,0.60,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.20,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.80,0.70,0.70,1.00,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.50,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,1.00,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,null,null,null,0.70,0.70,0.70,0.70,0.70,0.70,0.67,0.69,0.66,0.66,0.67,0.68,0.85,0.66,0.67,0.68,0.67,0.66,0.66,0.68,0.65]}],"statisticsSeries":null}}]},"error":null}

Might that be related to us not setting the hwthreads?
I.e. /query's job/resources/0/hwthreads is null.

We don't set that since we (currently) do not pin jobs to threads. Alternatively, we could of course just set it to a list of all hwthreads, getting node granularity after all... 🤷🏼

Edit:
Interestingly, setting the mem_bw scope to socket and just querying the node scope, /api/jobs/metrics/8129?metric=mem_bw&scope=node, also fails:

{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}

@spacehamster87
Contributor

spacehamster87 commented Jan 18, 2023

Hi again! With this new information, the issue seems to be connected to your config after all: the topology configuration in cluster.json should resemble this example, especially regarding the arrays for node, socket, memoryDomain and hwthread. The latter is still mentioned as core, but was renamed a while back; the linked example seems out of date ...

So for now, I see the following options to solve this issue:

  1. Use the node scope for mem_bw - which is more of a workaround than a solution.
  2. Add hwthreads to the topology (a minimal sketch follows this list) and re-check the configuration files of the whole stack. We are happy to have a look as well if you can provide your files.
  3. Check which granularity is sent by the cc-metric-collector by directly querying the cc-metric-store API.
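
For illustration, a minimal topology sketch for a hypothetical node with two sockets and eight hardware threads; the CPU IDs are made up, and whether the thread-level array is named core or hwthread depends on the schema version (see below):

        "topology": {
            "node": [0, 1, 2, 3, 4, 5, 6, 7],
            "socket": [[0, 1, 2, 3], [4, 5, 6, 7]],
            "memoryDomain": [[0, 1, 2, 3], [4, 5, 6, 7]],
            "core": [[0], [1], [2], [3], [4], [5], [6], [7]]
        }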

As for your edit: As soon as a smaller scope than node is set in the config files, cc-backend will try to request that scope, and then calculate the "actually requested" scope from the returned data. This is probably why socket as a requested scope for mem_bw fails, as it requires hwthread-data.
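
To illustrate that aggregation step (a rough sketch only, not cc-backend's actual implementation): with mem_bw configured as "aggregation": "sum", per-socket series would be collapsed into a single node-scope series roughly like this:

package main

import "fmt"

// Rough sketch only (not cc-backend's actual code): collapse per-socket
// series into a single node-scope series using the "sum" aggregation
// configured for mem_bw in cluster.json.
func sumToNodeScope(socketSeries [][]float64) []float64 {
	if len(socketSeries) == 0 {
		return nil
	}
	node := make([]float64, len(socketSeries[0]))
	for _, series := range socketSeries {
		for i, v := range series {
			node[i] += v
		}
	}
	return node
}

func main() {
	// Two sockets, three timesteps of hypothetical mem_bw values in GB/s.
	sockets := [][]float64{
		{40.0, 42.5, 39.0},
		{38.0, 41.0, 40.5},
	}
	fmt.Println(sumToNodeScope(sockets)) // [78 83.5 79.5]
}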

@spacehamster87 spacehamster87 self-assigned this Jan 18, 2023
@fodinabor
Contributor Author

Some comments:

re 2.: we have the topology in our cluster.json, but we don't set hwthreads when we /start and /stop the jobs. So I guess there are two options here: 1. achieve the same level of workaround as in your option 1. by sending /start a list of [0, numthreads) as hwthreads (see the sketch below), or 2. consider pinning the threads and just sending that info to CC.
re 3.: the mem_bw metrics are collected at socket level (for some AMD nodes the LIKWID group converter apparently set it to hwthread; I changed it to socket now), double-checked that a few days ago.
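
For the first of those two options, the resources entry sent with /start could simply enumerate all hardware threads of the node. A sketch for a hypothetical eight-thread node; the resources, hostname and hwthreads field names are taken from the job JSON mentioned above, the rest of the payload is omitted:

"resources": [
    {
        "hostname": "thor",
        "hwthreads": [0, 1, 2, 3, 4, 5, 6, 7]
    }
]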

We're still using core, but the schema also mentions core, not hwthread?
https://github.com/ClusterCockpit/cc-backend/blob/master/pkg/schema/schemas/cluster.schema.json#L167

@spacehamster87
Contributor

Hi @fodinabor,

Sorry that this issue has been stalled for some time now! As you've seen, we've been working hard to reach a solid release state.

With the recent 1.0.0 release and today's minor 1.1.0 update, I therefore wanted to ask whether the issue still persists, or whether you have found a solution on your side in the meantime.

@fodinabor
Contributor Author

Hi @spacehamster87,
so far, we are only using mem_bw with the level set to node...
Were there changes that might make it worth retesting with socket granularity?
