Job metric query forgets mem_bw #69

Open
fodinabor opened this issue Nov 24, 2022 · 11 comments

@fodinabor
Contributor

I have a weird issue where cc-backend neither provides nor archives mem_bw.

Consider the following query (used on the job list view and the single job view, afaict):
/api/jobs/metrics/1685?metric=flops_any&metric=mem_bw&metric=cpu_load&metric=cpu_user&metric=mem_used&metric=clock&metric=cpu_power&metric=acc_utilization&metric=acc_mem_used&metric=acc_power&metric=disk_free&metric=net_bytes_in&metric=net_bytes_out&metric=nfs4_total&metric=nfs3_total&scope=node&scope=core

It returns the dump over here, from which mem_bw is obviously missing.

On the systems view, mem_bw is indeed shown, though.
/query is called with

{
  "query": "query ($cluster: String!, $nodes: [String!], $from: Time!, $to: Time!) {\n  nodeMetrics(cluster: $cluster, nodes: $nodes, from: $from, to: $to) {\n    host\n    subCluster\n    metrics {\n      name\n      metric {\n        timestep\n        scope\n        series {\n          statistics {\n            min\n            avg\n            max\n          }\n          data\n        }\n      }\n    }\n  }\n}\n",
  "variables": {
    "cluster": "test",
    "nodes": [
      "thera"
    ],
    "from": "2022-11-24T07:59:17.196Z",
    "to": "2022-11-24T08:29:17.196Z"
  }
}

The returned JSON can be seen here.

Note that in an archived job from that machine, mem_bw is also missing, see here.

My cc-metric-store config contains:

"mem_bw":           { "frequency": 60, "aggregation": "sum" },

cluster.json:

        {
            "name": "mem_bw",
            "scope": "socket",
            "unit": "GB/s",
            "timestep": 60,
            "aggregation": "sum",
            "peak": 350,
            "normal": 100,
            "caution": 50,
            "alert": 10
        },
@moebiusband73
Member

Which metric data backend do you use?
The only idea I have is that mem_bw is already missing there, but you said it is shown in the systems view.
I can only speculate; maybe the socket scope is missing?
Is there anything in the log?

@fodinabor
Contributor Author

I use cc-metric-store.
The collector is configured with, e.g., the following, which I'd say should provide the socket scope?

"metrics": [
                    {
                        "calc": "1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time",
                        "name": "mem_bw",
                        "publish": true,
                        "unit": "MB/s",
                        "scope": "socket"
                    }
                ]

Last time I checked, I didn't get any related errors; now I'm getting the following. For /api/jobs/metrics/1636 it complains about the missing data, while further down, the systems-view query is again happy (except for the missing cpu_power, which I indeed do not currently collect on that node).

Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /monitoring/job/1636 (200, 1.25kb, 1ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /global.css (200, 0.48kb, 0ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /uPlot.min.css (200, 0.76kb, 0ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/job.css (200, 0.32kb, 0ms)
Nov 25 12:08:33 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/job.js (200, 95.86kb, 156ms)
Nov 25 12:08:34 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 2.64kb, 1ms)
Nov 25 12:08:34 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [ERROR]   partial error: cc-metric-store: failed to fetch 'mem_bw' from host 'thera': metric or host not found, failed to fetch 'mem_bw' from host 'thera': metric or host not found, failed to fetch 'm>
Nov 25 12:08:35 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [INFO]    GET /api/jobs/metrics/1636?metric=flops_any&metric=mem_bw&metric=cpu_load&metric=cpu_user&metric=mem_used&metric=clock&metric=cpu_power&metric=acc_utilization&metric=acc_mem_used&metric=acc_>
Nov 25 12:08:50 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /img/logo.png (200, 15.67kb, 1ms)
Nov 25 12:08:51 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /favicon.png (200, 10.46kb, 0ms)
Nov 25 12:08:57 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [INFO]    map[analysis_view_histogramMetrics:[flops_any mem_bw acc_utilization] analysis_view_scatterPlotMetrics:[[flops_any mem_bw] [flops_any cpu_load] [cpu_load mem_bw]] job_view_nodestats_selected>
Nov 25 12:08:57 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /monitoring/node/test/thera (200, 1.26kb, 1ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /global.css (200, 0.48kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /uPlot.min.css (200, 0.76kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/node.css (200, 0.11kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/node.js (200, 70.80kb, 163ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /img/logo.png (200, 15.67kb, 1ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 2.39kb, 4ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [ERROR]   partial error: cc-metric-store: fetching cpu_power for node thera failed: metric or host not found
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 1.11kb, 5ms)

@spacehamster87
Contributor

spacehamster87 commented Jan 13, 2023

Hi again @fodinabor! I am investigating this issue at the moment, and I think I am onto something.

Could you please run your job-query as /api/jobs/metrics/1685?, i.e. without any query parameters, and then check for the mem_bw field?

The good news so far is that the issue does not seem to be connected to your cluster/metric configuration, as that would prevent the systems view from successfully requesting/displaying the data (at least that's my current insight).

@fodinabor
Contributor Author

Hi @spacehamster87, thanks for investigating!
If I run that query without any parameters, I don't get mem_bw either, see the gist.

@spacehamster87
Contributor

Thanks for the feedback! I should have found your gist post myself, actually ... it was worth a shot.

The underlying queries of the systems/status view and the direct jobs/metrics/{id} API both use GraphQL afaict, but with slightly different methods in the backend, which might be the reason for the two different results.

Namely LoadNodeData() @ metricdata/metricdata.go:211 for systems/status and LoadData() @ metricdata/metricdata.go:78 for the jobs/metrics/{id}-API.

I'll dig some more.

@spacehamster87
Contributor

spacehamster87 commented Jan 16, 2023

After more digging, reproducing the error/case, and more logging in cc-metric-store, I think I have pinpointed the problem:

  • If querying in the job view, the smallest granularity defined in the cluster.json is requested; in mem_bw's case: socket. But if the requested granularity cannot be provided by cc-metric-store, it returns the aforementioned error instead.
  • The systems view always and only requests the always-available node granularity for each hostname, and thus also returns data for mem_bw.

This also happens when archiving starts and requests the latest data to write, in which case mem_bw will not return data either.

To verify this, please try the following:

  1. Set mem_bw granularity in the cluster.json to node, restart cc-backend, then check the query result.
  2. With mem_bw granularity set to socket in the cluster.json, query the API with /api/jobs/metrics/1685?metric=mem_bw&scope=socket (a command-line sketch follows below).
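
A minimal sketch of check 2. as a command line, assuming the REST API is reachable at https://your-cc-backend:8080 and protected with a JWT bearer token in $CC_JWT (both are placeholders; only the path and query parameters come from the step above):

# Hypothetical host and token; only the endpoint path and parameters are from check 2. above.
curl -H "Authorization: Bearer $CC_JWT" \
  "https://your-cc-backend:8080/api/jobs/metrics/1685?metric=mem_bw&scope=socket"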

The fact that cc-metric-store returns an error and no usable data if the requested granularity does not match should definitely be handled via a new issue in the respective repo.

@fodinabor
Contributor Author

fodinabor commented Jan 17, 2023

Yup, setting it to the node level works for new jobs.

Since job 1685 is archived (and it did not archive mem_bw), the query just returns empty.
For another (currently running) job, with cluster.json's mem_bw scope set to socket, /api/jobs/metrics/8129?metric=mem_bw&scope=socket returns:

{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found, failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}

With cluster.json's mem_bw scope set to node the same query returns:

{"data":{"jobMetrics":[{"name":"mem_bw","metric":{"unit":"GB/s","scope":"node","timestep":60,"series":[{"hostname":"thor","statistics":{"min":0.60,"avg":0.71,"max":1.50},"data":[0.80,0.60,1.00,0.60,0.60,0.60,0.60,0.60,0.90,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,1.10,0.60,0.60,0.60,0.60,0.60,0.60,0.70,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.90,0.70,0.70,0.70,0.90,0.70,0.70,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.10,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.80,0.80,0.70,1.00,0.70,0.70,0.70,0.80,0.60,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.20,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.80,0.70,0.70,1.00,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.50,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,1.00,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,null,null,null,0.70,0.70,0.70,0.70,0.70,0.70,0.67,0.69,0.66,0.66,0.67,0.68,0.85,0.66,0.67,0.68,0.67,0.66,0.66,0.68,0.65]}],"statisticsSeries":null}}]},"error":null}

Might that be related to us not setting the hwthreads?
I.e. /query's job/resources/0/hwthreads is null.

We don't set that since we (currently) do not pin jobs to threads. Alternatively, we could of course just set it to a list of all hwthreads, getting node granularity after all... 🤷🏼

Edit:
Interestingly, setting the mem_bw scope to socket and just querying the node scope, /api/jobs/metrics/8129?metric=mem_bw&scope=node, also fails:

{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}

@spacehamster87
Contributor

spacehamster87 commented Jan 18, 2023

Hi again! With this new information, the issue seems to be connected to your config after all: the topology configuration in cluster.json should resemble this example, especially regarding the arrays for node, socket, memoryDomain and hwthread. The latter is still mentioned as core, but was renamed a while back; the linked example seems out of date ...

So for now, I see the following options to solve this issue:

  1. Use the node scope for mem_bw - which is more of a workaround than a solution.
  2. Add hwthreads to the topology (a minimal sketch follows this list) and re-check the configuration files of the whole stack. We are happy to have a look as well if you can provide your files.
  3. Check which granularity is sent by the cc-metric-collector by directly querying the cc-metric-store API.
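
For illustration, a minimal topology sketch for a hypothetical node with two sockets and eight hardware threads; the CPU IDs are made up, and whether the thread-level array is named core or hwthread depends on the schema version (see below):

        "topology": {
            "node": [0, 1, 2, 3, 4, 5, 6, 7],
            "socket": [[0, 1, 2, 3], [4, 5, 6, 7]],
            "memoryDomain": [[0, 1, 2, 3], [4, 5, 6, 7]],
            "core": [[0], [1], [2], [3], [4], [5], [6], [7]]
        }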

As for your edit: As soon as a smaller scope than node is set in the config files, cc-backend will try to request that scope, and then calculate the "actually requested" scope from the returned data. This is probably why socket as a requested scope for mem_bw fails, as it requires hwthread-data.
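
To illustrate that aggregation step (a rough sketch only, not cc-backend's actual implementation): with mem_bw configured as "aggregation": "sum", per-socket series would be collapsed into a single node-scope series roughly like this:

package main

import "fmt"

// Rough sketch only (not cc-backend's actual code): collapse per-socket
// series into a single node-scope series using the "sum" aggregation
// configured for mem_bw in cluster.json.
func sumToNodeScope(socketSeries [][]float64) []float64 {
	if len(socketSeries) == 0 {
		return nil
	}
	node := make([]float64, len(socketSeries[0]))
	for _, series := range socketSeries {
		for i, v := range series {
			node[i] += v
		}
	}
	return node
}

func main() {
	// Two sockets, three timesteps of hypothetical mem_bw values in GB/s.
	sockets := [][]float64{
		{40.0, 42.5, 39.0},
		{38.0, 41.0, 40.5},
	}
	fmt.Println(sumToNodeScope(sockets)) // [78 83.5 79.5]
}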

@spacehamster87 spacehamster87 self-assigned this Jan 18, 2023
@fodinabor
Contributor Author

Some comments:

re 2.: we have the topology in our cluster.json, but we don't set hwthreads when we /start and /stop the jobs. So I guess there are two options here: 1. achieve the same level of workaround as in your option 1. by sending /start a list of [0, numthreads) as hwthreads (see the sketch below), or 2. consider pinning the threads and just sending that info to CC.
re 3.: the mem_bw metrics are collected at socket level (for some AMD nodes the LIKWID group converter apparently set it to hwthread; I changed it to socket now), double-checked that a few days ago.
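
For the first of those two options, the resources entry sent with /start could simply enumerate all hardware threads of the node. A sketch for a hypothetical eight-thread node; the resources, hostname and hwthreads field names are taken from the job JSON mentioned above, the rest of the payload is omitted:

"resources": [
    {
        "hostname": "thor",
        "hwthreads": [0, 1, 2, 3, 4, 5, 6, 7]
    }
]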

We're still using core, but the schema also mentions core, not hwthread?
https://github.com/ClusterCockpit/cc-backend/blob/master/pkg/schema/schemas/cluster.schema.json#L167

@spacehamster87
Contributor

Hi @fodinabor,

Sorry that this issue has been stalled for some time now! As you've seen, we've been working hard to reach a solid release state.

With the recent 1.0.0 release and today's minor 1.1.0 update, I therefore wanted to ask whether the issue still persists, or whether you have found a solution on your side in the meantime.

@fodinabor
Contributor Author

Hi @spacehamster87,
so far, we are only using mem_bw with the level set to node...
Were there changes that might make it worth retesting with socket granularity?
