WMStats details

Valentin Kuznetsov edited this page Jan 9, 2023 · 5 revisions

WMStats info

This document summarizes different topics about the WMStats server, following issue#11411. Below is the list of action items required for the design discussion:

  • [DONE] current data stored in the database (including required vs likely non-required data)
  • [DONE] how data gets published to the wmstats database
  • [DONE] functionalities provided by WMStats (e.g., ACDC creation)
  • [DONE] APIs used to load data into WMStats
  • [DONE] data structure of the data loaded into WMStats
  • [DONE] weak and/or missing functionalities
  • [DONE] proposal of new WMStats UI server implementation

WMStats information

The WMStats server is deployed on the cmsweb cluster and provides data via the following URLs:

# WMStats cache server
scurl -s https://cmsweb.cern.ch/wmstatsserver

# WMStats info API
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/info

# cache (most heavy) API which returns everything
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/requestcache

# job detail API example
https://cmsweb-testbed.cern.ch/wmstatsserver/data/jobdetail/tivanov_TC_6Tasks_Scratch_HG2301_Val_230109_081126_7143

The WMStats server APIs are defined in RestApihub.py

It stores data in, and fetches data from, the underlying CouchDB. CouchDB stores unstructured or semi-structured data in JSON format and does not impose any schema on stored data. Therefore, the structure of the stored data is driven by external services, e.g. the WMAgent info can be obtained via this URL call:

https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl7/_view/agentInfo

while workflow information can be obtained as follows:

scurl -s -X POST -H "Content-type: application/json" -d '{"keys": ["pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816"]}' "https://cmsweb.cern.ch:8443/couchdb/reqmgr_workload_cache/_all_docs?include_docs=True" | jq

The CouchDB reqmgr_workload_cache database stores the high-level description of a workflow, while other parts of the information come from the wmstats CouchDB database, mostly related to workflow evolution tasks and jobs, see DataCacheUpdate.py.

In both cases above, the structure of stored documents is defined somewhere in the WMCore code base. That said, the structure of documents within a single collection remains the same, e.g. all workflow documents have common keys/values, but not all keys are mandatory. For example, the Task sub-structure of a workflow document only appears if the given workflow has tasks; otherwise it is omitted.
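Because some keys are optional, client code should not assume that every sub-structure is present. A minimal sketch, assuming illustrative documents in which "RequestName", "Task1" and "TaskName" are placeholder keys chosen for this example (the real key set is defined in WMCore):

```python
# Two illustrative workflow documents; only the second one has a task section.
# The keys used here are placeholders for this sketch, not a schema definition.
docs = [
    {"RequestName": "workflow_without_tasks"},
    {"RequestName": "workflow_with_tasks", "Task1": {"TaskName": "GenSim"}},
]


def task_names(doc):
    """Return the names of all Task* sub-structures, or an empty list if none exist."""
    names = []
    for key, value in doc.items():
        # Optional sub-structures may be absent, so test before reading them.
        if key.startswith("Task") and isinstance(value, dict):
            names.append(value.get("TaskName", "unknown"))
    return names


for doc in docs:
    print(doc["RequestName"], task_names(doc))
```

Reading optional parts through a helper like this keeps the same code path working for workflows with and without tasks.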

How data gets published to the WMStats database

The WMStats database resides in CouchDB and is called reqmgr_workload_cache. The data flowing into this database comes from the ReqMgr2 service, which inserts the initial data about request workflows; the documents are later updated by ReqMgr2 (during request transitions), GlobalWorkqueue and WMAgent. The workflow JSON document comes from an initial standard spec, see the StdSpecs area. Each workflow request is created via the ReqMgr service, which provides create/update methods, see Request.py. For more information about data flow in CouchDB please refer to this graph.

Functionalities provided by WMStats UI server

The WMStats UI server provides the following functionalities:

  • list of all known workflows in the system along with their aggregated information such as number of processed events, lumis, failure rate, etc.
  • current status of all WMAgents
  • aggregated information about Campaigns, Sites, CMSSW releases and WMAgents
  • information about failures in various requests
  • information about logs of individual requests
  • various search and filter capabilities for requests, campaigns, sites, releases, etc.

WMStats code base and APIs

WMStats reader

The WMStats server code comes from the DataCacheUpdate.py module, which calls WMStatsReader.py.

The individual data comes from different APIs, e.g. RequestDBReader.getRequestByStatus, where RequestDBReader.py calls the CouchDB bystatus view.

For example (here and below we use cmsweb-test9 as an example; the same URL paths can be applied directly to cmsweb or cmsweb-testbed, which will simply provide more data):

# get list of workflows:
scurl -s https://cmsweb-test9.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?include_docs=false
{"total_rows":47,"offset":0,"rows":[
{"id":"amaltaro_DQMHarvest_RunWhitelist_Agent214_Val_221216_110404_1633","key":"assigned","value":1671188646},
{"id":"amaltaro_ReReco_RunBlockWhite_Agent214_Val_221216_110415_6984","key":"assigned","value":1671188657},
...
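The rows returned by the bystatus view carry the workflow name in "id" and the status in "key". A minimal sketch of grouping them on the client side, using the two rows shown above as input (the helper name workflows_by_status is made up for this example):

```python
from collections import defaultdict

# Rows copied from the bystatus output above (truncated to two entries).
response = {
    "total_rows": 47,
    "offset": 0,
    "rows": [
        {"id": "amaltaro_DQMHarvest_RunWhitelist_Agent214_Val_221216_110404_1633",
         "key": "assigned", "value": 1671188646},
        {"id": "amaltaro_ReReco_RunBlockWhite_Agent214_Val_221216_110415_6984",
         "key": "assigned", "value": 1671188657},
    ],
}


def workflows_by_status(view_response):
    """Group workflow names (row id) by request status (row key)."""
    grouped = defaultdict(list)
    for row in view_response.get("rows", []):
        grouped[row["key"]].append(row["id"])
    return dict(grouped)


for status, names in workflows_by_status(response).items():
    print(status, len(names))
```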

To get agent info we call the following view:

scurl -s https://cmsweb-test9.cern.ch/couchdb/wmstats/_design/WMStatsErl7/_view/agentInfo | jq | head -20
{
  "total_rows": 9,
  "offset": 0,
  "rows": [
    {
      "id": "global_workqueue",
      "key": "global_workqueue",
      "value": {
        "_id": "global_workqueue",
        "_rev": "681-6a1bece20a50c1c56b6bea95fb29e961",
        "agent_url": "global_workqueue",
        "agent_team": "",
        "agent_version": "2.1.6rc3",
        "timestamp": 1671462286,
        "down_components": [],
        "type": "agent_info",
        "down_component_detail": [],
        ...
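The agentInfo rows can be scanned for agents that report down components. A minimal sketch, assuming the response layout shown above; the second row in the sample is a hypothetical agent added only to exercise the non-empty case:

```python
# First row follows the agentInfo output above; the second row is hypothetical.
response = {
    "rows": [
        {"id": "global_workqueue", "key": "global_workqueue",
         "value": {"agent_url": "global_workqueue", "down_components": [],
                   "type": "agent_info"}},
        {"id": "agent.example.org", "key": "agent.example.org",
         "value": {"agent_url": "agent.example.org",
                   "down_components": ["JobSubmitter"], "type": "agent_info"}},
    ]
}


def agents_with_down_components(view_response):
    """Map agent_url to its list of down components, skipping healthy agents."""
    down = {}
    for row in view_response.get("rows", []):
        info = row.get("value", {})
        components = info.get("down_components", [])
        if components:
            # Fall back to the row id when agent_url is missing.
            down[info.get("agent_url", row.get("id"))] = components
    return down


print(agents_with_down_components(response))
```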

WMStats writer

The WMStatsWriter.py is responsible for writing data to CouchDB. This class is used by different WMCore components like:

CouchDB views

To understand which views are used, we can find them in the CouchDB log:

grep _design /data/srv/logs/couchdb/couch.log | grep couchdb.couchdb | grep -v logdb | awk '{print $9,$10}' | head

or we can list all views in a particular DB: https://cmsweb.cern.ch:8443/couchdb/wmstats/_design_docs

The URI to query a view's result is /database/_design/designdocname/_view/viewname. For example, the design document itself can be inspected as follows:

scurl https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl | jq
{
  "_id": "_design/WMStatsErl",
  "_rev": "10-01ea14e3d6042a7e0a480e0b9e485c3b",
  "language": "erlang",
  "views": {
    "requestAgentUrl": {
      "map": "fun({Doc}) ->\n  DocType = couch_util:get_value(<<\"type\">>, Doc),\n  case DocType of\n    undefined -> ok;\n    <<\"agent_request\">> ->\n      AgentUrl = couch_util:get_value(<<\"agent_url\">>, Doc),\n      Workflow = couch_util:get_value(<<\"workflow\">>, Doc),\n      Emit([Workflow, AgentUrl], null);\n    _ -> ok\n  end\nend.",
      "reduce": "_count"
    }
  },
  "couchapp": {
    "manifest": [
      "language",
      "views/",
      "views/requestAgentUrl/",
      "views/requestAgentUrl/map.erl",
      "views/requestAgentUrl/reduce.erl"
    ],
    "objects": {},
    "signatures": {}
  }
}

Here the designdocname is WMStatsErl and the view name is requestAgentUrl. Each view has map and reduce sections in the document. The map section stores the map function, which can be written either in JavaScript or Erlang (the language CouchDB is written in), while the reduce section provides the reduce function used together with the given map; it can be omitted.
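To make the map/reduce pair more concrete, the sketch below re-expresses the requestAgentUrl map function in Python and emulates the built-in _count reduce, with and without grouping. It is a simplified emulation for illustration only: it runs on a hand-made list of documents, not against CouchDB, and it does not reproduce CouchDB's key collation order.

```python
from collections import Counter


def map_request_agent_url(doc):
    """Python rendering of the Erlang map function shown above."""
    if doc.get("type") == "agent_request":
        # Emit key [workflow, agent_url] with a null value.
        yield [doc.get("workflow"), doc.get("agent_url")], None


def run_view(docs, group=False):
    """Emulate the _count reduce over the emitted rows."""
    rows = [row for doc in docs for row in map_request_agent_url(doc)]
    if not rows:
        return []
    if not group:
        # Without grouping, all rows are reduced to a single count.
        return [{"key": None, "value": len(rows)}]
    # With group=true, rows are counted per distinct key.
    counts = Counter(tuple(key) for key, _ in rows)
    return [{"key": list(key), "value": n} for key, n in counts.items()]


# Hand-made documents for illustration only.
docs = [
    {"type": "agent_request", "workflow": "wf_A", "agent_url": "agent1"},
    {"type": "agent_request", "workflow": "wf_A", "agent_url": "agent1"},
    {"type": "agent_request", "workflow": "wf_B", "agent_url": "agent2"},
    {"type": "agent_info", "agent_url": "agent1"},
    {"workflow": "wf_C"},
]

print(run_view(docs))
print(run_view(docs, group=True))
```

Documents of any other type, or without a type at all, emit nothing, which mirrors the `undefined -> ok` and `_ -> ok` branches of the Erlang code.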

To list all docs in a given view we do:

https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl1/_view/byAgentURL

Or we can pass a filter to get specific keys:

scurl https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl1/_view/byAgentURL -X POST -H "Content-type: application/json" -d '{"keys": ["vocms0281.cern.ch"]}'

Or we can view the grouped result:

https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl3/_view/jobsByStatusWorkflow?group=true
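The three call styles above (plain listing, key filter, grouping) follow the same URI pattern, so they can be produced by one small helper. A minimal sketch; the function name view_request is made up for this example, and it only prepares the request without sending it:

```python
import json
from urllib.parse import quote, urlencode


def view_request(base, database, design, view, keys=None, **params):
    """Return (method, url, body) for a CouchDB view query."""
    parts = (database, "_design", design, "_view", view)
    path = "/".join(quote(part, safe="") for part in parts)
    url = f"{base.rstrip('/')}/{path}"
    if params:
        # Boolean and numeric options are JSON-encoded, e.g. group=true.
        query = {name: json.dumps(value) for name, value in params.items()}
        url += "?" + urlencode(query)
    if keys is None:
        return "GET", url, None
    # A key filter is sent as a JSON body in a POST request.
    return "POST", url, json.dumps({"keys": keys})


base = "https://cmsweb.cern.ch:8443/couchdb"
print(view_request(base, "wmstats", "WMStatsErl3", "jobsByStatusWorkflow", group=True))
print(view_request(base, "wmstats", "WMStatsErl1", "byAgentURL", keys=["vocms0281.cern.ch"]))
```

The returned tuple can be handed to any HTTP client that carries the required grid credentials, as scurl does in the examples above.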

WMStats documents

The actual document about a given workflow, which we can see on the ReqMgr2 page

https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816

can be fetched directly from CouchDB as follows:

scurl -s -X POST -H "Content-type: application/json" -d '{"keys": ["pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816"]}' "https://cmsweb.cern.ch:8443/couchdb/reqmgr_workload_cache/_all_docs?include_docs=True" | jq
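The _all_docs call returns one row per requested key; with include_docs the document sits under "doc", while a key that does not exist comes back as a row with an "error" field instead. A minimal sketch of separating the two cases, using a hand-made response with placeholder names:

```python
# Hand-made response for illustration: one existing and one missing key.
response = {
    "rows": [
        {"id": "wf1", "key": "wf1", "value": {"rev": "1-abc"},
         "doc": {"_id": "wf1", "RequestName": "wf1"}},
        {"key": "wf2", "error": "not_found"},
    ]
}


def docs_from_all_docs(all_docs_response):
    """Split an _all_docs response into found documents and missing keys."""
    found, missing = {}, []
    for row in all_docs_response.get("rows", []):
        doc = row.get("doc")
        if doc is None:
            # Unknown keys (and deleted documents) carry no document body.
            missing.append(row.get("key"))
        else:
            found[row["id"]] = doc
    return found, missing


found, missing = docs_from_all_docs(response)
print(sorted(found), missing)
```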

Data structure of the data loaded into WMStats

The WMStats UI server gets its data from the WMStats cache server. The latter provides the full set of JSON docs as a single list, which leads to a significant payload size.

A single WMStats JSON record is quite large and its data structure can be seen over here. It consists of various attributes describing the current workflow state, e.g. its name, prepID, Tasks, etc. All attributes follow the CamelCase naming convention except AgentJobInfo, which mixes CamelCase with underscore naming, e.g.

    "AgentJobInfo": {
      "cmsgwms-submit6.fnal.gov": {
        "_id": "cmsgwms-submit6.fnal.gov-cmsunified_ACDC0_task_EXO-RunIISummer20UL18wmLHEGEN-00824__v1_T_220425_125613_9010",
        "_rev": "274-4282735881",
        "agent_team": "production",
        "agent_version": "2.0.2.patch1",
        "agent": "WMAgent",
        "agent_url": "cmsgwms-submit6.fnal.gov",
        "type": "agent_request",
        "workflow": "cmsunified_ACDC0_task_EXO-RunIISummer20UL18wmLHEGEN-00824__v1_T_220425_125613_9010",
        "status": {
          "inQueue": 2
        },
        ...
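Since AgentJobInfo is keyed by agent, per-workflow totals require summing the status counters over all agents. A minimal sketch, assuming the layout shown above; the second agent and its "success" counter are hypothetical, and nested counters (if any exist) are merged recursively:

```python
def merge_counts(total, update):
    """Add the counters from update into total, descending into nested dicts."""
    for name, value in update.items():
        if isinstance(value, dict):
            merge_counts(total.setdefault(name, {}), value)
        else:
            total[name] = total.get(name, 0) + value
    return total


def aggregate_status(record):
    """Sum the per-agent status counters of one WMStats record."""
    total = {}
    for agent_doc in record.get("AgentJobInfo", {}).values():
        merge_counts(total, agent_doc.get("status", {}))
    return total


# First agent follows the example above; the second one is hypothetical.
record = {
    "AgentJobInfo": {
        "cmsgwms-submit6.fnal.gov": {"status": {"inQueue": 2}},
        "agent2.example.org": {"status": {"inQueue": 3, "success": 5}},
    }
}

print(aggregate_status(record))
```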

Weak and/or missing functionalities

The production WMStats server data is quite large, e.g.

scurl -s https://cmsweb.cern.ch/wmstatsserver/data/requestcache

takes 1.5 min to fetch and its size is around 300MB. Loading such a large JSON into RAM requires at least 2-3 times the JSON size regardless of the underlying language. Therefore, such an HTTP call not only blocks the frontend (FE), but also introduces significant overhead on the client side (WMStats UI). It is therefore desirable to modify the WMStats server to provide the following:

  • change its requestcache API to return a subset of data using idx/limit, a.k.a. pagination, e.g. return only 10 records
  • change the requestcache API to provide ndjson format to support data streaming
  • allow gzip encoding in HTTP requests
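The three proposals can be sketched together in a few lines. This is an illustration of the ideas only, not the WMStats server implementation: the function names and the sample records are made up, and the idx/limit parameter names are taken from the first bullet.

```python
import gzip
import json


def paginate(records, idx=0, limit=10):
    """Return at most limit records starting at position idx."""
    return records[idx:idx + limit]


def to_ndjson(records):
    """Yield one JSON document per line, so clients can parse while streaming."""
    for record in records:
        yield json.dumps(record) + "\n"


def gzip_body(lines):
    """Compress the serialized lines, as a gzip-encoded HTTP response would be."""
    return gzip.compress("".join(lines).encode("utf-8"))


# Made-up records standing in for the requestcache content.
records = [{"RequestName": f"workflow_{i}"} for i in range(25)]

page = paginate(records, idx=10, limit=10)
body = gzip_body(to_ndjson(page))

# Client side: decompress and parse line by line.
decoded = [json.loads(line) for line in gzip.decompress(body).decode("utf-8").splitlines()]
print(len(decoded), decoded[0])
```

With ndjson the client never has to hold the whole payload as one JSON object, which addresses the memory overhead described above.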

Also, the WMStats UI server needs to show aggregated data such as job, event and lumi progress, failure rate, etc. These metrics are calculated via JavaScript functions residing in the WMStats UI server, see the WMStats/_attachments/js/Views area for all functions. Such calculations are performed all the time on the WMStats UI page, which leads to additional latency (which grows with the number of existing workflows in the system). It would be more appropriate to either shift this functionality to the WMStats cache server, or introduce appropriate views within CouchDB and cache them in the WMStats cache server. This would allow a lightweight implementation of the WMStats UI server which only fetches and displays the data.
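One possible shape for server-side caching of such metrics is a small time-based cache that recomputes an aggregate only after it expires. The sketch below is an assumption about how this could look, not existing WMStats code; the "failure" and "success" counter names and the failure-rate formula are placeholders.

```python
import time


class CachedAggregate:
    """Recompute a value at most once per ttl seconds."""

    def __init__(self, compute, ttl=60.0, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires = None

    def get(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            # Cache is empty or stale: recompute and restart the timer.
            self._value = self._compute()
            self._expires = now + self._ttl
        return self._value


def failure_rate(status):
    """Fraction of failed jobs; the counter names are placeholders."""
    failed = status.get("failure", 0)
    total = failed + status.get("success", 0)
    return failed / total if total else 0.0


cache = CachedAggregate(lambda: failure_rate({"failure": 1, "success": 9}), ttl=60.0)
print(cache.get())
```

The UI would then fetch the cached value instead of recalculating it on every page load.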

New implementation of WMStats UI server

You may find the full proposal for the new implementation of the WMStats UI server over here
