WMStats details
This document summarizes various topics about the WMStats server, following issue#11411. Below is the list of action items required for the design discussion:
- [DONE] current data stored in the database (including required vs likely non-required data)
- [DONE] how data gets published to the wmstats database
- [DONE] functionalities provided by WMStats (e.g., ACDC creation)
- [DONE] APIs used to load data into WMStats
- [DONE] data structure of the data loaded into WMStats
- [DONE] weak and/or missing functionalities
- [DONE] proposal of new WMStats UI server implementation
The WMStats server is deployed on the cmsweb cluster and provides data via the following URLs:
# WMStats cache server
scurl -s https://cmsweb.cern.ch/wmstatsserver
# WMStats info API
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/info
# cache (most heavy) API which returns everything
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/requestcache
# job detail API example
https://cmsweb-testbed.cern.ch/wmstatsserver/data/jobdetail/tivanov_TC_6Tasks_Scratch_HG2301_Val_230109_081126_7143
The WMStats server APIs are defined in RestApihub.py.
The server stores and fetches its data from the underlying CouchDB. CouchDB stores unstructured or semi-structured data in JSON format and does not impose any schema on the stored data. Therefore, the structure of the stored data is driven by external services. For example, the WMAgent info can be obtained via this URL call:
https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl7/_view/agentInfo
while workflow information can be obtained as follows:
scurl -s -X POST -H "Content-type: application/json" -d '{"keys": ["pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816"]}' "https://cmsweb.cern.ch:8443/couchdb/reqmgr_workload_cache/_all_docs?include_docs=True" | jq
The CouchDB reqmgr_workload_cache database stores the high-level description of a workflow, while the rest of the information, mostly related to the evolution of the workflow's tasks and jobs, comes from the wmstats CouchDB database, see DataCacheUpdate.py.
In both cases above, the structure of the stored documents is defined somewhere in the WMCore code base. That said, the structure of documents within a single collection remains the same, e.g. all workflow documents share common keys/values, but not all keys are mandatory. For example, the Task sub-structure of a workflow document only appears if the given workflow has tasks; otherwise it is omitted.
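The `_all_docs` call above returns one row per requested key, and optional parts of a document have to be read defensively. A minimal sketch of how a client might handle this is shown below. It only builds the request body and parses the response, because the HTTP call and its authentication are site specific. The sample response is made up and heavily trimmed, and the `RequestStatus`, `Task1` and `TaskName` keys are assumptions used purely for illustration.

```python
import json


def all_docs_payload(workflow_names):
    """Build the JSON body for a POST to <db>/_all_docs?include_docs=true."""
    return json.dumps({"keys": list(workflow_names)})


def extract_docs(response):
    """Return {document id: document} for every row that carries a document."""
    docs = {}
    for row in response.get("rows", []):
        doc = row.get("doc")
        # Rows for unknown keys carry an "error" field instead of a document.
        if doc is not None:
            docs[row["id"]] = doc
    return docs


# Hypothetical, trimmed response used only for illustration.
sample = {
    "total_rows": 2,
    "offset": 0,
    "rows": [
        {
            "id": "wf_a",
            "key": "wf_a",
            "value": {"rev": "1-abc"},
            "doc": {
                "_id": "wf_a",
                "RequestStatus": "assigned",
                "Task1": {"TaskName": "GenSim"},
            },
        },
        {"key": "wf_missing", "error": "not_found"},
    ],
}

docs = extract_docs(sample)
# Optional sub-structures are read with .get() because not all keys are mandatory.
task = docs["wf_a"].get("Task1")
no_task = docs["wf_a"].get("Task2")
```

Reading optional keys with `.get()` keeps the client working for workflows that have no task sub-structure at all.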
The WMStats database resides in CouchDB and is called reqmgr_workload_cache. The data in this database comes from the ReqMgr2 service, which inserts the initial data about request workflows; the documents are later updated by ReqMgr2 (during request transitions), GlobalWorkqueue and WMAgent. The workflow JSON document comes from an initial standard spec, see the StdSpecs area. Each workflow request is created via the ReqMgr service, which provides create/update methods, see Request.py. For more information about the data flow in CouchDB please refer to this graph.
The WMStats UI server provides the following functionalities:
- list of all known workflows in the system along with their aggregated information, such as number of processed events, lumis, failure rate, etc.
- current status of all WMAgents
- aggregated information about Campaigns, Sites, CMSSW releases and WMAgents
- information about failures in various requests
- information about logs of individual requests
- various search and filter capabilities for requests, campaigns, sites, releases, etc.
The WMStats server code comes from the DataCacheUpdate.py module, which calls WMStatsReader.py.
The individual data comes from different APIs, e.g. RequestDBReader.getRequestByStatus, where RequestDBReader.py calls the CouchDB bystatus view.
For example (here and below we use cmsweb-test9 as an example; the same URL can be applied directly to cmsweb or cmsweb-testbed, which will simply return more data):
# get list of workflows:
scurl -s https://cmsweb-test9.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?include_docs=false
{"total_rows":47,"offset":0,"rows":[
{"id":"amaltaro_DQMHarvest_RunWhitelist_Agent214_Val_221216_110404_1633","key":"assigned","value":1671188646},
{"id":"amaltaro_ReReco_RunBlockWhite_Agent214_Val_221216_110415_6984","key":"assigned","value":1671188657},
...
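Each row of the bystatus view carries the workflow name as `id` and the request status as `key`. A minimal sketch of how a client could regroup such a result by status is given below; the sample reuses the two rows shown above, and the helper name is ours rather than part of WMCore.

```python
from collections import defaultdict


def workflows_by_status(view_result):
    """Group workflow names (row "id") by request status (row "key")."""
    grouped = defaultdict(list)
    for row in view_result.get("rows", []):
        grouped[row["key"]].append(row["id"])
    return dict(grouped)


sample = {
    "total_rows": 47,
    "offset": 0,
    "rows": [
        {"id": "amaltaro_DQMHarvest_RunWhitelist_Agent214_Val_221216_110404_1633",
         "key": "assigned", "value": 1671188646},
        {"id": "amaltaro_ReReco_RunBlockWhite_Agent214_Val_221216_110415_6984",
         "key": "assigned", "value": 1671188657},
    ],
}

by_status = workflows_by_status(sample)
```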
To get the agent info we call the following view:
scurl -s https://cmsweb-test9.cern.ch/couchdb/wmstats/_design/WMStatsErl7/_view/agentInfo | jq | head -20
{
"total_rows": 9,
"offset": 0,
"rows": [
{
"id": "global_workqueue",
"key": "global_workqueue",
"value": {
"_id": "global_workqueue",
"_rev": "681-6a1bece20a50c1c56b6bea95fb29e961",
"agent_url": "global_workqueue",
"agent_team": "",
"agent_version": "2.1.6rc3",
"timestamp": 1671462286,
"down_components": [],
"type": "agent_info",
"down_component_detail": [],
...
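The agentInfo rows can be scanned for agents that need attention. The sketch below relies only on the fields visible in the output above (`agent_url`, `down_components`, `timestamp`); the one-hour staleness threshold and the helper name are assumptions made for illustration, not WMStats behaviour.

```python
def agents_with_problems(view_result, now, max_age=3600):
    """Return {agent_url: [reasons]} for agents that look unhealthy.

    max_age is a hypothetical staleness threshold in seconds.
    """
    problems = {}
    for row in view_result.get("rows", []):
        info = row["value"]
        reasons = []
        if info.get("down_components"):
            reasons.append("components down")
        if now - info.get("timestamp", 0) > max_age:
            reasons.append("stale heartbeat")
        if reasons:
            problems[info["agent_url"]] = reasons
    return problems


sample = {
    "total_rows": 9,
    "offset": 0,
    "rows": [
        {
            "id": "global_workqueue",
            "key": "global_workqueue",
            "value": {
                "agent_url": "global_workqueue",
                "agent_version": "2.1.6rc3",
                "timestamp": 1671462286,
                "down_components": [],
                "type": "agent_info",
            },
        }
    ],
}

healthy = agents_with_problems(sample, now=1671462286 + 60)
stale = agents_with_problems(sample, now=1671462286 + 7200)
```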
The WMStatsWriter.py is responsible for writing data to CouchDB. This class is used by different WMCore components like:
To understand which views are used, we can look them up in the CouchDB log:
grep _design /data/srv/logs/couchdb/couch.log | grep couchdb.couchdb | grep -v logdb | awk '{print $9,$10}' | head
or we can list all views in a particular DB: https://cmsweb.cern.ch:8443/couchdb/wmstats/_design_docs
To query a view we need its URI.
The URI to query to get a view's result is /database/_design/designdocname/_view/viewname
For example:
scurl https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl | jq
{
"_id": "_design/WMStatsErl",
"_rev": "10-01ea14e3d6042a7e0a480e0b9e485c3b",
"language": "erlang",
"views": {
"requestAgentUrl": {
"map": "fun({Doc}) ->\n DocType = couch_util:get_value(<<\"type\">>, Doc),\n case DocType of\n undefined -> ok;\n <<\"agent_request\">> ->\n AgentUrl = couch_util:get_value(<<\"agent_url\">>, Doc),\n Workflow = couch_util:get_value(<<\"workflow\">>, Doc),\n Emit([Workflow, AgentUrl], null);\n _ -> ok\n end\nend.",
"reduce": "_count"
}
},
"couchapp": {
"manifest": [
"language",
"views/",
"views/requestAgentUrl/",
"views/requestAgentUrl/map.erl",
"views/requestAgentUrl/reduce.erl"
],
"objects": {},
"signatures": {}
}
}
Here the designdocname is WMStatsErl and the view name is requestAgentUrl. Each view has a map and a reduce section in the document. The map section stores the map function, which can be written either in JavaScript or in Erlang (the language CouchDB itself is written in), while the reduce section provides the reduce function used together with the given map; it is optional.
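To make the interplay of map and reduce more concrete, here is a small Python emulation of the requestAgentUrl view. It is only an illustration of the semantics: the map emits one `[workflow, agent_url]` key per `agent_request` document, and the `_count` reduce counts the emitted rows, either in total or per distinct key. CouchDB itself evaluates the Erlang function shown above; the Python function names and the sample documents are ours.

```python
from collections import Counter


def map_request_agent_url(doc):
    """Python rendering of the Erlang map: yield (key, value) pairs."""
    if doc.get("type") == "agent_request":
        yield (doc.get("workflow"), doc.get("agent_url")), None


def run_view(docs, group=False):
    """Apply the map to every doc, then a _count style reduce."""
    emitted = [pair for doc in docs for pair in map_request_agent_url(doc)]
    if group:
        # One count per distinct key, as with ?group=true.
        return dict(Counter(key for key, _ in emitted))
    # A single overall count, as with a plain reduce.
    return len(emitted)


# Hypothetical documents used only for illustration.
docs = [
    {"type": "agent_request", "workflow": "wf_a", "agent_url": "agent1"},
    {"type": "agent_request", "workflow": "wf_a", "agent_url": "agent2"},
    {"type": "agent_info", "agent_url": "agent1"},
]

total = run_view(docs)
per_key = run_view(docs, group=True)
```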
To list all docs in a given view we use:
https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl1/_view/byAgentURL
Or, we can pass a filter to get specific keys:
scurl https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl1/_view/byAgentURL -X POST -H "Content-type: application/json" -d '{"keys": ["vocms0281.cern.ch"]}'
Or, we can view the grouping:
https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl3/_view/jobsByStatusWorkflow?group=true
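All three calls above follow the /database/_design/designdocname/_view/viewname pattern, optionally followed by query parameters. A minimal sketch of a helper that assembles such URLs is given below; the function name is hypothetical and only the Python standard library is used.

```python
from urllib.parse import quote, urlencode


def view_url(base, database, design, view, **params):
    """Build <base>/<database>/_design/<design>/_view/<view>[?params]."""
    parts = (database, "_design", design, "_view", view)
    path = "/".join(quote(part, safe="") for part in parts)
    url = f"{base.rstrip('/')}/{path}"
    if params:
        url += "?" + urlencode(params)
    return url


base = "https://cmsweb.cern.ch:8443/couchdb"
plain = view_url(base, "wmstats", "WMStatsErl1", "byAgentURL")
grouped = view_url(base, "wmstats", "WMStatsErl3", "jobsByStatusWorkflow", group="true")
```

Query parameters such as `group` are passed as strings because CouchDB expects the literal `true` in the URL.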
The actual document for a given workflow, which we can see on the ReqMgr2 page
https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816
can be fetched directly from CouchDB as follows:
scurl -s -X POST -H "Content-type: application/json" -d '{"keys": ["pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816"]}' "https://cmsweb.cern.ch:8443/couchdb/reqmgr_workload_cache/_all_docs?include_docs=True" | jq
The WMStats UI server gets its data from the WMStats cache server. The latter provides the full set of JSON docs as a single list, which leads to a very large payload.
A single WMStats JSON record is quite large and its data structure can be seen over here. It consists of various attributes describing the current workflow state, e.g. its name, prepID, Tasks, etc. All attributes follow the CamelCase naming convention, except AgentJobInfo, whose content mixes CamelCase with underscore-separated names, e.g.
"AgentJobInfo": {
"cmsgwms-submit6.fnal.gov": {
"_id": "cmsgwms-submit6.fnal.gov-cmsunified_ACDC0_task_EXO-RunIISummer20UL18wmLHEGEN-00824__v1_T_220425_125613_9010",
"_rev": "274-4282735881",
"agent_team": "production",
"agent_version": "2.0.2.patch1",
"agent": "WMAgent",
"agent_url": "cmsgwms-submit6.fnal.gov",
"type": "agent_request",
"workflow": "cmsunified_ACDC0_task_EXO-RunIISummer20UL18wmLHEGEN-00824__v1_T_220425_125613_9010",
"status": {
"inQueue": 2
},
...
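Since AgentJobInfo is keyed by agent URL, a workflow-level picture requires combining the per-agent `status` blocks. A minimal sketch of such a summation is shown below. It assumes that the status values are plain integers, as in the excerpt above, and skips anything else; the second agent and its numbers are made up for illustration.

```python
from collections import Counter


def total_status(agent_job_info):
    """Sum the per-agent job status counters into one dictionary."""
    totals = Counter()
    for info in agent_job_info.values():
        for state, count in info.get("status", {}).items():
            # Only flat integer counters are summed in this sketch.
            if isinstance(count, int):
                totals[state] += count
    return dict(totals)


agent_job_info = {
    "cmsgwms-submit6.fnal.gov": {
        "agent_team": "production",
        "type": "agent_request",
        "status": {"inQueue": 2},
    },
    # Hypothetical second agent, not taken from real data.
    "another-agent.example.org": {
        "agent_team": "production",
        "type": "agent_request",
        "status": {"inQueue": 3},
    },
}

totals = total_status(agent_job_info)
```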
The production WMStats server data is quite large, e.g.
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/requestcache
takes about 1.5 minutes to fetch and returns around 300MB. Loading such a large JSON document into RAM requires at least 2-3 times the JSON size, regardless of the underlying language. As a result, this HTTP call not only blocks the frontend (FE), but also introduces significant overhead on the client side (WMStats UI). It is therefore desirable to modify the WMStats server as follows:
- change the requestcache API to return a subset of the data using idx/limit, a.k.a. pagination, e.g. return only 10 records
- change the requestcache API to provide the ndjson format to support data streaming
- allow gzip encoding in HTTP requests
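The three proposed changes can be sketched together in a few lines. This is not the WMStats implementation: the helper names are hypothetical, the `idx`/`limit` parameters simply follow the naming used in the list above, and the records are made up. The sketch slices a page out of the record list, serializes it as ndjson (one JSON document per line) and compresses the result with gzip.

```python
import gzip
import json


def paginate(records, idx=0, limit=10):
    """Return at most `limit` records starting at position `idx`."""
    return records[idx: idx + limit]


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one document per line."""
    return "".join(json.dumps(rec, separators=(",", ":")) + "\n" for rec in records)


def gzip_body(text):
    """Compress a response body as it would be sent with Content-Encoding: gzip."""
    return gzip.compress(text.encode("utf-8"))


# Hypothetical records used only for illustration.
records = [{"RequestName": f"wf_{i}"} for i in range(25)]

page = paginate(records, idx=20, limit=10)
body = to_ndjson(page)
compressed = gzip_body(body)
```

With ndjson the client can parse and render each line as soon as it arrives instead of waiting for the closing bracket of one huge JSON list.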
Also, the WMStats UI server needs to show aggregated data such as job, event and lumi progress, failure rate, etc. These metrics are calculated by JavaScript functions residing in the WMStats UI server, see the WMStats/_attachments/js/Views area for all functions. The calculations are repeated every time the WMStats UI page is used, which adds latency that grows with the number of workflows in the system. It would be more appropriate either to shift this functionality to the WMStats cache server, or to introduce appropriate views within CouchDB and cache their results in the WMStats cache server. This would allow a lightweight implementation of the WMStats UI server which only fetches and displays the data.
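As an illustration of moving the aggregation to the server side, the sketch below computes a metric once and serves it from a small time-based cache, so that clients only fetch the result. Everything here is an assumption made for the example: the class and function names, the five-minute lifetime, and the definition of failure rate as failed jobs divided by finished jobs, which may differ from what the existing JavaScript functions compute.

```python
import time

_UNSET = object()


def failure_rate(success, failure):
    """Fraction of finished jobs that failed (assumed definition)."""
    done = success + failure
    return failure / done if done else 0.0


class CachedSummary:
    """Recompute an aggregated summary at most once per `ttl` seconds."""

    def __init__(self, compute, ttl=300.0, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._value = _UNSET
        self._stamp = 0.0

    def get(self):
        now = self._clock()
        if self._value is _UNSET or now - self._stamp >= self._ttl:
            self._value = self._compute()
            self._stamp = now
        return self._value


# Hypothetical job counts used only for illustration.
summary = CachedSummary(lambda: {"failure_rate": failure_rate(success=90, failure=10)})
current = summary.get()
```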
You may find the full proposal for the new implementation of the WMStats UI server over here.