WMCore Developers on shift
This document serves as a short list of the responsibilities to be covered during shift weeks by the WMCore developers.
Usually the developers in the WMCore team share the operational load, but a portion of it consists of regular duties, like attending meetings and providing support to other teams, which tend to cost a lot of time during which a parallel task requiring strong concentration is difficult to follow. The shift week is a week during which one developer is dedicated to covering most of the operational activities on a regular schedule, and hence during that week their time is mostly filled with meetings and debugging. A non-exhaustive list is provided below.
- Meetings - Besides our own weekly meeting, we need to cover a set of regular meetings with other teams, during which we try to provide useful technical information on the pieces of the WMCore system each team uses. For some of the meetings we have an agreement with the people leading the meeting to have the WMCore section at the beginning, but it is useful to stay to the very end even though not active, because many times we are asked questions which pop up on the go while discussions are ongoing. During those meetings we also tend to keep the other teams on track with our schedule of regular deployments and updates, as well as with major changes or important bug fixes concerning them.
- Producing internal reports - The WMCore developer on shift serves as a contact between the outer world and the rest of the team, so after every meeting (we tend to keep that interval short while the information is still fresh) (s)he provides a list of topics discussed during the meeting just followed, together with the replies (s)he could or could not give, or the eventual outcome if a solid decision has been taken. In some cases these result in action items on us, so we need to be sure each of us is on track. If a GH issue needs to be created for following such an action, most of the time we ask the person who brought up the topic to create the GH issue according to the templates we have provided, and we follow through there.
- Support - if possible within a feasible response time.
  - During those weeks many teams ask questions, through the various channels of communication we follow, about internals of the system on which only we can provide information; many of them concern not only different APIs and system behavior, but also policies discussed far back in time and long forgotten.
  - Many times we need to provide support in debugging issues (especially with the P&R team) which exceed the level of knowledge about the system itself, not only of the people using it and asking the question, but also our own.
- System monitoring - We need to constantly monitor the health of the system, 24/7. We need to be sure that:
  - we provide uninterrupted usage for everybody who depends on the WMCore system
  - we do not have components down, resulting in stuck load and overfilling of the system in a short amount of time
  - we provide the service bug free, mostly taking care that the way the whole system works does not lead to data loss or corruption, e.g. because of continuous misbehavior of a component or an overlooked bug - this is in general a difficult task, not only during shift weeks.
- Debugging:
  The debugging we normally do is usually triggered by, or is a follow-up on, one of the following three categories:
  - on demand:
    - Most of the time these are requests from other teams, such as P&R, who are looking at the general system load and are reporting misbehavior which is noticeable in the payload - the workflows' behavior.
  - on system failure:
    - Those are pretty visible cases, when a component breaks badly (either completely or with a cyclic pattern) and causes accumulation of a big backlog in some piece of the system. NOTE: The congested part of the system is not necessarily directly linked with the broken component; sometimes the backlog may accumulate a few stages after the misbehaving piece.
  - on bug discovery - not always leading to an immediate system failure.

  NOTE: The established practice is to create a follow-up GH issue right after one of the above three cases is met, and to communicate this issue to the rest of the team. Usually the person on shift who starts the debugging takes the issue, but this is not mandatory. Many times someone else may have more knowledge about the problem at hand, or an emergency debugging may need to span beyond a single shift period and another person may need to take over. This is to be communicated internally.
Examples of typical debugging issues:
- https://github.com/dmwm/WMCore/issues/11187
- https://github.com/dmwm/WMCore/issues/11186
- https://github.com/dmwm/WMCore/issues/11168
- https://github.com/dmwm/WMCore/issues/10026
A good place to look:
Here is a wiki we started long ago for accumulating well-known misbehavior cases and possible actions to mitigate their effects (this still needs to be updated on a regular basis, though): https://github.com/dmwm/WMCore/wiki/trouble-shooting
Possible responsibilities agreed upon in the past, but ones which could not be fit in a fair manner because of the hard misalignment between the schedules of the deployment cycles and the shift week rotation:
- Release validation - we decided to follow that in GitHub issues and assign them by mutual agreement
- Monitoring and support for the CMSWEB team during regular central services deployments - this more or less still holds as a pure responsibility of the person on shift, even though sometimes one of us needs to follow a few consecutive cycles.
- WMAgent deployment campaigns - currently mostly driven by Alan, for many reasons, but we can cover for him at any time if needed. The draining and monitoring is still a shared responsibility.
The broader responsibilities of every developer in the WMCore team are listed in the following wiki: https://github.com/dmwm/WMCore/wiki/WMCore-developer-responsibilities
- Slack channels:
  - P&R (cms-compops-pnr.slack.com) - actively watching the `#wmcore-support` channel
  - WMCore (cms-dmwm.slack.com) - actively watching all channels. Special attention to:
    - `#tier0-dev`: to communicate with the T0 team
    - `#wmcore-rucio`: to communicate with a very small set of the DM experts
    - `#wmagent-dev`: our internal communication (it should be followed even when you are not on shift)
  - Rucio (rucio.slack.com) - passively (only when tagged) watching `#cms`, `#cms-ops` and `#cms-consistency`
- Mattermost channels:
  - Everything that may concern us in the O&C group (e.g. SI): people usually tag us explicitly if we are needed somewhere
  - DMWM is a must - https://mattermost.web.cern.ch/cms-o-and-c/channels/dmwm
- **Email groups:** (in case meetings are cancelled, changed, etc. and, e.g., the announcement was not sent via Slack)
  - "cms-tier0-operations (CMS tier0 operations)" <cms-tier0-operations@cern.ch>
  - "cms-comp-ops-workflow-team (cms-comp-ops-workflow-team)" <cms-comp-ops-workflow-team@cern.ch>
Regular meetings to be covered during the shift week:
- Monday:
- Tuesday:
  - T0 - 14:00 CERN Time (first ~15min only): twiki page
- Wednesday:
  - O&C - 15:00 CERN Time: indico
  - P&R - 16:00 CERN Time (first ~15min only): google doc
- Friday:
  - P&R development - 16:00 CERN Time: zoom
- WMAgent dashboard: https://monit-grafana.cern.ch/d/lhVKAhNik/cms-wmagent-monitoring?orgId=11
- Job dashboards:
  - CMS Job monitoring 12 min: https://monit-grafana.cern.ch/d/o3dI49GMz/cms-job-monitoring-12m?orgId=11
  - CMS Job monitoring 12 min bars: https://monit-grafana.cern.ch/d/chVH8ZoGk/cms-job-monitoring-12m-bars?orgId=11
  - CMS Job Monitoring ES agg data: https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring-es-agg-data-official?orgId=11&refresh=15m
  - CMS Job Monitoring ES agg data (bars): https://monit-grafana.cern.ch/d/Y08Xu0oGz/cms-job-monitoring-es-agg-data-official-bars?orgId=11&refresh=15m
- WMStats: https://cmsweb.cern.ch/wmstats/index.html
- The place to check/maintain the list of all currently active agents is the following GH project board: https://github.com/dmwm/WMCore/projects/5
- The place to check the latest WMAgent versions/releases is the following project board: https://github.com/dmwm/WMCore/projects/29
In order to be able to fulfill one's duties during the shift, the developer must have access to both the CERN and the FNAL agents. These are steps which have already been mentioned in the onboarding document here. To elaborate a little bit on both types of agents we work with:
- Access to FNAL agents:
  - First, you need to have access to the FNAL computing resources, for which you need to send the proper request form as explained at Fermilab's site here.
  - Second, you will need to contact the operators managing the FNAL schedds so that your username is given access to the proper set of machines and added to the proper groups and service account, meaning `cmsdataops`. The change may take effect only once FNAL's regular puppet run has completed a cycle.
- Access to CERN agents:
  - One needs their regular CERN account for that, and needs to contact the VOC in order to be given the same access as for FNAL, with a slight difference: the service account should be `cmst1`.
  - For accessing the `cmst1` user without the need of a password, one needs to do `sudo` instead of `su`; this way the individual Kerberos credentials are forwarded with the login session. For convenience, the following alias may be set in everybody's `.bashrc` file:
alias cmst1='sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc'
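A minimal sketch of putting that alias in place, assuming you are logged in to an agent node (vocms0290 is used here only because it appears in the examples below) with your personal account; after sourcing your `.bashrc`, the `cmst1` command should drop you into a `cmst1` shell without asking for a password:
[user@vocms0290]$ echo "alias cmst1='sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc'" >> ~/.bashrc
[user@vocms0290]$ source ~/.bashrc
[user@vocms0290]$ cmst1
cmst1@vocms0290:/afs/cern.ch/user$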
- The full list of machines to get access to is listed here.
- CRIC roles one needs - ReqMgr/Data-manager
- Initial login to the machine:
[user@vocms0290]$ cmst1
cmst1@vocms0290:/afs/cern.ch/user$ agentenv
- Machine and components status management:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage status
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-agent
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-agent
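A minimal sketch of a full, clean restart combining the commands above; the ordering (components down before the backend services, services back up before the components) is the intuitive one rather than an official procedure:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-agent
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-agent
cmst1@vocms0290:/data/srv/wmagent/current$ $manage status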
- Restart a subset of the agent's components:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmcoreD --restart --component JobAccountant,RucioInjector
- Unregister an agent from WMCore central services:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-unregister-wmstats `hostname -f`
- Check or add resources to the agent's resource control database:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_HLT -p
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-resource-control --plugin=SimpleCondorPlugin --opportunistic --pending-slots=1000 --running-slots=2000 --add-one-site T3_ES_PIC_BSC
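For instance, a small, purely illustrative loop reusing the `-p` option above to print the current thresholds for a handful of sites (the site names are just examples):
cmst1@vocms0290:/data/srv/wmagent/current$ for site in T2_CH_CERN T1_US_FNAL; do $manage execute-agent wmagent-resource-control --site-name="$site" -p; done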
- Use the internal configuration and SQL client to connect to the current agent's database:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage db-prompt wmagent
Optionally you may use the `rlwrap` tool, if available at the agent, in order to have a proper console output wrapper and history, e.g.:
cmst1@vocms0290:/data/srv/wmagent/current$ rlwrap -m -pgreen -H /data/tmp/.sqlplus.hist $manage db-prompt
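As an illustration of what can be done from that prompt, here is a hypothetical query counting jobs per state; it assumes the standard WMBS schema (tables `wmbs_job` and `wmbs_job_state`) and that `db-prompt` passes SQL fed on stdin straight to the database client:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage db-prompt wmagent <<'EOF'
SELECT st.name AS job_state, COUNT(*) AS njobs
  FROM wmbs_job j
  JOIN wmbs_job_state st ON st.id = j.state
 GROUP BY st.name;
EOF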
- Kill a workflow at the agent:
cmst1@vocms0290:/data/srv/wmagent/current $ $manage execute-agent kill-workflow-in-agent <FIXME:workflow-name>
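If several workflows need to be killed at once, a simple shell loop over the same command can be used (the workflow names below are placeholders, just like the FIXME above):
cmst1@vocms0290:/data/srv/wmagent/current $ for wf in <FIXME:workflow-name-1> <FIXME:workflow-name-2>; do $manage execute-agent kill-workflow-in-agent "$wf"; done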
- Minimal depth of the WMAgent tree, starting from the `current` deployment:
cmst1@vocms0290:/data/srv/wmagent/current $ tree -lL 3
.
├── apps -> apps.sw
│ ├── wmagent -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3
│ │ ├── bin
│ │ ├── data
│ │ ├── doc
│ │ ├── etc
│ │ ├── lib
│ │ ├── xbin
│ │ ├── xdata
│ │ ├── xdoc
│ │ └── xlib
│ └── wmagentpy3 -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3 [recursive, not followed]
├── apps.sw
│ ├── wmagent -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3 [recursive, not followed]
│ └── wmagentpy3 -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3 [recursive, not followed]
├── auth
├── bin
├── config
│ ├── couchdb
│ │ └── local.ini
│ ├── mysql
│ │ └── my.cnf
│ ├── rucio
│ │ └── etc
│ ├── wmagent -> ../config/wmagentpy3
│ │ ├── config.py
│ │ ├── config.py~
│ │ ├── config-template.py
│ │ ├── deploy
│ │ ├── local.ini
│ │ ├── manage
│ │ ├── my.cnf
│ │ ├── __pycache__
│ │ └── rucio.cfg
│ └── wmagentpy3
│ ├── config.py
│ ├── config.py~
│ ├── config-template.py
│ ├── deploy
│ ├── local.ini
│ ├── manage
│ ├── my.cnf
│ ├── __pycache__
│ └── rucio.cfg
├── install
│ ├── couchdb
│ │ ├── certs
│ │ ├── database
│ │ └── logs
│ ├── mysql
│ │ ├── database
│ │ └── logs
│ └── wmagentpy3
│ ├── AgentStatusWatcher
│ ├── AnalyticsDataCollector
│ ├── ArchiveDataReporter
│ ├── DBS3Upload
│ ├── ErrorHandler
│ ├── JobAccountant
│ ├── JobArchiver
│ ├── JobCreator
│ ├── JobStatusLite
│ ├── JobSubmitter
│ ├── JobTracker
│ ├── JobUpdater
│ ├── RetryManager
│ ├── RucioInjector
│ ├── TaskArchiver
│ └── WorkQueueManager
└── sw
├── bin
│ ├── cmsarch -> ../common/cmsarch
│ ├── cmsos -> ../common/cmsarch
│ └── scramv1 -> ../common/scramv1
├── bootstrap.sh
├── bootstrap-slc7_amd64_gcc630.log
├── bootstraptmp
├── cmsset_default.csh
├── cmsset_default.sh
├── common
│ ├── cmsarch
│ ├── cmsos
│ ├── cmspkg
│ ├── migrate-cvsroot
│ ├── scram
│ ├── scramv0 -> scram
│ └── scramv1 -> scram
├── data -> /data
│ ├── admin
│ ├── certs
│ ├── khurtado
│ ├── lost+found
│ ├── srv
│ └── tmp
├── etc
│ └── cms-common
├── share
│ └── cms
└── slc7_amd64_gcc630
├── cms
├── etc
├── external
├── tmp
└── var
- All component logs can be found here:
cmst1@vocms0290:/data/srv/wmagent/current $ ls -ls /data/srv/wmagent/current/install/wmagentpy3/*/ComponentLog
827896 -rw-r--r--. 1 cmst1 zh 847759271 Aug 24 19:54 /data/srv/wmagent/current/install/wmagentpy3/AgentStatusWatcher/ComponentLog
13484 -rw-r--r--. 1 cmst1 zh 13799746 Oct 19 08:38 /data/srv/wmagent/current/install/wmagentpy3/AnalyticsDataCollector/ComponentLog
4244 -rw-r--r--. 1 cmst1 zh 4337901 Oct 19 08:40 /data/srv/wmagent/current/install/wmagentpy3/ArchiveDataReporter/ComponentLog
4092 -rw-r--r--. 1 cmst1 zh 4182158 Sep 1 16:23 /data/srv/wmagent/current/install/wmagentpy3/DBS3Upload/ComponentLog
11412 -rw-r--r--. 1 cmst1 zh 11680500 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/ErrorHandler/ComponentLog
3560 -rw-r--r--. 1 cmst1 zh 3640859 Oct 19 08:42 /data/srv/wmagent/current/install/wmagentpy3/JobAccountant/ComponentLog
17716 -rw-r--r--. 1 cmst1 zh 18136882 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/ComponentLog
11240 -rw-r--r--. 1 cmst1 zh 11504668 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobCreator/ComponentLog
21708 -rw-r--r--. 1 cmst1 zh 22220852 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobStatusLite/ComponentLog
49336 -rw-r--r--. 1 cmst1 zh 50512403 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobSubmitter/ComponentLog
26964 -rw-r--r--. 1 cmst1 zh 27606966 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobTracker/ComponentLog
16576 -rw-r--r--. 1 cmst1 zh 16966263 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobUpdater/ComponentLog
14368 -rw-r--r--. 1 cmst1 zh 14707697 Oct 19 08:45 /data/srv/wmagent/current/install/wmagentpy3/RetryManager/ComponentLog
55756 -rw-r--r--. 1 cmst1 zh 57089235 Oct 19 08:41 /data/srv/wmagent/current/install/wmagentpy3/RucioInjector/ComponentLog
22684 -rw-r--r--. 1 cmst1 zh 23221159 Oct 19 08:42 /data/srv/wmagent/current/install/wmagentpy3/TaskArchiver/ComponentLog
600168 -rw-r--r--. 1 cmst1 zh 614565975 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/WorkQueueManager/ComponentLog
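When hunting for a misbehaving component, a quick first pass is to scan all of the component logs for recent errors or tracebacks (the grep patterns below are purely illustrative):
cmst1@vocms0290:/data/srv/wmagent/current $ grep -iE 'error|traceback' /data/srv/wmagent/current/install/wmagentpy3/*/ComponentLog | tail -n 100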