WMCore Developers on shift


This document serves as a short list of the responsibilities to be covered by WMCore developers during their shift weeks.

The developers in the WMCore team usually share the load of operational responsibilities, but a portion of those are recurring ones, like attending meetings and providing support to other teams, which tend to consume a lot of time during which a parallel task requiring strong concentration is difficult to pursue. The shift week is a week during which one developer is dedicated to covering most of the operational activities on a regular schedule, so during that week their time is mostly filled with meetings and debugging. A non-exhaustive list is provided below.

Shifter's Responsibilities:

  • Meetings - Besides our own weekly meeting, we need to cover a set of regular meetings with other teams, during which we try to provide useful technical information on the pieces of the WMCore system that each team uses. For some of the meetings we have an agreement with the people leading them to have the WMCore section at the beginning, but it is useful to stay until the very end even if not actively participating, because we are often asked questions that pop up while discussions are ongoing. During those meetings we also keep the other teams up to date with our schedule of regular deployments and updates, as well as with major changes or important bug fixes concerning them.
  • Producing internal reports - The WMCore developer on shift serves as the contact between the outside world and the rest of the team, so after every meeting (we tend to keep that interval short, while the information is still fresh), (s)he provides a list of the topics discussed in the meeting just attended, together with the replies (s)he could or could not give, or the eventual outcome if a solid decision was taken. In some cases these result in action items on us, so we need to make sure each of us stays on track. If a GH issue needs to be created to follow up on such an action, most of the time we ask the person who brought up the topic to create the GH issue according to the templates we provide, and we follow up there.
  • Support - ideally, the person on call is expected to reply to any inquiries within the same day (if during business hours).
    • During those weeks many teams ask questions through the various communication channels we follow, concerning internals of the system on which only we can provide information; many of them concern not only different APIs and system behavior, but also policies discussed far back in time and long forgotten.
    • We often need to provide support in debugging issues (especially with the P&R team) which exceed the level of knowledge about the system itself, not only of the people using it and asking the question, but also our own.
  • System monitoring - We need to constantly monitor the health of the system - 24/7. We need to make sure that:
    • we provide uninterrupted service for everybody who depends on the WMCore system
    • we do not have components down, which results in stuck load and overfills the system in a short amount of time (a quick health-check sketch is given right after this list)
    • we provide the service bug free, and in general take care that the way the whole system works does not lead to data loss or corruption, e.g. because of continuous misbehavior of a component or an overlooked bug - this is a difficult task in general, not only during shift weeks.
  • Debugging: The debugging we normally do is usually triggered by, or is a follow-up on, one of the following three categories:
    • on demand:
      • Most of the time these are requests from other teams such as P&R, who are looking at the general system load and reporting misbehavior that is noticeable in the payload - the workflows' behavior.
    • on system failure
      • These are quite visible cases in which a component breaks badly (either completely or with a cyclic pattern) and causes a big backlog to accumulate in some part of the system. NOTE: The congested part of the system is not necessarily directly linked to the broken component; sometimes the backlog may accumulate a few stages after the misbehaving piece.
    • on bug discovery - not always leading to an immediate system failure. NOTE: The established practice is to create a follow-up GH issue right after one of the above three cases is met, and to communicate this issue to the rest of the team. Usually the person on shift who starts the debugging takes the issue, but this is not mandatory. Many times someone else may have more knowledge about the problem at hand, or an emergency debugging session may need to span beyond a single shift period and another person may need to take over. This is to be communicated internally.
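
As a quick health check while monitoring (referenced in the System monitoring bullet above), below is a minimal sketch for spotting a component that has stopped writing to its log. It assumes the standard agent layout shown further down this page (install/wmagentpy3/<Component>/ComponentLog); the 30-minute threshold is an arbitrary assumption, tune it as needed:

# minimal sketch: flag components whose ComponentLog has gone quiet
for log in /data/srv/wmagent/current/install/wmagentpy3/*/ComponentLog; do
    # find -mmin +30 prints the file only if it was last modified more than 30 minutes ago
    if [ -n "$(find "$log" -mmin +30)" ]; then
        echo "Possibly stuck component: $(basename "$(dirname "$log")")"
    fi
done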

Examples of typical debugging issues:

A good place to look:

Here is a wiki we started long ago to accumulate well-known misbehavior cases and possible actions to mitigate their effects (it still needs to be updated on a regular basis, though): https://github.com/dmwm/WMCore/wiki/trouble-shooting

Extra responsibilities

Possible responsibilities agreed upon in the past, but ones which could not be assigned in a fair manner because of the hard misalignment between the schedules of the deployment cycles and the shift week rotation:

  • Release validation - we decided to follow this in GitHub issues and assign them by mutual agreement
  • Monitoring and support for the CMSWEB team during regular central services deployments - this more or less still holds as a responsibility purely of the person on shift, even though sometimes one of us needs to follow a few consecutive cycles.
  • WMAgent deployment campaigns - currently mostly driven by Alan, for many reasons, but we can cover for him at any time if needed. The draining and monitoring is still a shared responsibility.

Developer's Responsibilities:

The broader responsibilities of every developer in the WMCore team are listed in the following wiki: https://github.com/dmwm/WMCore/wiki/WMCore-developer-responsibilities

Channels to follow:

  • Slack channels: (DEPRECATED)

    • P&R (cms-compops-pnr.slack.com) - actively watching the #wmcore-support channel
    • WMCore (cms-dmwm.slack.com) - actively watching all channels. Special attention to:
      • #tier0-dev: to communicate with the T0 team
      • #wmcore-rucio: to communicate with a very small set of the DM experts
      • #wmagent-dev: our internal communication (it should be followed even when you are not on shift).
    • Rucio (rucio.slack.com) - passively (only when tagged) watching #cms, #cms-ops and #cms-consistency
  • Mattermost channels under the CMS O&C organization:

    • DMWM: the WM team is expected to follow this dmwm channel on a daily basis, regardless of being on shift duty or not.
    • WM Dev: the WM team is expected to follow this wm_dev channel on a daily basis to stay up-to-date with developments involving the WM system.
    • WM Ops: the WM developer on shift is expected to be the first line of contact through this wm_ops channel. It's advised to monitor it at least twice a day. Nonetheless, it is recommended that the whole WM team follow it as well, to stay on top of potential operational issues.
    • WM Team: the wm_team channel is private and dedicated only to the core WM developers. Please also use this channel for sharing meeting summaries with the rest of the team. The WM team is expected to follow it on a daily basis as well.
    • Everything else that may concern us in the O&C group (e.g. SI..) - people tend to tag us explicitly if we are needed somewhere
  • Email groups: (in case meetings are cancelled, changed, etc. and the announcement was not sent via slack, for example)

    • "cms-tier0-operations (CMS tier0 operations)" <cms-tier0-operations cern.ch>
    • "cms-comp-ops-workflow-team (cms-comp-ops-workflow-team)" <cms-comp-ops-workflow-team cern.ch>

Meetings to follow:

  • Monday:
    • WMCore - 16:00 CERN Time: indico
    • CompOps - 17:00 CERN Time: indico
  • Tuesday:
    • T0 - 14:00 CERN Time (first ~15min only): twiki page
  • Wednesday:
    • O&C - 15:00 CERN Time: indico
    • P&R - 16:00 CERN Time (first ~15min only): google doc
  • Friday:

Monitoring we use:

Access rights and credentials:

In order to be able to fulfill one's duties during the shift, the developer must have access to both the CERN and FNAL agents. These steps have already been mentioned in the onboarding document here. To elaborate a little bit on both types of agents we work with:

  • Access to FNAL agents:
    • First, you need access to the FNAL computing resources, for which you need to send the proper request form as explained on Fermilab's site here.
    • Second, you will need to contact the operators managing the FNAL schedds, so that your username is given access to the proper set of machines and is added to the proper groups and service accounts - meaning cmsdataops. The change may only take effect once FNAL's regular puppet run has completed a cycle.
  • Access to CERN agents:
    • One needs a regular CERN account for that, and needs to contact the VOC in order to be given the same access as for FNAL, with a slight difference - the service account should be cmst1.
    • For accessing the cmst1 user without the need of a password, one needs to use sudo instead of su; this way the individual kerberos credentials are forwarded with the login session. For convenience the following alias may be set in everybody's .bashrc file:
alias cmst1='sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc'

  • The full list of machines to get access to is listed here.
  • CRIC roles one needs - ReqMgr/Data-manager
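
With the accesses above in place, a minimal login sketch (vocms0290 is merely the example host used throughout this page, and <username> is a placeholder for your own account):

# get a fresh kerberos ticket, then log in to one of the CERN agents
kinit <username>@CERN.CH
ssh <username>@vocms0290.cern.ch
# once on the node, switch to the service account via the alias above
cmst1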

A set of useful initial commands to use once logged in to an agent:

  • Initial login to the machine:
[user@vocms0290]$ cmst1 
cmst1@vocms0290:/afs/cern.ch/user$ agentenv
  • Machine and component status management:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage status
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-services
cmst1@vocms0290:/data/srv/wmagent/current$ $manage stop-agent
cmst1@vocms0290:/data/srv/wmagent/current$ $manage start-agent
  • Restart a subset of the agent's components:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmcoreD --restart --component JobAccountant,RucioInjector
  • Unregister an agent from WMCore central services:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-unregister-wmstats `hostname -f`
  • Check or add resources to the agent's resource control database:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_HLT -p
cmst1@vocms0290:/data/srv/wmagent/current$ $manage execute-agent wmagent-resource-control --plugin=SimpleCondorPlugin --opportunistic --pending-slots=1000 --running-slots=2000 --add-one-site T3_ES_PIC_BSC
  • Use the internal configuration and SQL client to connect to the current agent's database:
cmst1@vocms0290:/data/srv/wmagent/current$ $manage db-prompt wmagent

Optionally you may use the rlwrap tool, if available on the agent, in order to have proper console output wrapping and history, e.g.:

cmst1@vocms0290:/data/srv/wmagent/current$ rlwrap -m -pgreen -H /data/tmp/.sqlplus.hist $manage db-prompt
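
Once at the database prompt, typical queries inspect the WMBS tables. As an illustrative sketch only, assuming the standard WMBS schema (where wmbs_job.state is a foreign key to wmbs_job_state.id), the following counts jobs per state:

-- sketch: count WMBS jobs per state (schema assumption: wmbs_job.state -> wmbs_job_state.id)
SELECT st.name, COUNT(*)
  FROM wmbs_job jo
  JOIN wmbs_job_state st ON jo.state = st.id
 GROUP BY st.name;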

  • Kill a workflow at the agent:
cmst1@vocms0290:/data/srv/wmagent/current $ $manage execute-agent kill-workflow-in-agent <FIXME:workflow-name> 

The WMAgent tree

  • Minimal depth of the WMAgent tree, starting from the current deployment:
cmst1@vocms0290:/data/srv/wmagent/current $ tree -lL 3
.
├── apps -> apps.sw
│   ├── wmagent -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3
│   │   ├── bin
│   │   ├── data
│   │   ├── doc
│   │   ├── etc
│   │   ├── lib
│   │   ├── xbin
│   │   ├── xdata
│   │   ├── xdoc
│   │   └── xlib
│   └── wmagentpy3 -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3  [recursive, not followed]
├── apps.sw
│   ├── wmagent -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3  [recursive, not followed]
│   └── wmagentpy3 -> ../sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.1.pre3  [recursive, not followed]
├── auth
├── bin
├── config
│   ├── couchdb
│   │   └── local.ini
│   ├── mysql
│   │   └── my.cnf
│   ├── rucio
│   │   └── etc
│   ├── wmagent -> ../config/wmagentpy3
│   │   ├── config.py
│   │   ├── config.py~
│   │   ├── config-template.py
│   │   ├── deploy
│   │   ├── local.ini
│   │   ├── manage
│   │   ├── my.cnf
│   │   ├── __pycache__
│   │   └── rucio.cfg
│   └── wmagentpy3
│       ├── config.py
│       ├── config.py~
│       ├── config-template.py
│       ├── deploy
│       ├── local.ini
│       ├── manage
│       ├── my.cnf
│       ├── __pycache__
│       └── rucio.cfg
├── install
│   ├── couchdb
│   │   ├── certs
│   │   ├── database
│   │   └── logs
│   ├── mysql
│   │   ├── database
│   │   └── logs
│   └── wmagentpy3
│       ├── AgentStatusWatcher
│       ├── AnalyticsDataCollector
│       ├── ArchiveDataReporter
│       ├── DBS3Upload
│       ├── ErrorHandler
│       ├── JobAccountant
│       ├── JobArchiver
│       ├── JobCreator
│       ├── JobStatusLite
│       ├── JobSubmitter
│       ├── JobTracker
│       ├── JobUpdater
│       ├── RetryManager
│       ├── RucioInjector
│       ├── TaskArchiver
│       └── WorkQueueManager
└── sw
    ├── bin
    │   ├── cmsarch -> ../common/cmsarch
    │   ├── cmsos -> ../common/cmsarch
    │   └── scramv1 -> ../common/scramv1
    ├── bootstrap.sh
    ├── bootstrap-slc7_amd64_gcc630.log
    ├── bootstraptmp
    ├── cmsset_default.csh
    ├── cmsset_default.sh
    ├── common
    │   ├── cmsarch
    │   ├── cmsos
    │   ├── cmspkg
    │   ├── migrate-cvsroot
    │   ├── scram
    │   ├── scramv0 -> scram
    │   └── scramv1 -> scram
    ├── data -> /data
    │   ├── admin
    │   ├── certs
    │   ├── khurtado
    │   ├── lost+found
    │   ├── srv
    │   └── tmp
    ├── etc
    │   └── cms-common
    ├── share
    │   └── cms
    └── slc7_amd64_gcc630
        ├── cms
        ├── etc
        ├── external
        ├── tmp
        └── var

  • All component logs can be found here:
cmst1@vocms0290:/data/srv/wmagent/current $ ls -ls /data/srv/wmagent/current/install/wmagentpy3/*/ComponentLog
827896 -rw-r--r--. 1 cmst1 zh 847759271 Aug 24 19:54 /data/srv/wmagent/current/install/wmagentpy3/AgentStatusWatcher/ComponentLog
 13484 -rw-r--r--. 1 cmst1 zh  13799746 Oct 19 08:38 /data/srv/wmagent/current/install/wmagentpy3/AnalyticsDataCollector/ComponentLog
  4244 -rw-r--r--. 1 cmst1 zh   4337901 Oct 19 08:40 /data/srv/wmagent/current/install/wmagentpy3/ArchiveDataReporter/ComponentLog
  4092 -rw-r--r--. 1 cmst1 zh   4182158 Sep  1 16:23 /data/srv/wmagent/current/install/wmagentpy3/DBS3Upload/ComponentLog
 11412 -rw-r--r--. 1 cmst1 zh  11680500 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/ErrorHandler/ComponentLog
  3560 -rw-r--r--. 1 cmst1 zh   3640859 Oct 19 08:42 /data/srv/wmagent/current/install/wmagentpy3/JobAccountant/ComponentLog
 17716 -rw-r--r--. 1 cmst1 zh  18136882 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/ComponentLog
 11240 -rw-r--r--. 1 cmst1 zh  11504668 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobCreator/ComponentLog
 21708 -rw-r--r--. 1 cmst1 zh  22220852 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobStatusLite/ComponentLog
 49336 -rw-r--r--. 1 cmst1 zh  50512403 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobSubmitter/ComponentLog
 26964 -rw-r--r--. 1 cmst1 zh  27606966 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/JobTracker/ComponentLog
 16576 -rw-r--r--. 1 cmst1 zh  16966263 Oct 19 08:43 /data/srv/wmagent/current/install/wmagentpy3/JobUpdater/ComponentLog
 14368 -rw-r--r--. 1 cmst1 zh  14707697 Oct 19 08:45 /data/srv/wmagent/current/install/wmagentpy3/RetryManager/ComponentLog
 55756 -rw-r--r--. 1 cmst1 zh  57089235 Oct 19 08:41 /data/srv/wmagent/current/install/wmagentpy3/RucioInjector/ComponentLog
 22684 -rw-r--r--. 1 cmst1 zh  23221159 Oct 19 08:42 /data/srv/wmagent/current/install/wmagentpy3/TaskArchiver/ComponentLog
600168 -rw-r--r--. 1 cmst1 zh 614565975 Oct 19 08:44 /data/srv/wmagent/current/install/wmagentpy3/WorkQueueManager/ComponentLog
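
When chasing a failure, grepping a component log for recent errors is usually the first step; a couple of typical invocations (JobSubmitter is just an example component from the listing above):

cmst1@vocms0290:/data/srv/wmagent/current $ grep -i error install/wmagentpy3/JobSubmitter/ComponentLog | tail -20
cmst1@vocms0290:/data/srv/wmagent/current $ tail -f install/wmagentpy3/JobSubmitter/ComponentLog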