Setup and Instructions
(Requires `pip` and `virtualenv`.)

```shell
git clone git@github.com:dmwm/cms-htcondor-es.git -b master
cd cms-htcondor-es
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
At this point a dry run should complete without errors. (This only queries the collector for a list of schedds, but doesn't actually query them or upload anything.)

```shell
python spider_cms.py --process_queue --dry_run
```
CERN MONIT: Create `username` and `password` files in the `cms-htcondor-es` directory containing the correct authentication credentials obtained from CERN MONIT.

es-cms: Create a file called `es.conf` with the following format (and the corresponding credentials):

```
User: <username>
Pass: <password>
```
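For illustration, a credentials file in this format could be read with a small helper like the one below. This is a hypothetical sketch, not part of `spider_cms.py`; the function name and parsing details are assumptions:

```python
def read_es_conf(path="es.conf"):
    """Parse a 'User: .../Pass: ...' credentials file and return (user, password)."""
    creds = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                key, _, value = line.partition(":")
                creds[key.strip().lower()] = value.strip()
    return creds["user"], creds["pass"]
```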
Run the full query-and-upload machinery with the current options. (Add `--read_only` to only query, without uploading.)

```shell
python spider_cms.py --feed_amq --process_queue --query_pool_size 16 --upload_pool_size 8
```
The script first queries three condor collectors for a list of schedds (removing duplicates); currently 63 schedds are processed. First the history of each schedd is queried for completed jobs and the resulting documents are uploaded. Once that is finished, the condor queue of each schedd is queried for running and pending jobs, and those documents are uploaded. See below for a description of how these tasks are parallelized.
The easiest approach is to create a `spider_cms.sh` script with the corresponding setup and options, e.g. like this:

```shell
#!/bin/bash
cd /path/to/cms-htcondor-es/
source venv/bin/activate
python spider_cms.py --feed_amq --process_queue --query_pool_size 16 --upload_pool_size 8
```
Edit your crontab with `crontab -e` and add the following line:

```
*/12 * * * * /path/to/cms-htcondor-es/spider_cms.sh
```
Finally, it can be useful to receive email alerts for failing queries and timeouts. These are enabled with the `--email_alerts [email protected]` option. (Currently only a single recipient is possible.)
There are several useful options for debugging and testing:

- `--dry_run`: only query the collector for a list of schedds; skip the schedd queries and the uploading
- `--read_only`: do the queries but skip the uploading
- `--schedd_filter`: process only a (comma-separated) list of schedds
- `--skip_history`: process only the queue data
- `--log_level`: setting this to `INFO` or `DEBUG` gives additional output about internal queueing, bunching, and uploading
For tuning the performance, several options are available:

- Pool sizes: `--upload_pool_size` and `--query_pool_size` define the number of concurrent processes for uploading and querying, respectively. The default is 8 for each, but we are currently running smoothly with 16 query pools and 8 upload pools on a machine with 16 cores. Note: the query pool is also used for processing the condor history, i.e. that option also determines the number of parallel processes querying and uploading documents for completed jobs.
- Upload bunching: `--amq_bunch_size` and `--es_bunch_size` define the size of the bunches that are sent to AMQ and Elasticsearch, respectively. The current default is 5000 documents for AMQ and 250 for ES, which takes about 1 second from the CERN-based VM and took about 15 seconds from the UNL-based VM.
- Internal bunching: `--query_queue_batch_size` defines the size of the batches of documents sent from the query processes to the internal process that assembles bunches for uploading. The current default is 50.
In other words, the current setup for processing the queues is as follows: 16 parallel processes each query the queue of a single schedd for running and pending jobs. After obtaining 50 documents, a query process sends them to an input queue and continues until all jobs have been processed. An internal process takes these batches of job documents from the input queue and assembles them into bunches of 5000 documents, which are sent to one of 8 parallel upload processes. The script shuts down either when all jobs have been processed and uploaded, or after 11 minutes of running time.
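The two-stage bunching described above can be sketched as follows. This is an illustrative simplification, not the actual `spider_cms.py` code: the function names are invented, and a plain single-threaded queue stands in for the pool of parallel processes.

```python
from queue import Queue

QUERY_BATCH_SIZE = 50    # --query_queue_batch_size default
AMQ_BUNCH_SIZE = 5000    # --amq_bunch_size default

def query_schedd(docs, input_queue, batch_size=QUERY_BATCH_SIZE):
    """Query worker: push documents onto the input queue in small batches."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            input_queue.put(batch)
            batch = []
    if batch:  # flush the final, possibly short, batch
        input_queue.put(batch)

def assemble_bunches(input_queue, bunch_size=AMQ_BUNCH_SIZE):
    """Collector: drain the input queue and yield upload-sized bunches."""
    bunch = []
    while not input_queue.empty():
        bunch.extend(input_queue.get())
        while len(bunch) >= bunch_size:
            yield bunch[:bunch_size]
            bunch = bunch[bunch_size:]
    if bunch:  # flush whatever is left at shutdown
        yield bunch
```

Each bunch yielded by `assemble_bunches` would then be handed to one of the upload processes.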
The histories are processed more simply. Each of 8 parallel processes queries a schedd for documents. After reaching 5000 documents (configurable with `--bunching`), they are uploaded. When the upload finishes, the query continues until all jobs and all schedds have been processed.
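The simpler history flow can be sketched like this (again a hypothetical illustration, with an invented function name and an injected `upload` callable rather than the real uploader):

```python
def process_history(docs, upload, bunch_size=5000):
    """Accumulate documents from a schedd's history and upload every
    bunch_size of them (--bunching), then keep querying."""
    bunch = []
    for doc in docs:
        bunch.append(doc)
        if len(bunch) == bunch_size:
            upload(bunch)
            bunch = []
    if bunch:  # upload the final partial bunch
        upload(bunch)
```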