This guide provides an overview of common steps to narrow down failures seen during clusterm workflows.
This section details what goes on behind the scenes when a cluster manager workflow is triggered. The section is divided into the following sub-sections:
- Discovery: lists details of the discovery process
- Commission: lists details of the node(s) commission process
- Decommission: lists details of the node(s) decommission process
- Update: lists details of the node(s) update process
Clusterm uses ansible playbooks to handle the different workflows, so being able to find the ansible task that failed is crucial to troubleshooting a configuration failure. The Parsing Ansible Logs section walks through a few samples of ansible logs and the patterns to look for in failures.
For the interested reader, the ansible playbooks used by cluster manager are located here. An understanding of the ansible playbooks is not necessary for troubleshooting configuration failures.
The Discovery workflow runs the site.yml playbook with the nodes that are being discovered added to the cluster-node host-group. This playbook provisions serf on the nodes.
On success, once the configuration job is done, i.e. when the clusterctl job get active command returns an info for "active" job doesn't exist error, the node becomes visible in the output of the clusterctl nodes get command.
On failure, once the configuration job is done, i.e. when clusterctl job get active returns an info for "active" job doesn't exist error, the node will still not be visible in the output of the clusterctl nodes get command. Refer to the Useful Commands section for the commands to run and how to make sense of the logs.
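As a quick sanity check, the discovery result can be verified with the commands described later in this guide; the sequence below is only a sketch (outputs omitted):
# Poll until the discovery job finishes; when it does, this reports that
# info for the "active" job doesn't exist.
clusterctl job get active
# On success the newly discovered node is listed in the inventory.
clusterctl nodes get
# On failure, inspect the ansible logs of the last job.
clusterctl job get last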
The Commission workflow runs the site.yml playbook with the nodes that are being commissioned added to either the service-master or service-worker host-group, as specified in the commission command.
While commission is in progress the node's status is changed to Provisioning.
On success, once the configuration job is done, i.e. when clusterctl job get active returns an info for "active" job doesn't exist error, the node status is updated to Allocated in the clusterctl nodes get output.
On failure, once the configuration job is done, i.e. when clusterctl job get active returns an info for "active" job doesn't exist error, the node status is updated to Unallocated in the clusterctl nodes get output. The cleanup.yml playbook is then run on the node to remove any partial configuration remnants. Refer to the Useful Commands section for the commands to run and how to make sense of the logs.
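For reference, a typical commission sequence might look like the sketch below. The exact clusterctl subcommand syntax is an assumption (check clusterctl --help on your setup); the node name and host-group are only examples.
# Commission a node into the service-worker host-group (hypothetical node
# name; the subcommand form may differ across clusterctl versions).
clusterctl node commission cluster-node2-0 --host-group service-worker
# The node status changes to Provisioning while the job runs, then to
# Allocated on success or Unallocated on failure.
clusterctl nodes get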
The Decommission workflow runs the cleanup.yml playbook on the nodes that are being decommissioned. This stops the infra services and cleans up any service state from the nodes.
While decommission is in progress the node's status is changed to Cancelled.
Once the configuration job is done, i.e. when clusterctl job get active returns an info for "active" job doesn't exist error, the node status is updated to Decommissioned in the clusterctl nodes get output.
The decommission playbook runs in its entirety to do a best-effort cleanup and is not expected to return a failure.
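A decommission run can be sketched similarly; the subcommand form is an assumption (the job descriptions in the samples below show a decommissionEvent), so verify it with clusterctl --help:
# Decommission a node (hypothetical node name; syntax may differ).
clusterctl node decommission cluster-node2-0
# The status moves to Cancelled while cleanup.yml runs, then to Decommissioned.
clusterctl nodes get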
The Update workflow runs the cleanup.yml playbook followed by the site.yml playbook with the nodes that are being updated added to the previously-set host-group, as specified in the commission command. If the --host-group flag is provided, then all nodes are provisioned as part of that host-group.
While update is in progress the node's status is changed to Maintenance.
On success, once the configuration job is done, i.e. when clusterctl job get active returns an info for "active" job doesn't exist error, the node status is updated to Allocated in the clusterctl nodes get output.
On failure, once the configuration job is done, i.e. when clusterctl job get active returns an info for "active" job doesn't exist error, the node status is updated to Unallocated in the clusterctl nodes get output. The cleanup.yml playbook is then run on the node to remove any partial configuration remnants. Refer to the Useful Commands section for the commands to run and how to make sense of the logs.
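An update can be sketched as below; the subcommand form is an assumption, while the --host-group flag is the one described above:
# Re-provision a node, optionally forcing a host-group (hypothetical node
# name; the subcommand form may differ across clusterctl versions).
clusterctl node update cluster-node1-0 --host-group service-master
# The status moves to Maintenance while the job runs, then to Allocated on
# success or Unallocated on failure.
clusterctl nodes get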
This section lists a few useful commands, with sample outputs and common examples of parsing the logs.
The clusterctl nodes get command dumps the info for all the nodes in the inventory. The current inventory status of a node is shown in the status field under Inventory State. Following is a sample output for a 3-node cluster:
[vagrant@cluster-node1 ~]$ clusterctl nodes get
cluster-node1-0: Inventory State
cluster-node1-0: name: cluster-node1-0
cluster-node1-0: prev_state: Discovered
cluster-node1-0: prev_status: Provisioning
cluster-node1-0: state: Discovered
cluster-node1-0: status: Allocated
cluster-node1-0: Monitoring State
cluster-node1-0: label: cluster-node1
cluster-node1-0: management_address: 192.168.2.10
cluster-node1-0: serial_number: 0
cluster-node1-0: Configuration State
cluster-node1-0: host_group: service-master
cluster-node1-0: inventory_name: cluster-node1-0
cluster-node1-0: inventory_vars:
cluster-node1-0: etcd_master_addr:
cluster-node1-0: etcd_master_name:
cluster-node1-0: node_addr: 192.168.2.10
cluster-node1-0: node_name: cluster-node1-0
cluster-node1-0: ssh_address: 192.168.2.10
cluster-node2-0: Inventory State
cluster-node2-0: name: cluster-node2-0
cluster-node2-0: prev_state: Discovered
cluster-node2-0: prev_status: Cancelled
cluster-node2-0: state: Discovered
cluster-node2-0: status: Decommissioned
cluster-node2-0: Monitoring State
cluster-node2-0: label: cluster-node2
cluster-node2-0: management_address: 192.168.2.11
cluster-node2-0: serial_number: 0
cluster-node2-0: Configuration State
cluster-node2-0: host_group: service-master
cluster-node2-0: inventory_name: cluster-node2-0
cluster-node2-0: inventory_vars:
cluster-node2-0: etcd_master_addr:
cluster-node2-0: etcd_master_name:
cluster-node2-0: node_addr: 192.168.2.11
cluster-node2-0: node_name: cluster-node2-0
cluster-node2-0: ssh_address: 192.168.2.11
cluster-node3-0: Inventory State
cluster-node3-0: name: cluster-node3-0
cluster-node3-0: prev_state: Unknown
cluster-node3-0: prev_status: Incomplete
cluster-node3-0: state: Discovered
cluster-node3-0: status: Unallocated
cluster-node3-0: Monitoring State
cluster-node3-0: label: cluster-node3
cluster-node3-0: management_address: 192.168.2.12
cluster-node3-0: serial_number: 0
cluster-node3-0: Configuration State
cluster-node3-0: host_group: service-master
cluster-node3-0: inventory_name: cluster-node3-0
cluster-node3-0: inventory_vars:
cluster-node3-0: node_addr: 192.168.2.12
cluster-node3-0: node_name: cluster-node3-0
cluster-node3-0: ssh_address: 192.168.2.12
In the output above, the node cluster-node1-0 is already commissioned and hence its status is set to Allocated. The node cluster-node2-0 has been decommissioned and hence its status is set to Decommissioned. The node cluster-node3-0 has not been commissioned yet, and hence its status is set to Unallocated.
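On larger clusters it can help to trim this output down to just the name and status lines, for example with a simple grep (adjust the pattern if the field layout differs on your deployment):
[vagrant@cluster-node1 ~]$ clusterctl nodes get | grep -E ': +(name|status):'
cluster-node1-0: name: cluster-node1-0
cluster-node1-0: status: Allocated
cluster-node2-0: name: cluster-node2-0
cluster-node2-0: status: Decommissioned
cluster-node3-0: name: cluster-node3-0
cluster-node3-0: status: Unallocated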
The clusterctl job get last command dumps the ansible logs for the last configuration job that was run. It is useful to collect the output of this command in case of configuration failures.
The Parsing Ansible Logs section walks through a few samples of ansible logs and the patterns that indicate failures.
[vagrant@cluster-node1 ~]$ clusterctl job get last
Description: commissionEvent: nodes:[cluster-node2-0] extra-vars:{"docker_device":"/tmp/docker"} host-group:service-worker
Status: Errored
Error: exit status 2
Logs:
[DEPRECATION WARNING]: Instead of sudo/sudo_user, use become/become_user and
make sure become_method is 'sudo' (default).
This feature will be removed in a
future release. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
PLAY [devtest] *****************************************************************
skipping: no hosts matched
PLAY [volplugin-test] **********************************************************
skipping: no hosts matched
PLAY [cluster-node] ************************************************************
skipping: no hosts matched
PLAY [cluster-control] *********************************************************
skipping: no hosts matched
PLAY [service-master] **********************************************************
skipping: no hosts matched
PLAY [service-worker] **********************************************************
TASK [setup] *******************************************************************
ok: [cluster-node2-0]
TASK [base : include] **********************************************************
skipping: [cluster-node2-0]
TASK [base : include] **********************************************************
included: /vagrant/vendor/ansible/roles/base/tasks/redhat_tasks.yml for cluster-node2-0
TASK [base : install epel release package (redhat)] ****************************
ok: [cluster-node2-0]
TASK [base : install/upgrade base packages (redhat)] ***************************
skipping: [cluster-node2-0] => (item=[u'yum-utils', u'ntp', u'unzip', u'bzip2', u'curl', u'python-requests', u'bash-completion', u'kernel', u'libselinux-python'])
...
...
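When reporting a configuration failure, it is convenient to capture this output to a file that can be attached to a bug report, for example:
[vagrant@cluster-node1 ~]$ clusterctl job get last > clusterm-last-job.log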
The clusterctl job get active command dumps the ansible logs for the configuration job that is currently in progress. It can be useful for checking the status of a running job.
[vagrant@cluster-node1 ~]$ clusterctl job get active
Description: decommissionEvent: nodes:[cluster-node2-0] extra-vars: {}
Status: Running
Error:
Logs:
[DEPRECATION WARNING]: Instead of sudo/sudo_user, use become/become_user and
make sure become_method is 'sudo' (default).
This feature will be removed in a
future release. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
PLAY [all] *********************************************************************
TASK [setup] *******************************************************************
ok: [cluster-node2-0]
TASK [include_vars] ************************************************************
ok: [cluster-node2-0] => (item=contiv_network)
ok: [cluster-node2-0] => (item=contiv_storage)
ok: [cluster-node2-0] => (item=swarm)
ok: [cluster-node2-0] => (item=ucp)
ok: [cluster-node2-0] => (item=docker)
ok: [cluster-node2-0] => (item=etcd)
TASK [include] *****************************************************************
included: /vagrant/vendor/ansible/roles/contiv_network/tasks/cleanup.yml for cluster-node2-0
included: /vagrant/vendor/ansible/roles/contiv_storage/tasks/cleanup.yml for cluster-node2-0
included: /vagrant/vendor/ansible/roles/swarm/tasks/cleanup.yml for cluster-node2-0
included: /vagrant/vendor/ansible/roles/ucp/tasks/cleanup.yml for cluster-node2-0
included: /vagrant/vendor/ansible/roles/docker/tasks/cleanup.yml for cluster-node2-0
included: /vagrant/vendor/ansible/roles/etcd/tasks/cleanup.yml for cluster-node2-0
included: /vagrant/vendor/ansible/roles/ucarp/tasks/cleanup.yml for cluster-node2-0
TASK [stop netmaster] **********************************************************
changed: [cluster-node2-0]
TASK [stop aci-gw container] ***************************************************
fatal: [cluster-node2-0]: FAILED! => {"changed": false, "failed": true, "msg": "systemd could not find the requested service \"'aci-gw'\": "}
...ignoring
TASK [stop netplugin] **********************************************************
...
...
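To follow a long-running job, the same command can simply be re-run periodically; one way to do that is with watch, for example:
[vagrant@cluster-node1 ~]$ watch -n 10 'clusterctl job get active | tail -n 20'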
This section lists a few common patterns to look for when parsing the ansible logs gathered from the clusterctl job get <active|last> commands above. A common troubleshooting step is to identify the first failing task while parsing the logs from top to bottom.
The following sample logs indicate a successful execution of a task; note that the status is either changed or ok.
TASK [contiv_cluster : copy conf files for clusterm] ***************************
changed: [cluster-node1] => (item=clusterm.args)
changed: [cluster-node1] => (item=clusterm.conf)
TASK [base : install and start ntp] ********************************************
ok: [cluster-node2-0]
The following sample logs indicate that a task has errored; note that the status is fatal. Usually the output will also indicate the cause of the failure.
TASK [ucp : download and install ucp images] ***********************************
fatal: [cluster-node2-0]: FAILED! => {"changed": true, "cmd": "docker run --rm -t --name ucp -v /var/run/docker.sock:/var/run/docker.sock docker/ucp images --image-version=1.1.0", "delt
a": "0:00:10.694920", "end": "2016-05-25 21:32:54.185655", "failed": true, "rc": 1, "start": "2016-05-25 21:32:43.490735", "stderr": "Unable to find image 'docker/ucp:latest' locally\nlates
t: Pulling from docker/ucp\ne3f56cd84e02: Pulling fs layer\n8bfd2fd2949a: Pulling fs layer\n6dc0f52613dd: Pulling fs layer\n9775bf111871: Pulling fs layer\n9775bf111871: Verifying Checksum\
n9775bf111871: Download complete\ne3f56cd84e02: Verifying Checksum\ne3f56cd84e02: Download complete\n8bfd2fd2949a: Verifying Checksum\n8bfd2fd2949a: Download complete\n6dc0f52613dd: Verifyi
ng Checksum\n6dc0f52613dd: Download complete\ne3f56cd84e02: Pull complete\n8bfd2fd2949a: Pull complete\n6dc0f52613dd: Pull complete\n9775bf111871: Pull complete\nDigest: sha256:8e28e9024127
ac0f100adbfae0f74311201dfbc8d45968e761cc5af962abf581\nStatus: Downloaded newer image for docker/ucp:latest", "stdout": "\u001b[31mFATA\u001b[0m[0000] Your engine version 1.9.1 is too old.
UCP requires at least version 1.10.0. ", "stdout_lines": ["\u001b[31mFATA\u001b[0m[0000] Your engine version 1.9.1 is too old. UCP requires at least version 1.10.0. "], "warnings": []}
In the logs above, the error is evident: Your engine version 1.9.1 is too old. UCP requires at least version 1.10.0.
TASK [docker : ensure docker is started] *********************************
fatal: [Docker-1-FLM19379EUC]: FAILED! => {changed: false, failed: true, msg: Job for docker.service failed because the control process exited with error code. See systemctl status docker.service and journalctl -xe for details.}
In the output above, the failure reason is not evident in the logs, but there is a hint to run further commands to get the service logs using systemd. Also see the Systemd Service Logs section for more information on these commands.
The following log indicates that the task errored but the error was ignored; note that the status is fatal, but the ...ignoring at the end indicates that the error was ignored.
TASK [docker : restart docker (first time)] ************************************
fatal: [cluster-node2-0]: FAILED! => {"failed": true, "msg": "The conditional check 'thin_provisioned|changed' failed. The error was: |changed expects a dictionary\n\nThe error appears
to have been in '/vagrant/vendor/ansible/roles/docker/tasks/main.yml': line 99, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appe
ars to be:\n\n# of some docker bug I've not investigated.\n- name: restart docker (first time)\n ^ here\n"}
...ignoring
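To locate the first failing task quickly, the log can also be searched for fatal entries together with the TASK header right above them; any hit that is followed by ...ignoring can be skipped. A minimal sketch:
[vagrant@cluster-node1 ~]$ clusterctl job get last | grep -B 1 -A 1 'fatal:'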
All the infrastructure services are started as systemd units. When a service fails or is not behaving as desired, collecting the output of the following two commands is useful:
systemctl status <service-name>
sudo journalctl -xu <service-name>
The <service-name> is the name of a systemd unit. It is usually the same as the name of the daemon, e.g. docker, netplugin, netmaster, volplugin, volmaster and clusterm, to name a few.
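For example, to investigate the docker failure shown in the Parsing Ansible Logs section above:
systemctl status docker
sudo journalctl -xu docker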