---
title: Troubleshooting
description: Troubleshooting errors reported by Storidge CIO software
lang: en-US
---
Troubleshooting

Where is the report from the cioctl report command?

The output of the cioctl report command is in the /var/lib/storidge directory.

Please forward the report to [email protected] with details of the error you are troubleshooting.
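
For example, a typical sequence looks like the sketch below (the report file name can vary by release; report.txz is the name referenced elsewhere in this guide):

cioctl report                           # collect configuration info and logs from the cluster
ls -lh /var/lib/storidge/report.txz     # verify the report file was generated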

Insufficient cluster capacity available to create this vdisk

Error message: "Fail: Add vd: Insufficient cluster capacity available to create this vdisk. Use smaller size"

If you are running a Storidge cluster on virtual servers or VMs, this error comes from a data collection process that creates twenty volumes and runs fio to collect performance data for Storidge's QoS feature.

The Storidge software will normally only run the data collection on physical servers. However the data collection can be started on virtual servers or VMs that are not on the supported list.

Please run the cioctl report command and forward the report in /var/lib/storidge directory to [email protected]. The report command will collect configuration information and logs including information on the virtual server. When forwarding the report, please make a request to add the virtual server to the supported list.

cio node ls shows node in maintenance mode and missing the node name. How do I recover the node?

This situation is likely the result of a node being cordoned or shut down for maintenance, and the cluster then being rebooted or power cycled.

Once the cluster is rebooted, the node that was previously in maintenance mode will remain in maintenance mode. The output of the cio node ls command may look something like this:

root@u1:~# cio node ls
NODENAME             IP                NODE_ID    ROLE       STATUS      VERSION
                     192.168.3.95      d12a81bd   sds        maintenance
u3                   192.168.3.29      7517e436   backup1    normal      V1.0.0-2986
u4                   192.168.3.91      91a78c14   backup2    normal      V1.0.0-2986
u1                   192.168.3.165     a11314f0   storage    normal      V1.0.0-2986
u5                   192.168.3.160     888a7dd3   storage    normal      V1.0.0-2986

To restore the cordoned node, you can:

  1. Login to the cordoned node and run cioctl node uncordon to rejoin the node to the cluster

  2. Uncordon the node from any other node by running cioctl node uncordon <IP address>; for the output above, that is cioctl node uncordon 192.168.3.95 (see the example after this list). The IP address is used because the Storidge software does not depend on identifiers that can be changed by users, e.g. hostname.

  3. Reset or power cycle the cordoned node and it will automatically rejoin the cluster after rebooting
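
For example, to recover the cordoned node shown above from any other node in the cluster (option 2), and then confirm its status:

cioctl node uncordon 192.168.3.95    # rejoin the node using its IP address
cio node ls                          # the node should now show a normal status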

dockerd: msg="Node 085d698b3d2e/10.0.2.235, added to failed nodes list"

Error message: dockerd: time="2019-10-19T03:18:22.862011422Z" level=info msg="Node 085d698b3d2e/10.0.2.235, added to failed nodes list"

The error message indicates that internode cluster traffic is being interrupted. This could be a result of network interface failure or network bandwidth being saturated with too much incoming data. This will impact the ability of the Storidge cluster to maintain state.

Suggestions are:

  1. Monitor bandwidth usage for each instance to confirm whether network bandwidth is being exhausted. Entries in syslog that indicate nodes added to the failed list, iSCSI connection issues, or missing heartbeats are also indicators of network congestion.

  2. If there is only one network interface per instance, it will be supporting incoming data streams, orchestrator system internode traffic and Storidge data traffic.

For use cases handling a lot of front-end data, consider splitting off the storage traffic to a separate network, e.g. use instances with two network interfaces. Assign one interface to front-end network traffic and the second interface to the storage network.

When creating the Storidge cluster, you can specify which network interface to use with the --ip flag, e.g. run cioctl create --ip 10.0.1.51. When you run the cioctl node join command on the storage nodes, it will suggest an IP address from the same subnet.

  3. Verify if incoming data is going to just one node. Consider approaches such as a load balancer to spread incoming data across multiple nodes.

  4. Calculate the amount of network bandwidth that will be generated by your use case. Verify that the network interface is capable of sustaining the data throughput. For example, a 10GigE interface can sustain about 700MB/s.

  5. In calculations for data throughput, note that every 100MB/s of incoming data uses a multiple of that throughput for replicating data. For 2-copy volumes, 100MB/s is written to the local node and 100MB/s goes through the network interface to other nodes as replicated data, i.e. a 100MB/s incoming data stream results in 200MB/s of used network bandwidth (see the worked example after this list).
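
As a worked example of item 5, here is a minimal sketch that estimates the network bandwidth consumed per node for a given ingest rate and replication level. The ingest rate and copy count below are hypothetical values, not measurements:

# hypothetical values: 100 MB/s of incoming data written to 2-copy volumes
INCOMING_MBS=100
COPIES=2
# the incoming stream plus (COPIES - 1) replica streams traverse the network
NETWORK_MBS=$(( INCOMING_MBS * COPIES ))
echo "Estimated network bandwidth used: ${NETWORK_MBS} MB/s"   # prints 200 MB/s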

Fail: Major version number change, cannot update to 3336

Error message: Fail: Major version number change, cannot update to 3336

This error indicates that a software update on a node is not forward compatible using the cioctl node update command.

This incompatibility is marked by bumping up the major version number of the software release. Possible reasons for update incompatibility can be metadata format changes for major new features, protocol changes for internode communications, etc.

Although a cluster aware update with cioctl node update is not possible in this case, Storidge will provide steps for updating a cluster to the latest software release whenever possible. Please check the release notes at docs.storidge.com or contact [email protected].

dockerd: dockerd: level=warning msg="failed to create proxy for port 9999: listen tcp :9999: bind: address already in use"

Error message: dockerd: time="2019-10-10T17:35:59.961861284Z" level=warning msg="failed to create proxy for port 9999: listen tcp :9999: bind: address already in use"

The error message indicates a network port conflict between services. The example above indicates that port number 9999 is being used by more than one service on the node.

Verify there are no conflicts with the port numbers used by the Storidge cluster.

"iscsid: Kernel reported iSCSI connection 2:0 error"

Error message: iscsid: Kernel reported iSCSI connection 2:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed)

The error message indicates an iSCSI connectivity issue between cluster nodes. This could be the result of conflicts such as duplicate iSCSI initiator names, or of other networking issues.

For a multi-node cluster to function correctly, the iSCSI initiator name on each node must be unique. Display the iSCSI initiator name on each node by running cat /etc/iscsi/initiatorname.iscsi, and confirm they are different.

If the iSCSI initiator name is not unique, you can change it with:

echo "InitiatorName=`/sbin/iscsi-iname`" > /etc/iscsi/initiatorname.iscsi

Since the iSCSI initiator name is used to set up connections to iSCSI targets during cluster initialization, it must be made unique before running cioctl create to start a cluster.
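
A minimal sketch for comparing initiator names across nodes before running cioctl create, assuming passwordless ssh and using hypothetical node addresses:

# print each node's iSCSI initiator name; all values must be unique
for node in 192.168.3.95 192.168.3.29 192.168.3.91; do
  ssh root@$node "hostname; cat /etc/iscsi/initiatorname.iscsi"
done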

"Fail: node is already a member of a multi-node cluster"

Error message: Fail: node is already a member of a multi-node cluster

This error message in syslog indicates an attempt to add a node that is already a member of the cluster. Check your script or playbook to verify that the cioctl join command is being issued to a storage (worker) node and not the primary (sds) node.

This error can result in the related messages below, which indicate that the Storidge CIO kernel modules were incorrectly unloaded, breaking cluster initialization.

[DFS] dfs_exit:18218:dfs module unloaded

[VD ] vdisk_exit:2916:vd module unloaded

Get http://172.23.8.104:8282/metrics: dial tcp 172.23.8.104:8282: connect: connection refused

Error message: connect: connection refused

Getting a "Connection refused" errors on requests to an API endpoint likely means that the API server on the node is not running.

Run ps aux | grep cio-api to confirm. If cio-api is not listed, run cio-api & on the node to restart the API.
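
For example, a quick check-and-restart sequence (the metrics port 8282 and IP match the error above; adjust the IP for your node):

ps aux | grep [c]io-api                   # is the API server running?
cio-api &                                 # restart it if not listed
curl http://172.23.8.104:8282/metrics     # confirm the endpoint now responds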

Also run cioctl report to generate a cluster report which will be saved to file /var/lib/storidge/report.txz. Please forward the cluster report to [email protected] with details of the error for analysis.

"cluster: Number of remote drives on host 0 (IP 10.11.14.87) 0 is not the expected 9"

Error message: Number of remote drives on host 0 (IP 10.11.14.87) 0 is not the expected 9

This error during cluster initialization can result from a bad configuration in the iscsid.service file.

Example error from cluster initialization:

[root@EV15-HA1 ~]#  cioctl init 67a2c2e8
Warning: Permanently added '10.11.14.90' (ECDSA) to the list of known hosts.
cluster: initialization started
cluster: Copy auto-multiNode-EV15-HA1.cfg to all nodes (NODE_NUMS:4)
cluster: Initialize target
cluster: Initialize initiator
cluster: Start node initialization
node: Clear drives
node: Load module
node: Add node backup relationship
node: Check drives
Adding disk /dev/sdb SSD to storage pool
Adding disk /dev/sdc SSD to storage pool
Adding disk /dev/sdd SSD to storage pool
Adding disk /dev/sde SSD to storage pool
Adding disk /dev/sdf SSD to storage pool
Adding disk /dev/sdg SSD to storage pool
Adding disk /dev/sdh SSD to storage pool
Adding disk /dev/sdi SSD to storage pool
Adding disk /dev/sdj SSD to storage pool
Adding disk /dev/sdk SSD to storage pool
Adding disk /dev/sdl SSD to storage pool
Adding disk /dev/sdm SSD to storage pool
node: Collect drive IOPS and BW: Total IOPS:26899  Total BW:1479.9MB/s
node: Initializing metadata
cluster: Node initialization completed
cluster: Number of remote drives on host 0 (IP 10.11.14.87) 0 is not the expected 9
cluster: Number of remote drives on host 1 (IP 10.11.14.88) 0 is not the expected 9
cluster: Number of remote drives on host 2 (IP 10.11.14.89) 0 is not the expected 9
cluster: Number of remote drives on host 3 (IP 10.11.14.90) 0 is not the expected 9
cluster: Cannot initialize cluster
cluster: 'cioctl clusterdeinit default.cfg 1' started
cluster: Killing MongoDB daemons
cluster: Killing cio daemons
cluster: Uninitialize initiator
cluster: Uninitialize target

Try reinstalling the Storidge software to correct the issue. For example, on release 3249 on Ubuntu 18.04, run:

cd /var/lib/storidge/cio-3249-u18.amd64
./install

Clean the node before running initialization again:

cioctl node clean --force

"cioctl: insmod: ERROR: could not insert module"

Error message: Feb 4 16:48:02 EV15-HA1 cioctl: insmod: ERROR: could not insert module /lib/modules/3.10.0-1062.el7.x86_64/kernel/drivers/storidge/vd.ko: File exists

If you are running with VMs in a vSphere environment, the error message above means that secure boot is enabled for the VM. Since the Storidge software will insert a kernel module, secure boot needs to be disabled.

To turn secure boot off, the VM must first be powered off. Then right-click the VM, and select Edit Settings. Click the VM Options tab, and expand Boot Options. Under Boot Options, ensure that firmware is set to EFI.

Deselect the Secure Boot check box to disable secure boot. Click OK.

"Unable to connect to the Docker environment" from Portainer UI

Error message: Unable to connect to the Docker environment

The Portainer service connects to Docker through a unix socket on the local node. This error indicates Portainer is not able to talk to the Docker API.

Remove the existing Portainer service with docker stack remove portainer. Then redeploy the Portainer service and agents with docker stack deploy -c /etc/storidge/config/portainer.yml portainer
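
For example, the remove-and-redeploy sequence with a quick check that the service tasks come back up:

docker stack remove portainer
docker stack deploy -c /etc/storidge/config/portainer.yml portainer
docker service ls | grep portainer      # replicas should show as running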

Configuration Error: Cannot determine drive count on node at 10.11.14.87. Verify data drives have no filesystem or partitions

Error message: Configuration Error: Could not determine drive count on node 10.11.14.87 at 10.11.14.87. Verify data drives available

This error message while initializing a cluster indicates that although drives are available on the nodes, they may be formatted with a filesystem or have partitions. The Storidge software will not add these drives to the storage pool since they may contain user data.

Use the file -sL <device> command to check. For example, drives sdb, sdc and sdd below can be discovered and consumed by Storidge. However, drive sda will be skipped.

root@ubuntu-16:~# file -sL /dev/sd*
/dev/sda:  DOS/MBR boot sector
/dev/sda1: Linux rev 1.0 ext2 filesystem data (mounted or unclean), UUID=f838091f-e90f-4037-8352-4d7d2775667a (large files)
/dev/sda2: DOS/MBR boot sector; partition 1 : ID=0x8e, start-CHS (0x5d,113,21), end-CHS (0x3ff,254,63), startsector 2, 40439808 sectors, extended partition table (last)
/dev/sda5: LVM2 PV (Linux Logical Volume Manager), UUID: Tx8zdm-LIyl-Am4b-0Bbu-iJxv-yKsI-IR9NtO, size: 20705181696
/dev/sdb:  data
/dev/sdc:  data
/dev/sdd:  data

Use dd to wipe out metadata and make the drive available for Storidge. For example, to clear drive /dev/sdb:

dd if=/dev/zero of=/dev/sdb bs=1M count=300
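
If several data drives need to be cleared, a small sketch like the one below can be used. The device names are examples only; double-check them before wiping, as dd is destructive:

# wipe the first 300MB of each listed drive, then re-check with file -sL
for dev in /dev/sdb /dev/sdc /dev/sdd; do
  dd if=/dev/zero of=$dev bs=1M count=300
  file -sL $dev        # should now report "data"
done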

Cluster breaks on VMware vSphere snapshot with error "[SDS] node_mgmt:14380:WARNING: node pingable: node[0].node_id:ab7dc460 [172.164.2.21] last_alive_sec"

Error message: [SDS] node_mgmt:14380:WARNING: node pingable: node[0].node_id:ab7dc460 [172.164.2.21] last_alive_sec

When you take a snapshot of a vSphere virtual machine with memory, e.g. for Veeam backup:

  • The virtual machine becomes unresponsive or inactive
  • The virtual machine does not respond to any commands
  • You cannot ping the virtual machine

This is expected behavior in ESXi. Before the backup, a snapshot is taken; the backup job then runs, and finally the snapshot is removed. This causes the VM to lose connectivity for a period that depends on the amount of memory and changed data.

Storidge uses heartbeats to monitor the health of cluster nodes. When a VM does not respond for an extended time, it is marked as a failed node. Losing access to multiple nodes can potentially break a cluster. It is not recommended to use VMware snapshots for backup of Storidge nodes.

Storidge will be introducing a backup service for cluster workloads. This will be based on volume snapshots, i.e. the backups are at the granularity of a container and do not require a node to be suspended.

Bad iscsid.service configuration /lib/systemd/system/iscsid.service:5: Missing '='

Error message: /lib/systemd/system/iscsid.service:5: Missing '='

This error indicates a bad configuration setting in the iscsi daemon service file located at /lib/systemd/system/iscsid.service.

On Ubuntu, a correct service file will show:

[Unit]
Description=iSCSI initiator daemon (iscsid)
Documentation=man:iscsid(8)
Wants=network-online.target remote-fs-pre.target
Before=cio.service docker.service remote-fs-pre.target
After=network.target network-online.target
DefaultDependencies=no
Conflicts=shutdown.target
Before=shutdown.target
ConditionVirtualization=!private-users
[Service]
Type=forking
PIDFile=/run/iscsid.pid
ExecStartPre=/lib/open-iscsi/startup-checks.sh
ExecStart=/sbin/iscsid
[Install]
WantedBy=sysinit.target

On CentOS, a correct service file will show:

[Unit]
Description=Login and scanning of iSCSI devices
Documentation=man:iscsiadm(8) man:iscsid(8)
DefaultDependencies=no
Before=remote-fs-pre.target
After=network.target network-online.target iscsid.service iscsiuio.service systemd-remount-fs.service
Wants=remote-fs-pre.target iscsi-shutdown.service
ConditionDirectoryNotEmpty=/var/lib/iscsi/nodes

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=-/sbin/iscsiadm -m node --loginall=automatic
ExecReload=-/sbin/iscsiadm -m node --loginall=automatic
SuccessExitStatus=21

[Install]
WantedBy=remote-fs.target
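
To locate the malformed line before reinstalling, you can ask systemd to lint the unit file (a quick check, assuming systemd-analyze is available on the node):

systemd-analyze verify /lib/systemd/system/iscsid.service   # should flag the malformed line
systemctl daemon-reload                                     # reload units after the file is corrected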

Try reinstalling the Storidge software to correct the issue. For example, on release 3249 on Ubuntu 18.04, run:

cd /var/lib/storidge/cio-3249-u18.amd64
./install

Clean the node before running initialization again:

cioctl node clean --force

Bad iscsid.service configuration iscsid.socket: Socket service iscsid.service not loaded, refusing.

Error message: iscsid.socket: Socket service iscsid.service not loaded, refusing.

This error can result from a bad configuration setting in the iscsi daemon service file located at /lib/systemd/system/iscsid.service.

Try reinstalling the Storidge software to correct the issue. For example, on release 3249 on Ubuntu 18.04, run:

cd /var/lib/storidge/cio-3249-u18.amd64
./install

Clean the node before running initialization again:

cioctl node clean --force

Fail: Problem with key exchange

Error message: Fail: Problem with key exchange (exit code 2: request keys: Post "http://192.168.1.201:16994/join": dial tcp 192.168.1.201:16994: i/o timeout)

If you are seeing this error message when joining a node to a cluster, it means connectivity between this node and the primary or sds node is blocked. Example:

root@u2:~#     cioctl join 192.168.1.201 b305b3f15fa2829b45a6faa7328afaa2-3d4f9324 --ip 192.168.1.202
Fail: Problem with key exchange (exit code 2: request keys: Post "http://192.168.1.201:16994/join": dial tcp 192.168.1.201:16994: i/o timeout)

Check your firewall settings. The Storidge cluster requires a number of open ports for communication between nodes. See this link for a list of ports.
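
To confirm the join port is reachable from the joining node, you can probe it directly (a quick check, assuming netcat is installed; 16994 is the port from the error above):

nc -zv 192.168.1.201 16994    # "succeeded" means the port is open; a timeout points to a firewall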

If you have ufw enabled on Ubuntu, you can add the required ports with:

sudo ufw allow 3260/tcp
sudo ufw allow 8282/tcp
sudo ufw allow 8383/tcp
sudo ufw allow 16990/tcp
sudo ufw allow 16995/tcp
sudo ufw allow 16996/tcp
sudo ufw allow 16997/tcp
sudo ufw allow 16998/tcp
sudo ufw allow 16999/tcp

Check status of the ports with sudo ufw status.

If you are operating a Docker Swarm cluster for your applications, note that the following ports must be open:

  • TCP port 2377 for cluster management communications
  • TCP and UDP port 7946 for communication among nodes
  • UDP port 4789 for overlay network traffic

To add these ports to ufw, run:

sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp

Fail: The swarm does not have a leader. It's possible that too few managers are online

Error message: Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.

If you've installed and deployed a Storidge cluster but hit the error message above, it's likely that the Docker Swarm cluster was not configured. The reason could be a firewall blocking inter-node communications.

On Ubuntu, you can check ufw status with sudo ufw status.
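
You can also confirm whether the node ever formed or joined a Swarm, and whether a manager is reachable, with standard Docker CLI commands:

docker info | grep -i swarm     # shows whether Swarm is active or inactive on this node
docker node ls                  # on a manager, lists nodes; fails if no leader is reachable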

To operate a Swarm cluster for your applications, note that the following ports must be open:

  • TCP port 2377 for cluster management communications
  • TCP and UDP port 7946 for communication among nodes
  • UDP port 4789 for overlay network traffic

To add these ports to ufw, run:

sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp

BTRFS info (device vdisk/vd1): forced readonly

Error message: Oct 16 09:40:27 manager-0 kernel: [ 4410.126209] BTRFS info (device vdisk/vd1): forced readonly

A 'forced readonly' error indicates that the btrfs filesystem ran out of working space. As a result, the volume is put into read-only mode. To recover the volume, older snapshots on the volume can be removed to free some space.

The example below assumes vd1 mounted at /cio/volumes/vd1 is the volume to recover:

  1. Unmount the volume and remount with clear_cache enabled
root@test:/# umount /cio/volumes/vd1
root@test:/# mount -o rw,clear_cache /dev/vdisk/vd1 /cio/volumes/vd1
  2. List the snapshots (subvolumes). Example:
root@test:/# btrfs subvolume list /cio/volumes/vd1
ID 635 gen 40013 top level 5 path .snap/2020-10-16-0255-edc70723-0000010
ID 636 gen 40020 top level 5 path .snap/2020-10-16-0256-edc70723-0000010
ID 638 gen 40268 top level 5 path .snap/2020-10-16-0356-edc70723-0000010
ID 639 gen 40495 top level 5 path .snap/2020-10-16-0456-edc70723-0000010
ID 640 gen 40502 top level 5 path .snap/2020-10-16-0457-edc70723-0000010
ID 641 gen 40743 top level 5 path .snap/2020-10-16-0556-edc70723-0000010
ID 642 gen 40750 top level 5 path .snap/2020-10-16-0557-edc70723-0000010
ID 643 gen 40994 top level 5 path .snap/2020-10-16-0656-edc70723-0000010
ID 644 gen 41001 top level 5 path .snap/2020-10-16-0657-edc70723-0000010
ID 645 gen 41411 top level 5 path .snap/2020-10-16-0930-edc70723-0000010
  3. Delete one or more snapshots (subvolumes) to clear space (see the check after this list to confirm the space was freed):
root@test:/# btrfs subvolume delete /cio/volumes/vd1/.snap/2020-10-16-0255-edc70723-0000010
  4. The filesystem should now be writable again. Expand the volume capacity to add more space. For example, to add 20GB, run:
root@test:/# cio volume update -V 1 -g 20
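
To confirm that deleting snapshots actually freed working space, btrfs can report allocation for the mounted volume (assuming vd1 is still mounted at /cio/volumes/vd1):

root@test:/# btrfs filesystem usage /cio/volumes/vd1
root@test:/# btrfs filesystem df /cio/volumes/vd1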

syntax error near unexpected token `newline'

Error message: /usr/bin/cioupdate: line 330: syntax error near unexpected token `newline'

This message indicates a syntax error in the cioupdate script file used for performing cluster aware updates of Storidge nodes. This error affects the 3411 and 3450 releases.

To perform the node update, the cioupdate script file must first be patched with:

curl -fsSL https://download.storidge.com/pub/ce/update.sh | sudo bash -s

After patching the cioupdate script, each node can then be updated to latest release with:

cioctl node update NODENAME

For more information on the update process, see details at: https://docs.storidge.com/cioctl_cli/node.html#cioctl-node-update

Warning file /etc/storidge/certs/authorized_keys does not exist

Error message: Warning file /etc/storidge/certs/authorized_keys does not exist

This message likely indicates that a node installed with v2.0.0 Storidge software was trying to join a cluster running v1.0.0 software, e.g.:

[root@sdsnode ~]# cioctl node add 10.11.14.94 f1bf0a1dd5e5c2b3d6d09b1ba8fff93c-73db964a --ip 10.11.14.12
Warning file /etc/storidge/certs/authorized_keys does not exist
TCP port 16995 did not open after 20 seconds.
Cannot connect to 10.11.14.94. Retry command or check for connectivity and routing issues

Before a node running v2.0.0 software can be added, the cluster must first be updated to v2.0.0 also. To update the cluster:

  1. Update each node of the cluster to build 3249, which is the last version of v1.0.0. From the sds node, run:
cioctl node update NODENAME --version 3249
  2. After the cluster nodes are updated to 3249, follow the steps in: https://faq.storidge.com/software.html#how-to-update-a-storidge-cluster-from-release-v1-0-0-3249-and-below-to-latest-release to update the cluster to the latest v2.0.0 software.

  3. Create a join-token and add the new node. Ensure that the new node has a Storidge software release that matches the cluster.

This system is not registered with an entitlement server. You can use subscription-manager to register.

Error message: This system is not registered with an entitlement server. You can use subscription-manager to register.

If you are running CentOS 7.9 and see this error, it is likely because the subscription manager setting is enabled. In the file '/etc/yum/pluginconf.d/subscription-manager.conf', change the setting to enabled=0. Then run the installation again.
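
One way to make the change non-interactively (a sketch assuming the stock file layout; back up the file first if unsure):

sed -i 's/^enabled=1/enabled=0/' /etc/yum/pluginconf.d/subscription-manager.conf
grep enabled /etc/yum/pluginconf.d/subscription-manager.conf   # should now show enabled=0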

If you are running RHEL and do have a subscription, please follow the instructions from Red Hat to register with the entitlement server.

Running out of system memory

Error message: Fail: Create vd9: Running out of system memory

The Storidge software issues a warning when system memory is more than 80% full. Above this threshold it will not create new volumes to ensure sufficient system memory for critical system processes and applications.

Run top to confirm the memory usage. The Storidge software has successfully operated in environments with as little as 2GB of system memory. If you have greater memory resources in your system, you may want to check for memory leaks.
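
A quick way to see overall usage and the largest memory consumers (standard Linux tools):

free -h                              # overall memory and swap usage
ps aux --sort=-%mem | head -n 10     # top 10 processes by resident memory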

You can also run cioctl report. This will collect cluster info including system logs into a report.txz file at /var/lib/storidge. Post the report into our cio-user slack channel, and we can look for details on the memory warning.

There are insufficient (1) drives on this node

Error message: There are insufficient (1) drives on this node. Attach more data drives for cio operation

This error indicates there are insufficient data drives, or that the data drives may be too small. You can verify the Storidge software prerequisites here: https://docs.storidge.com/prerequisites/hardware.html

The Storidge software will auto discover and only use drives that are not formatted with a filesystem and have no partitions. This is to avoid accidentally wiping out a drive with data. If you already have one boot drive and 3 data drives, it is possible that one of the data drives has partitions or a filesystem. You can verify with file -sL /dev/sd*
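
In addition to file -sL, lsblk gives a quick overview of which drives already carry partitions or filesystems (a standard utility; a drive with no child partitions and an empty FSTYPE column is a candidate for Storidge to consume):

lsblk -f     # lists each block device with its partitions and any detected filesystem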