
Could not check lock ownership. Error: Cannot send after transport endpoint shutdown. #642

Open
serjponomarev opened this issue Oct 19, 2020 · 12 comments

Comments

serjponomarev commented Oct 19, 2020

I have 4 ESXi hosts that are connected to a Ceph cluster through two gateways.
Sometimes, during a storage rescan, tcmu-runner logs this error:

2020-10-19 16:07:37.963 2988 [ERROR] tcmu_rbd_has_lock:516 rbd/rbd.storage-01: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.

After that, some hosts freeze on timeout due to storage unavailability.

What could be the problem?

Centos 8.2
kernel-core-4.18.0-193.19.1.el8_2.x86_64
ceph-14.2.11-0.el8.x86_64
tcmu-runner-1.5.2-1.el8.x86_64
ceph-iscsi-3.4-1.el8.noarch

lxbsz (Collaborator) commented Oct 20, 2020

Could you attach the tcmu-runner.log ?

serjponomarev (Author)

Yes. The archive contains logs from both gateways
gw_logs.zip

lxbsz (Collaborator) commented Oct 22, 2020

Yes. The archive contains logs from both gateways
gw_logs.zip

Checked the log, didn't find any issue in tcmu-runner.

The error log shows it was trying to check the lock state, but the Ceph cluster returned the ESHUTDOWN errno; tcmu-runner translates that into a NOT READY status for the SCSI protocol, so ESXi kept retrying again and again, which is why you saw it hang.
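As a rough illustration of that flow (a hedged sketch of the idea only, not tcmu-runner's actual code): a transport-level errno such as ESHUTDOWN, which librados returns once the client has lost its cluster connection, is surfaced to the initiator as a retryable "not ready" condition, so the ESXi host keeps retrying rather than failing the path outright.

import errno

# Sketch only, not tcmu-runner source: map a transport-level errno from
# librados/librbd to the kind of status the iSCSI initiator ends up seeing.
def status_for_errno(err):
    if err == errno.ESHUTDOWN:          # 108: cluster connection is gone
        return "NOT READY (retryable)"  # the initiator keeps retrying
    return "GOOD" if err == 0 else "other failure"

print(status_for_errno(errno.ESHUTDOWN))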

Have you checked the Ceph logs? Is there any suspicious failure?

serjponomarev (Author)

I haven't found anything unusual in the logs yet, maybe I missed it.
What would this error look like, and in which service's logs (mon, osd, mgr) should I look for it?

lxbsz (Collaborator) commented Oct 26, 2020

I haven't found anything unusual in the logs yet, maybe I missed it.
What would this error look like, and in which service's logs (mon, osd, mgr) should I look for it?

The client logs should end up in files like:

/var/log/ceph/client.admin.1462.log

To enable them, set the following options in the /etc/ceph/ceph.conf file:

[client]
log file = /var/log/ceph/$name.$pid.log
debug rbd = 20
debug rados = 20
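If it helps to confirm which settings the client side actually resolves, here is a small sketch (assuming the python3-rados bindings are installed; the option names are just the underscore forms of the ones above):

import rados  # python3-rados bindings

# Sketch: print the log/debug settings librados resolves from ceph.conf.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
try:
    cluster.connect()
    for opt in ('log_file', 'debug_rbd', 'debug_rados'):
        print(opt, '=', cluster.conf_get(opt))
finally:
    cluster.shutdown()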

serjponomarev (Author)

Here is the log from the gateway, with debug set to 20 for rados and rbd:
client.admin.1455.zip

serjponomarev (Author)

@lxbsz do you have any ideas?

lxbsz (Collaborator) commented Nov 3, 2020

@lxbsz do you have any ideas?

@serjponomarev
Sorry for the late reply, I was busy with other tasks.

I checked the logs just now; they are full of entries like:

  9519 2020-10-29 02:21:05.581 7f40c7705700 10 librados: Objecter returned from call r=-108
  9520 2020-10-29 02:21:05.581 7f40c7705700  5 librbd::ManagedLock: 0x67d2000 assert_header_locked: client is not lock owner -- client blacklisted
  9521 2020-10-29 02:21:05.581 7f4102739700 20 librados: complete_aio_write 0x9ff7080
  9522 2020-10-29 02:21:05.581 7f40fcf2e700 10 librbd::managed_lock::ReleaseRequest: 0x9d040c0 handle_unlock: r=-108
  9523 2020-10-29 02:21:05.581 7f40fcf2e700 -1 librbd::managed_lock::ReleaseRequest: 0x9d040c0 handle_unlock: failed to unlock: (108) Cannot send after transport endpoint shutdown

[Edited] This is a connection issue with the OSD nodes; yes, the client was blocklisted.

Could you attach the OSD/Mon related logs? We need to know what caused the client to get blocklisted.


serjponomarev (Author) commented Nov 3, 2020

I think the reason is a combination of things:

  1. Sometimes tcmu-runner "crashes", after which the host reboots automatically (kernel.panic = 5 and kernel.hung_task_panic = 1 are enabled).
  2. Perhaps after that, the clients of this host end up on the blacklist (I'm not sure, these are just my thoughts).
  3. After the host reboots, the rbd-target-api service does not correctly process the blacklist on startup.

Error from the rbd-target-api log:

Nov 03 21:05:02 ceph-igw-02 systemd[1]: Started Ceph iscsi target configuration API.
Nov 03 21:05:02 ceph-igw-02 rbd-target-api[1701]: Started the configuration object watcher
Nov 03 21:05:02 ceph-igw-02 rbd-target-api[1701]: Checking for config object changes every 1s
Nov 03 21:05:02 ceph-igw-02 rbd-target-api[1701]: Processing osd blacklist entries for this node
Nov 03 21:05:03 ceph-igw-02 rbd-target-api[1701]: Removing blacklisted entry for this host : 10.255.255.102:0/2051314334
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]: Traceback (most recent call last):
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:   File "/usr/bin/rbd-target-api", line 2951, in <module>
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:     main()
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:   File "/usr/bin/rbd-target-api", line 2861, in main
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:     osd_state_ok = ceph_gw.osd_blacklist_cleanup()
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:   File "/usr/lib/python3.6/site-packages/ceph_iscsi_config/gateway.py", line 110, in osd_blacklist_cleanup
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:     rm_ok = self.ceph_rm_blacklist(blacklist_entry.split(' ')[0])
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:   File "/usr/lib/python3.6/site-packages/ceph_iscsi_config/gateway.py", line 46, in ceph_rm_blacklist
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]:     if ("un-blacklisting" in result) or ("isn't blacklisted" in result):
Nov 03 21:05:04 ceph-igw-02 rbd-target-api[1701]: TypeError: a bytes-like object is required, not 'str'
Nov 03 21:05:04 ceph-igw-02 systemd[1]: rbd-target-api.service: Main process exited, code=exited, status=1/FAILURE
Nov 03 21:05:04 ceph-igw-02 systemd[1]: rbd-target-api.service: Failed with result 'exit-code'.
Nov 03 21:05:04 ceph-igw-02 systemd[1]: rbd-target-api.service: Service RestartSec=100ms expired, scheduling restart.
Nov 03 21:05:04 ceph-igw-02 systemd[1]: rbd-target-api.service: Scheduled restart job, restart counter is at 3.
Nov 03 21:05:04 ceph-igw-02 systemd[1]: Stopped Ceph iscsi target configuration API.

All packages are installed from download.ceph.com
@lxbsz What do you think? Could this be the reason?

lxbsz (Collaborator) commented Nov 3, 2020

I think the reason is a combination of things:

  1. Sometimes tcmu-runner "crashes", after which the host reboots automatically (kernel.panic = 5 and kernel.hung_task_panic = 1 are enabled).

Are the ceph-iscsi or the tcmu-runner related threads causing the panic for you?

  2. Perhaps after that, the clients of this host end up on the blacklist (I'm not sure, these are just my thoughts).

Yeah, it will. If there is no response from the client side, the OSD daemon will blacklist it.

  3. After the host reboots, the rbd-target-api service does not correctly process the blacklist on startup.

After you restart tcmu-runner or reopen the rbd images in it, the client is assigned a new nonce, which is a random number, so the stale entry shouldn't block the newly opened image; the only exception is the unlikely case where the nonces happen to be the same.
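Just to illustrate what distinguishes the old and new client instances: a blacklist entry has the form ip:port/nonce, as in the 10.255.255.102:0/2051314334 entry from your rbd-target-api log. A tiny parsing sketch:

# Illustration only: the nonce is the random per-instance number, so a
# restarted client will not match a stale blacklist entry for the old nonce.
def parse_blacklist_entry(entry):
    addr, nonce = entry.rsplit('/', 1)
    ip, port = addr.rsplit(':', 1)
    return ip, int(port), int(nonce)

print(parse_blacklist_entry("10.255.255.102:0/2051314334"))
# ('10.255.255.102', 0, 2051314334)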


Once an rbd image has been reopened, the previous stale blacklist entry no longer matters. The ceph-iscsi service still tries to remove stale blacklist entries so that they don't pile up in the osdmap; if thousands of entries accumulate, broadcasting the osdmap to clients can overload the network.

Even though that cleanup failed here with the TypeError above, as I mentioned it shouldn't block your newly opened images.
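For what it's worth, that TypeError is a bytes-vs-str mismatch: the command output comes back as bytes, while the substring checks in ceph_rm_blacklist use str. A minimal sketch of the kind of decode guard that avoids it (a hypothetical helper, not the exact ceph_iscsi_config code):

import subprocess

# Hypothetical sketch, not the real ceph_rm_blacklist(): run the blacklist
# removal command and inspect its output, decoding bytes to str first so
# the "in" membership tests don't raise TypeError.
def rm_blacklist_entry(entry):
    result = subprocess.run(['ceph', 'osd', 'blacklist', 'rm', entry],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = result.stdout.decode('utf-8', errors='replace')
    return ("un-blacklisting" in output) or ("isn't blacklisted" in output)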

I do not think this is the root cause.

Usually the OSD/Mon log files will contain messages explaining why the client got blocklisted.


chengzhidada commented:

Hi, is this problem solved now?
