Could not check lock ownership. Error: Cannot send after transport endpoint shutdown. #642
Comments
Could you attach the tcmu-runner.log?
Yes. The archive contains logs from both gateways.
Checked the log; I didn't find any issue in tcmu-runner. The error logs show that it was trying to check the lock state, but the ceph cluster just returned the ESHUTDOWN errno, and tcmu-runner translates that to NOT READY in the SCSI protocol. We can see that the ESXi kept retrying again and again, which is why you saw it hang. Have you checked the ceph logs? Is there any suspicious failure?
I haven't found anything unusual in the logs yet, maybe I missed it.
The logs should be in some files like the ones shown below.
If you set the following options in the ceph config, the debug output will be written there (see the sketch below):
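For reference, a minimal sketch of the kind of client-side debug settings being suggested here; the section name and log path are assumptions based on a default Ceph deployment, not quoted from the original comment:

    # /etc/ceph/ceph.conf on the iSCSI gateway (assumed default location)
    [client]
        log file = /var/log/ceph/$cluster-$name.log   # assumed default-style client log path
        debug rados = 20                              # the verbosity level used later in this thread
        debug rbd = 20

After changing these options, tcmu-runner typically has to be restarted (or the rbd images reopened) for the new settings to take effect on the already-open images.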
Here is the log from the gateway, with debug set to 20 for rados and rbd.
@lxbsz do you have any ideas?
@serjponomarev Checked the logs just now; they are full of entries like:
[Edited] This is a connection issue with the OSD nodes; yeah, the client was blocklisted. Could you attach the OSD/MON related logs?
We need to know what caused the client to get blocklisted.
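If it helps, a quick way to look for the relevant entries on the MON/OSD nodes; the log paths are assumptions based on a default Ceph install, not something given in this thread:

    # On the MON and OSD nodes (assumed default log locations)
    grep -i blacklist /var/log/ceph/ceph-mon.*.log
    grep -i blacklist /var/log/ceph/ceph-osd.*.log
    # Nautilus (14.x) uses the "blacklist" wording; newer releases say "blocklist"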
I think there are two reasons:
Error from the rbd-target-api log:
All packages are installed from download.ceph.com.
Is it the ceph-iscsi or the tcmu-runner related threads that cause the panic for you?
Yeah, it will. If there is no response from the client side, the OSD daemon will blacklist it.
After you restart tcmu-runner or reopen the rbd images in tcmu-runner, it should be assigned a new nonce, which is a random number, so the blacklist shouldn't block the newly opened image; there is only one corner case where the nonces happen to be the same.
After an rbd image has been reopened, the previous stale blacklist entry no longer matters. The ceph-iscsi service will still try to remove stale blacklist entries to reduce the resources they occupy in the osdmap, because if there are thousands of entries, broadcasting the osdmap to clients will overload the network, etc. But even if that cleanup fails, as I mentioned above, it shouldn't block your newly opened images, so I do not think this is the root cause. Usually there should be logs in the OSD/MON log files about why the client gets blocklisted.
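For completeness, a sketch of how the blacklist entries described above can be inspected and removed by hand; the command names match the Nautilus-era (14.x) CLI used in this thread, whereas newer releases use "blocklist" instead:

    # List the current blacklist entries (addr:port/nonce and expiry time)
    ceph osd blacklist ls
    # Remove a stale entry manually, if needed
    ceph osd blacklist rm <addr:port/nonce>

Normally ceph-iscsi handles this cleanup itself, as described above; removing an entry by hand is only a workaround while investigating why the client was blocklisted in the first place.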
Hi, is this problem solved now?
I have 4 ESXi hosts that are connected to a Ceph cluster through two gateways.
Sometimes, during a storage rescan, tcmu-runner logs an error:
2020-10-19 16:07:37.963 2988 [ERROR] tcmu_rbd_has_lock:516 rbd/rbd.storage-01: Could not check lock ownership. Error: Cannot send after transport endpoint shutdown.
After that, some hosts freeze on timeout due to storage unavailability.
What could be the problem?
CentOS 8.2
kernel-core-4.18.0-193.19.1.el8_2.x86_64
ceph-14.2.11-0.el8.x86_64
tcmu-runner-1.5.2-1.el8.x86_64
ceph-iscsi-3.4-1.el8.noarch