tcmu-runner thread block #677

Open
zhangdaolong opened this issue Nov 8, 2021 · 13 comments
@zhangdaolong

zhangdaolong commented Nov 8, 2021

If the client formats the disk, it gets stuck after a period of time. Below are the log information and the core-file analysis.
tcmu-runner version: tcmu-runner-1.4.0.0
tcmu-runner.log

tcmu_rbd_handle_timedout_cmd:992 rbd/test.volume: Timing out cmd.

gcore the tcmu-runner process

gdb /usr/bin/tcmu-runner ./core.24117

[New LWP 24118]
[New LWP 24119]
[New LWP 24122]
[New LWP 24123]
[New LWP 24124]
[New LWP 24125]
[New LWP 24130]
[New LWP 24131]
[New LWP 24132]
[New LWP 24133]
[New LWP 24134]
[New LWP 24135]
[New LWP 24136]
[New LWP 24137]
[New LWP 24138]
[New LWP 24139]
[New LWP 24140]
[New LWP 24141]
[New LWP 24142]
[New LWP 24143]
[New LWP 24144]
[New LWP 24146]
[New LWP 24117]
(gdb) info threads
Id Target Id Frame
23 Thread 0x7fd4f222a880 (LWP 24117) 0x00007fd4f07c4bed in poll () from /lib64/libc.so.6
22 Thread 0x7fd4a67fc700 (LWP 24146) 0x00007fd4f07c4bed in poll () from /lib64/libc.so.6
21 Thread 0x7fd4a77fe700 (LWP 24144) 0x00007fd4f07c4bed in poll () from /lib64/libc.so.6
20 Thread 0x7fd4a7fff700 (LWP 24143) 0x00007fd4f07c4caf in ppoll () from /lib64/libc.so.6
19 Thread 0x7fd4c4b8b700 (LWP 24142) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
18 Thread 0x7fd4c538c700 (LWP 24141) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
17 Thread 0x7fd4c5b8d700 (LWP 24140) 0x00007fd4f12ebda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
16 Thread 0x7fd4c638e700 (LWP 24139) 0x00007fd4f12ebda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
15 Thread 0x7fd4c6b8f700 (LWP 24138) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
14 Thread 0x7fd4c7390700 (LWP 24137) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
13 Thread 0x7fd4c8b8b700 (LWP 24136) 0x00007fd4f12ebda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
12 Thread 0x7fd4c938c700 (LWP 24135) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
11 Thread 0x7fd4c9b8d700 (LWP 24134) 0x00007fd4f12ebda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
10 Thread 0x7fd4ca38e700 (LWP 24133) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
9 Thread 0x7fd4cab8f700 (LWP 24132) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
8 Thread 0x7fd4cb390700 (LWP 24131) 0x00007fd4f12ebda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
7 Thread 0x7fd4cbb91700 (LWP 24130) 0x00007fd4f12ebda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x7fd4c7b91700 (LWP 24125) 0x00007fd4f07cfe63 in epoll_wait () from /lib64/libc.so.6
5 Thread 0x7fd4cc392700 (LWP 24124) 0x00007fd4f07cfe63 in epoll_wait () from /lib64/libc.so.6
4 Thread 0x7fd4ccb93700 (LWP 24123) 0x00007fd4f07cfe63 in epoll_wait () from /lib64/libc.so.6
3 Thread 0x7fd4cdb15700 (LWP 24122) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2 Thread 0x7fd4edc44700 (LWP 24119) 0x00007fd4f12ee71d in read () from /lib64/libpthread.so.0

1 Thread 0x7fd4ee445700 (LWP 24118) 0x00007fd4f12eb9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
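
The backtraces for all of these threads can also be dumped from the core inside gdb:

(gdb) thread apply all bt

The pstack output below shows the same per-thread stacks.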

pstack trace

Thread 23 (Thread 0x7f1a4044c700 (LWP 3084529)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a43e1e412 in log_thread_start () from /lib64/libtcmu.so.2
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x7f1a3fc4b700 (LWP 3084530)):
#0 0x00007f1a432f571d in read () from /lib64/libpthread.so.0
#1 0x00007f1a43e1f62e in dyn_config_start () from /lib64/libtcmu.so.2
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x7f1a35d62700 (LWP 3084531)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a2760ee3d in ceph::logging::Log::entry() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x7f1a34de0700 (LWP 3084532)):
#0 0x00007f1a427d6e63 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f1a279b5f49 in EpollDriver::event_wait(std::vector<FiredFileEvent, std::allocator<FiredFileEvent> >&, timeval*) () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a2776acf7 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >) () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a2776e1ec in std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/lib64/ceph/libceph-common.so.0
#4 0x00007f1a27a5f94f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0
#5 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x7f1a1eb7d700 (LWP 3084533)):
#0 0x00007f1a427d6e63 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f1a279b5f49 in EpollDriver::event_wait(std::vector<FiredFileEvent, std::allocator<FiredFileEvent> >&, timeval*) () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a2776acf7 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >) () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a2776e1ec in std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/lib64/ceph/libceph-common.so.0
#4 0x00007f1a27a5f94f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0
#5 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x7f1a1e37c700 (LWP 3084534)):
#0 0x00007f1a427d6e63 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f1a279b5f49 in EpollDriver::event_wait(std::vector<FiredFileEvent, std::allocator<FiredFileEvent> >&, timeval*) () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a2776acf7 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >) () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a2776e1ec in std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/lib64/ceph/libceph-common.so.0
#4 0x00007f1a27a5f94f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0
#5 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7f1a1db7b700 (LWP 3084539)):
#0 0x00007f1a432f2da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a2782f190 in CephContextServiceThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7f1a1d37a700 (LWP 3084540)):
#0 0x00007f1a432f2da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a3e1e299f in ceph::timer_detail::timer<ceph::time_detail::coarse_mono_clock>::timer_thread() () from /lib64/librados.so.2
#2 0x00007f1a27a5f94f in execute_native_thread_routine () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7f1a1cb79700 (LWP 3084541)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a276a903c in DispatchQueue::entry() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a2774753d in DispatchQueue::DispatchThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7f1a17fff700 (LWP 3084542)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a276ab120 in DispatchQueue::run_local_delivery() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a2774764d in DispatchQueue::LocalDeliveryThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7f1a177fe700 (LWP 3084543)):
#0 0x00007f1a432f2da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275e9f05 in SafeTimer::timer_thread() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a275eb12d in SafeTimerThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7f1a16ffd700 (LWP 3084544)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275eb975 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f1a167fc700 (LWP 3084545)):
#0 0x00007f1a432f2da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275e9f05 in SafeTimer::timer_thread() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a275eb12d in SafeTimerThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f1a15ffb700 (LWP 3084546)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275e9cc9 in SafeTimer::timer_thread() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a275eb12d in SafeTimerThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f1a157fa700 (LWP 3084547)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275eb975 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f1a14ff9700 (LWP 3084548)):
#0 0x00007f1a432f2da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275f22b9 in ThreadPool::worker(ThreadPool::WorkThread*) () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a275f3ba0 in ThreadPool::WorkThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f19fffff700 (LWP 3084549)):
#0 0x00007f1a432f2da2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275e9f05 in SafeTimer::timer_thread() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a275eb12d in SafeTimerThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f19ff7fe700 (LWP 3084550)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275e9cc9 in SafeTimer::timer_thread() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a275eb12d in SafeTimerThread::entry() () from /usr/lib64/ceph/libceph-common.so.0
#3 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f19feffd700 (LWP 3084551)):
#0 0x00007f1a432f29f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1a275eb975 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.0
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f19fe7fc700 (LWP 3084552)):
#0 0x00007f1a427cbcaf in ppoll () from /lib64/libc.so.6
#1 0x00000000004122ba in tcmur_cmdproc_thread ()
#2 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f19fdffb700 (LWP 3084553)):
#0 0x00007f1a427cbbed in poll () from /lib64/libc.so.6
#1 0x00007f1a4354f37c in g_main_context_iterate.isra.19 () from /lib64/libglib-2.0.so.0
#2 0x00007f1a4354f4ac in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3 0x00007f1a4354f4e9 in glib_worker_main () from /lib64/libglib-2.0.so.0
#4 0x00007f1a43576540 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f19fcff9700 (LWP 3084555)):
#0 0x00007f1a427cbbed in poll () from /lib64/libc.so.6
#1 0x00007f1a4354f37c in g_main_context_iterate.isra.19 () from /lib64/libglib-2.0.so.0
#2 0x00007f1a4354f6ca in g_main_loop_run () from /lib64/libglib-2.0.so.0
#3 0x00007f1a43b39c36 in gdbus_shared_thread_func () from /lib64/libgio-2.0.so.0
#4 0x00007f1a43576540 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5 0x00007f1a432eee65 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f1a427d688d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f1a44231880 (LWP 3084528)):
#0 0x00007f1a427cbbed in poll () from /lib64/libc.so.6
#1 0x00007f1a4354f37c in g_main_context_iterate.isra.19 () from /lib64/libglib-2.0.so.0
#2 0x00007f1a4354f6ca in g_main_loop_run () from /lib64/libglib-2.0.so.0
#3 0x000000000041336d in main ()

@pgh-pamela

Hello, how do I solve this problem? I have met the same problem.

@lxbsz
Collaborator

lxbsz commented Nov 23, 2021

tcmu-runner-1.4.0 is very old. Have you guys tried the latest code?

@lxbsz
Collaborator

lxbsz commented Nov 23, 2021

From the log, I see the ceph IO timed out. It took too long to finish.

tcmu_rbd_handle_timedout_cmd:992 rbd/test.volume: Timing out cmd.

You need to check what happened on the ceph side: why did this IO time out?

@zhangdaolong
Author

Hello, how do I solve this problem? I have met the same problem.
Previously I had enabled rbd_cache to improve performance. After I disabled the rbd cache for librbd, the problem no longer appears.

@lxbsz
Collaborator

lxbsz commented Nov 23, 2021

From the ceph doc:

Ceph supports write-back caching for RBD. To enable it, add rbd cache = true to the [client] section
of your ceph.conf. With caching enabled, writes return immediately, unless there are more than rbd cache
max dirty unflushed bytes. In this case, the write triggers writeback and blocks until enough bytes are
flushed.

I think it must be waiting for the dirty data to be flushed; maybe there is too much dirty data?

Can you see any related logs on the ceph side?
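
For reference, these settings live in the [client] section of ceph.conf; a minimal sketch (values are the documented defaults, shown for illustration):

    [client]
    rbd cache = true                 # enable librbd write-back caching
    rbd cache size = 33554432        # total cache size in bytes (32 MiB default)
    rbd cache max dirty = 25165824   # writes block once this many bytes are dirty (24 MiB default)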

@zhangdaolong
Author

zhangdaolong commented Nov 23, 2021

From the log, I see the ceph IO timed out. It took too long to finish.

tcmu_rbd_handle_timedout_cmd:992 rbd/test.volume: Timing out cmd.

You need to check what happened on the ceph side: why did this IO time out?

Hello xiubo,
I found that tcmu-runner will disable/enable the TPG when an IO times out. However, sometimes the tcmu-runner service is restarted during the TPG restart (disable/enable), so the TPG does not get enabled successfully, and the initiator connection also breaks.
In my usage scenario a single gateway is used, so I don't think it is necessary to restart the TPG. For example, the following code decides whether to restart based on the number of gateways.


target.c
static bool is_single_tpg(struct tgt_port_grp *tpg)
{
    char path[PATH_MAX];
    struct dirent **namelist;
    int i, n;

    snprintf(path, sizeof(path), CFGFS_ROOT"/%s/%s/", tpg->fabric, tpg->wwn);
    tcmu_info("Check %s tpgt count.\n", path);

    /* n counts the directory entries accepted by tpg_filter(). */
    n = scandir(path, &namelist, tpg_filter, alphasort);
    if (n < 0)
        return false;

    /* Release the list scandir() allocated. */
    for (i = 0; i < n; i++)
        free(namelist[i]);
    free(namelist);

    return n >= 2;
}
static void tgt_port_grp_recovery_work_fn(void *arg)
{
    struct tgt_port_grp *tpg = arg;
    struct tcmur_device *rdev, *tmp_rdev;
    bool enable_tpg = false;
    int ret;

    tcmu_dbg("Disabling %s/%s/tpgt_%hu.\n", tpg->fabric, tpg->wwn,
             tpg->tpgt);
    /*
     * This will return when all running commands have completed at
     * the target layer. Handlers must call tcmu_notify_lock_lost
     * before completing the failed command, so the device will be on
     * the reopen list when setting enable=0 returns.
     */
    if (is_single_tpg(tpg)) {
        tcmu_info("Skip restarting the single tpg.\n");
        pthread_mutex_lock(&tpg_recovery_lock);
        list_del(&tpg->recovery_entry);
        pthread_mutex_unlock(&tpg_recovery_lock);
        free_tgt_port_grp(tpg);
        return;
    }

    ret = tcmu_set_tpg_int(tpg, "enable", 0);

    ......
}

I don't know if there are other problems not considered in this implementation.
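
For a quick manual check of the tpgt count, the same configfs tree can be listed from a shell (target IQN illustrative):

    ls -d /sys/kernel/config/target/iscsi/<target_iqn>/tpgt_*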

@zhangdaolong
Author

tcmu-runner-1.4.0 is very old. Have you guys tried the latest code?

No, I'll try it later.

@zhangdaolong
Author

I don't see any exception logs in ceph.client, but when I enable rbd_cache, a flush thread blocked in pthread_cond appears in the dump file.

@lxbsz
Collaborator

lxbsz commented Nov 23, 2021

I found that tcmu-runner will disable/enable the TPG when an IO times out. ... In my usage scenario a single gateway is used, so I don't think it is necessary to restart the TPG. ... I don't know if there are other problems not considered in this implementation.

This sounds reasonable.

@pgh-pamela

Hi all,
My tcmu-runner version is 1.5.4.8. When I map 80 LUNs, everything is normal at first, but after running for a few hours the kernel prints the following logs, and tcmu-runner does not print any logs (I set log_level = 5).
[screenshot: kernel logs]

After that I unmap the LUNs: the first LUN unmaps successfully, but the second LUN can't be deleted; it hangs in lun.delete() (ceph-iscsi). After I restart tcmu-runner, it works again.
[screenshot]

So I think the problem is in the interaction between the kernel and tcmu-runner.
Have any of you encountered similar problems?
Or do you know how tcmu-runner could cause the error "iSCSI Login negotiation failed"?

@zhangdaolong
Author

My tcmu-runner version is 1.5.4.8. When I map 80 LUNs, everything is normal at first, but after running for a few hours the kernel prints the logs shown above, and tcmu-runner does not print any logs. ... Or do you know how tcmu-runner could cause the error "iSCSI Login negotiation failed"?

The error "iSCSI Login negotiation failed" may be caused by an initiator login timeout. Can you provide the following information?
1. the full tcmu-runner.log
2. pstack <tcmu-runner-pid>
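
For example (the default log path is shown; yours may differ):

    cat /var/log/tcmu-runner.log
    pstack $(pidof tcmu-runner) > tcmu-runner_pstack.log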

@pgh-pamela

I have lost the tcmu-runner.log. There is only the pstack output; please see the attachment.

tcmu-runner_pstack.log

@zhangdaolong
Author

I have lost the tcmu-runner.log. There is only the pstack output; please see the attachment.

tcmu-runner_pstack.log

Sorry, I can't see the problem from the pstack information.
I tried to reproduce the problem, but it didn't succeed.
Can you describe in detail the steps to reproduce it?
