
make check hangs in tests/kern #61

Closed
garlick opened this issue Jun 15, 2020 · 5 comments

Comments

@garlick
Member

garlick commented Jun 15, 2020

Running make check as root in tests/kern hangs at test t05.
My kernel is 5.4.0-7634-generic (Ubuntu 20.04 LTS).

This may be a duplicate of #23, which was filed against linux-next in 2015, but I wanted to open a new bug until that is confirmed.

@garlick
Member Author

garlick commented Oct 26, 2021

Focusing on t05, which is the first failing test: it also hangs on kernel 5.10.63.

The test script only runs /bin/true and, as expected, it succeeded.

Test output t05.out:

kconjoin: diodmount exited with rc=0
kconjoin: t05 exited with rc=0

and the diod log t05.diod contains:

diod: P9_TVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_RVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_TAUTH tag 0 afid 0 uname '' aname '/tmp/tmp.vBZM9TQ04J' n_uname 0
diod: P9_RLERROR tag 0 ecode 2
diod: P9_TATTACH tag 0 fid 0 afid -1 uname '' aname '/tmp/tmp.vBZM9TQ04J' n_uname 0
diod: P9_RATTACH tag 0 qid (000000000001fcac 0 'd')
diod: P9_TCLUNK tag 0 fid 0
diod: P9_RCLUNK tag 0
diod: P9_TVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_RVERSION tag 65535 msize 65536 version '9P2000.L'
diod: P9_TATTACH tag 0 fid 0 afid -1 uname 'root' aname '/tmp/tmp.vBZM9TQ04J' n_uname P9_NONUNAME
diod: P9_RATTACH tag 0 qid (000000000001fcac 0 'd')
diod: P9_TGETATTR tag 0 fid 0 request_mask 0x7ff
diod: P9_RGETATTR tag 0 valid 0x7ff qid (000000000001fcac 0 'd') mode 040755 uid 0 gid 0 nlink 2 rdev 0 size 4096 blksize 4096 blocks 8 atime Tue Oct 26 16:52:16 2021 mtime Tue Oct 26 16:52:16 2021 ctime Tue Oct 26 16:52:16 2021 btime X gen X data_version X

gdb says kconjoin is stuck here:

(gdb) bt
#0  0xb6ddb2a8 in __GI___waitpid (pid=pid@entry=18484, 
    stat_loc=stat_loc@entry=0xbee1d218, options=options@entry=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0xb6d744ec in do_system (
    line=line@entry=0xbee1d652 "../../diod/diod  -r80 -w81 -c /dev/null -n -d 1 -L t05.diod -e /tmp/tmp.vBZM9TQ04J") at ../sysdeps/posix/system.c:149
#2  0xb6d749c4 in __libc_system (
    line=line@entry=0xbee1d652 "../../diod/diod  -r80 -w81 -c /dev/null -n -d 1 -L t05.diod -e /tmp/tmp.vBZM9TQ04J") at ../sysdeps/posix/system.c:185
#3  0x00010a40 in main (argc=<optimized out>, argv=<optimized out>)
    at kconjoin.c:133

and diod here:

(gdb) bt
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x1f52ffc)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x0, cond=0x1f52fd0)
    at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x1f52fd0, mutex=0x0) at pthread_cond_wait.c:655
#3  0x00028bec in np_srv_wait_conncount (srv=0x1f52f18, count=count@entry=1)
    at srv.c:141
#4  0x00012dbc in _service_run (wfdno=-1227925284, rfdno=<optimized out>, 
    mode=SRV_FILEDES) at diod.c:666
#5  main (argc=<optimized out>, argv=<optimized out>) at diod.c:257

which is this function:

/* Block the caller until the server has no active connections,
 * and there have been at least 'count' connections historically.
 */
void
np_srv_wait_conncount(Npsrv *srv, int count)
{
        xpthread_mutex_lock(&srv->lock);
        while (srv->conncount > 0 || srv->connhistory < count) {
                xpthread_cond_wait(&srv->conncountcond, &srv->lock);
        }
        xpthread_mutex_unlock(&srv->lock);
}

The connection count is still 1:

(gdb) frame 3
#3  0x00028bec in np_srv_wait_conncount (srv=0x1f52f18, count=count@entry=1)
    at srv.c:141
141			xpthread_cond_wait(&srv->conncountcond, &srv->lock);
(gdb) p srv->conncount
$1 = 1

So the kernel does not clunk the mount when the test program completes.

@garlick
Member Author

garlick commented Oct 26, 2021

The 9p mount made inside the private namespace established with CLONE_NEWNS appears to be leaking, since it is visible to everyone in /proc/mounts:

$ cat /proc/mounts|grep 9p
nohost:/tmp/tmp.YRvf1AVI4r /tmp/tmp.kbqx8vsreA 9p rw,sync,dirsync,relatime,debug=1,uname=root,aname=/tmp/tmp.YRvf1AVI4r,access=user,msize=65536,trans=fd,rfd=80,wfd=81 0 0

Running sudo umount /tmp/tmp.kbqx8vsreA allows the test to complete successfully.

@garlick
Member Author

garlick commented Oct 26, 2021

This seems to resolve the issue.

diff --git a/tests/kern/kconjoin.c b/tests/kern/kconjoin.c
index 83f08b0..4e32342 100644
--- a/tests/kern/kconjoin.c
+++ b/tests/kern/kconjoin.c
@@ -114,6 +114,13 @@ main (int argc, char *argv[])
             _movefd (fromsrv[0], RFDNO);
             if (unshare (CLONE_NEWNS) < 0)
                 err_exit ("unshare");
+            /* Change root propagation to private within this namespace,
+             * as systemd may have mounted root with it set to shared,
+             * and then the 9p mount will leak into the main namespace and
+             * not be automatically unmounted when the test completes.
+             */
+            system ("mount --make-private /");
+
             if ((cs = system (mntcmd)) == -1)
                 err_exit ("failed to run %s", _cmd (mntcmd));
             if (_interpret_status (cs, _cmd (mntcmd)))

@garlick
Member Author

garlick commented Jul 6, 2024

There are still some cleanup issues with that fix applied. After running the tests I get:

$ df
df: /tmp/tmp.vExWhjHnlw: Input/output error
df: /tmp/tmp.xsXEoaK3x6: Input/output error
df: /tmp/tmp.BUHhy3kS6C: Input/output error
df: /tmp/tmp.EVMsv8hhgf: Input/output error
df: /tmp/tmp.orlCjxATgT: Input/output error
df: /tmp/tmp.F8bpFfCXZL: Input/output error
df: /tmp/tmp.REyd5kqrJh: Input/output error
df: /tmp/tmp.wF4OUnufwA: Input/output error
df: /tmp/tmp.cMaEsks0sl: Input/output error
df: /tmp/tmp.AaFdxFdXw2: Input/output error
df: /tmp/tmp.Hx98fIqnQH: Input/output error
df: /tmp/tmp.rTdvkUeHyg: Input/output error
df: /tmp/tmp.SGGAtLbvHa: Input/output error
df: /tmp/tmp.iii51W8iiM: Input/output error
df: /tmp/tmp.8NZQKsmVQj: Input/output error
df: /tmp/tmp.FavfPS4Ffx: Input/output error
df: /tmp/tmp.ouYPdkmvU5: Input/output error
df: /tmp/tmp.LLXGbJAnaN: Input/output error

@garlick
Member Author

garlick commented Dec 29, 2024

This is likely because the propagation change was only applied to the root mount, but /tmp is a separate file system. The propagation change should be made recursively (as with mount --make-rprivate /), which is what unshare(1) now does by default:

https://github.com/util-linux/util-linux/blob/master/sys-utils/unshare.c#L56-L57

mergify bot closed this as completed in d4401fc on Dec 30, 2024