Tests fail on NFS-mounted temp drive #60

jmaslak · 2019-12-22T21:27:25Z

I'm seeing an issue on a NFS-mounted temp directory with t/file-locked.t failing. It does not fail with local disks. I wouldn't be surprised if my locking configuration isn't quite right on this host (nfs4, Linux kernel 4.4.0).

Interesting output (I set the tmp directory manually to an NFS-mounted subdir under my home directory):

t/file-locked.t ..
# Subtest: close_after_write = 0
    ok 1 - no exception forking children and writing to file
    not ok 2 - 10 children exited

    #   Failed test '10 children exited'
    #   at t/file-locked.t line 109.
    #          got: '9'
    #     expected: '10'
    ok 3 - 10812 exited with 0
    ok 4 - 10813 exited with 0
    ok 5 - 10808 exited with 0
    ok 6 - 10806 exited with 0
    ok 7 - 10815 exited with 0
    ok 8 - 10809 exited with 0
    ok 9 - 10807 exited with 0
    ok 10 - 10810 exited with 0
    ok 11 - 10811 exited with 0
    not ok 12 - file contains expected content

    #   Failed test 'file contains expected content'
    #   at t/file-locked.t line 138.
    #     Structures begin differing at:
    #          $got->[9] = '2'
    #     $expected->[9] = '1'
    # 1
    # 1
    # 1
    # 1
    # 1
    # 1
    # 2
    # 2
    # 2
    # 3
    # 2
    # 2
    # 3
    # 3
    # 2
    # 3
    # 3
    # 1
    # 1
    # 2
    # 3
    # 3
    # 2
    # 3
    # 1
    # 2
    # 3
    1..12
    # Looks like you failed 2 tests of 12.
not ok 1 - close_after_write = 0

The text was updated successfully, but these errors were encountered:

autarch · 2019-12-23T03:45:50Z

I'm not surprised that there are issues with locking on NFS. IIRC the normal flock call will not work on NFS. I think the best thing for this particular module is just to add a note to the docs.

jmaslak · 2019-12-23T06:11:30Z

Yes, although my understanding is the kernel version I'm using is supposed to emulate local locking behavior using fcntl(2) byte range locks when flock() is called, at least for nfs4. I wrote some test programs that seem to confirm that flock() (via perl) does work, at least for simple cases (one process appending to a file and grabbing an exclusive lock will block a second process trying to grab an exclusive lock on the same file).

I was able to get this work going back to the old style local-only flock locking behavior (I.E. local_lock=flock, which also requires downgrading to nfs3). I'm still kind of baffled, I'll have to read up on how locks work with nfs4 sometime. 🤷🏻‍♀️ For what I'm doing, I don't need network-aware flock(), so I guess it'll work.

autarch · 2019-12-24T16:44:47Z

Hmm, I really don't know what's going on here. One of the child processes appears to either have not run or to not have exited after 10 seconds.

So there's a couple possibilities:

There was a deadlock and the child is stuck forever, but that would be surprising if there's only one child left.
The child never ran for some reason, but that also seems really weird.
NFS is doing something weird and surprising.

All of which is to say I'm confused.

jmaslak · 2019-12-24T17:47:06Z

The processes are just taking a long time. I was seeing sometimes 8, sometimes 9 processes running right.

The locking is working fine on nfs4...sort of. The locks are properly excluding concurrent access. But they are SLOW. I also validated all the processes start, but it takes a long time sometimes to do the write to the file. If I bump the 10s alarm to 30s in t/file-locked.t, it runs just fine for me (albeit slowly). 20s was usually good enough, but every once in a while caused an issue.

I am not sure why it is so slow - this is a NAS connected via 10G to the server, in the same rack, so there isn't much latency or the like, and - but something makes these locks slow (I'll dig into that sometime, it seems slower than it should be - I wonder how the server manages multiple lock requests). Regardless, the speed issue isn't the module's fault. But in the meantime, maybe just bump the alarm time up? I also tried to drop the number of simultaneous processes down, but that didn't help (I.E. 2 children instead of 10 children still needed the 30s alarm).

autarch · 2019-12-24T17:48:08Z

Maybe the locking emulation that NFS provides is slow? And yes, we could increase the alarm time a bit.

jmaslak mentioned this issue Dec 25, 2019

Improve test resilience on very slow-to-lock filesystems (I.E. nfs4) #61

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests fail on NFS-mounted temp drive #60

Tests fail on NFS-mounted temp drive #60

jmaslak commented Dec 22, 2019

autarch commented Dec 23, 2019

jmaslak commented Dec 23, 2019

autarch commented Dec 24, 2019

jmaslak commented Dec 24, 2019

autarch commented Dec 24, 2019

Tests fail on NFS-mounted temp drive #60

Tests fail on NFS-mounted temp drive #60

Comments

jmaslak commented Dec 22, 2019

autarch commented Dec 23, 2019

jmaslak commented Dec 23, 2019

autarch commented Dec 24, 2019

jmaslak commented Dec 24, 2019

autarch commented Dec 24, 2019