Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests fail on NFS-mounted temp drive #60

Open
jmaslak opened this issue Dec 22, 2019 · 5 comments
Open

Tests fail on NFS-mounted temp drive #60

jmaslak opened this issue Dec 22, 2019 · 5 comments

Comments

@jmaslak
Copy link
Contributor

jmaslak commented Dec 22, 2019

I'm seeing an issue on a NFS-mounted temp directory with t/file-locked.t failing. It does not fail with local disks. I wouldn't be surprised if my locking configuration isn't quite right on this host (nfs4, Linux kernel 4.4.0).

Interesting output (I set the tmp directory manually to an NFS-mounted subdir under my home directory):

t/file-locked.t ..
# Subtest: close_after_write = 0
    ok 1 - no exception forking children and writing to file
    not ok 2 - 10 children exited

    #   Failed test '10 children exited'
    #   at t/file-locked.t line 109.
    #          got: '9'
    #     expected: '10'
    ok 3 - 10812 exited with 0
    ok 4 - 10813 exited with 0
    ok 5 - 10808 exited with 0
    ok 6 - 10806 exited with 0
    ok 7 - 10815 exited with 0
    ok 8 - 10809 exited with 0
    ok 9 - 10807 exited with 0
    ok 10 - 10810 exited with 0
    ok 11 - 10811 exited with 0
    not ok 12 - file contains expected content

    #   Failed test 'file contains expected content'
    #   at t/file-locked.t line 138.
    #     Structures begin differing at:
    #          $got->[9] = '2'
    #     $expected->[9] = '1'
    # 1
    # 1
    # 1
    # 1
    # 1
    # 1
    # 2
    # 2
    # 2
    # 3
    # 2
    # 2
    # 3
    # 3
    # 2
    # 3
    # 3
    # 1
    # 1
    # 2
    # 3
    # 3
    # 2
    # 3
    # 1
    # 2
    # 3
    1..12
    # Looks like you failed 2 tests of 12.
not ok 1 - close_after_write = 0
@autarch
Copy link
Member

autarch commented Dec 23, 2019

I'm not surprised that there are issues with locking on NFS. IIRC the normal flock call will not work on NFS. I think the best thing for this particular module is just to add a note to the docs.

@jmaslak
Copy link
Contributor Author

jmaslak commented Dec 23, 2019

Yes, although my understanding is the kernel version I'm using is supposed to emulate local locking behavior using fcntl(2) byte range locks when flock() is called, at least for nfs4. I wrote some test programs that seem to confirm that flock() (via perl) does work, at least for simple cases (one process appending to a file and grabbing an exclusive lock will block a second process trying to grab an exclusive lock on the same file).

I was able to get this work going back to the old style local-only flock locking behavior (I.E. local_lock=flock, which also requires downgrading to nfs3). I'm still kind of baffled, I'll have to read up on how locks work with nfs4 sometime. 🤷🏻‍♀️ For what I'm doing, I don't need network-aware flock(), so I guess it'll work.

@autarch
Copy link
Member

autarch commented Dec 24, 2019

Hmm, I really don't know what's going on here. One of the child processes appears to either have not run or to not have exited after 10 seconds.

So there's a couple possibilities:

  • There was a deadlock and the child is stuck forever, but that would be surprising if there's only one child left.
  • The child never ran for some reason, but that also seems really weird.
  • NFS is doing something weird and surprising.

All of which is to say I'm confused.

@jmaslak
Copy link
Contributor Author

jmaslak commented Dec 24, 2019

The processes are just taking a long time. I was seeing sometimes 8, sometimes 9 processes running right.

The locking is working fine on nfs4...sort of. The locks are properly excluding concurrent access. But they are SLOW. I also validated all the processes start, but it takes a long time sometimes to do the write to the file. If I bump the 10s alarm to 30s in t/file-locked.t, it runs just fine for me (albeit slowly). 20s was usually good enough, but every once in a while caused an issue.

I am not sure why it is so slow - this is a NAS connected via 10G to the server, in the same rack, so there isn't much latency or the like, and - but something makes these locks slow (I'll dig into that sometime, it seems slower than it should be - I wonder how the server manages multiple lock requests). Regardless, the speed issue isn't the module's fault. But in the meantime, maybe just bump the alarm time up? I also tried to drop the number of simultaneous processes down, but that didn't help (I.E. 2 children instead of 10 children still needed the 30s alarm).

@autarch
Copy link
Member

autarch commented Dec 24, 2019

Maybe the locking emulation that NFS provides is slow? And yes, we could increase the alarm time a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants