Too many files opened when reading HDF5 Virtual Datasets #52

Open
t20100 opened this issue Nov 10, 2021 · 0 comments

t20100 commented Nov 10, 2021

If the limit on the number of open files is lower than the number of files that must be opened to read a virtual dataset, some of the data is not read (and is silently replaced with the fill value).

This patch adds a test that illustrates the issue (provided the maximum number of open files is lower than 2000; check with ulimit -Sn):

--- a/test/base_test.py
+++ b/test/base_test.py
@@ -255,3 +255,25 @@ class BaseTestEndpoints:
                 "target_path": "not_an_entity",
                 "type": "soft_link",
             }
+
+
+    def test_vds(self, server):
+        # Create a VDS referencing more source files than the open-file limit
+        filename = "test.h5"
+        path = "data"
+        nfiles = 2000
+        ndata = 10
+
+        layout = h5py.VirtualLayout(shape=(nfiles, ndata), dtype=np.uint8)
+        for n in range(nfiles):
+            fname = server.served_directory / f"{n}.h5"
+            with h5py.File(fname, "w") as h5file:
+                h5file['data'] = np.ones((ndata,), dtype=np.uint8)
+            layout[n] = h5py.VirtualSource(fname, 'data', shape=(ndata,))
+
+        with h5py.File(server.served_directory / filename, mode="w") as h5file:
+            h5file.create_virtual_dataset('data', layout, fillvalue=0)
+
+        response = server.get(f"/data/?file={str(filename)}&path={path}&format=npy")
+        content = decode_response(response, format="npy")
+        assert np.all(content != 0)

One way to work around this is to increase the limit on the host (ulimit -Sn unlimited); another is to do the same from Python using the resource module, with something like: https://github.com/silx-kit/silx/blob/bd3b283f2cf03f145a34caaaadec539146c448bb/src/silx/app/view/main.py#L89-L102
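
For reference, a minimal sketch of the resource-based workaround (Unix only; the function name is just for illustration, the linked silx code differs slightly):

```python
import resource

def raise_open_file_limit():
    """Raise the soft limit on open file descriptors up to the hard limit (Unix only)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the current limit if the OS refuses the change
```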

A neater solution would be for libhdf5 to take care of this...
