Skip to content

Commit

Permalink
Merge branch 'dev' into workflows
Browse files Browse the repository at this point in the history
  • Loading branch information
mavaylon1 authored Nov 1, 2023
2 parents f68d999 + 70bf35b commit 7cf41ab
Show file tree
Hide file tree
Showing 7 changed files with 164 additions and 33 deletions.
14 changes: 9 additions & 5 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,20 +1,24 @@
# HDMF-ZARR Changelog

## 0.4.0 (Upcoming)
## 0.4.0 (October 3, 2023)

### Enhancements
* Enhanced ZarrIO to resolve object references lazily on read similar to HDMF's `HDF5IO` backend @mavaylon1 [#120](https://github.com/hdmf-dev/hdmf-zarr/pull/120)
* Enhanced ZarrIO to resolve object references lazily on read similar to HDMF's `HDF5IO` backend. @mavaylon1 [#120](https://github.com/hdmf-dev/hdmf-zarr/pull/120)
* Updated storage of references to also save the ``object_id`` and ``source_object_id``. While not strictly necessary this information is useful for validation and rigor of references. @oruebel [#57](https://github.com/hdmf-dev/hdmf-zarr/pull/57)

### New Features
* Added parallel write support for the ``ZarrIO`` for datasets wrapped with ``GenericDataChunkIterator``. @CodyCBakerPhD [#118](https://github.com/hdmf-dev/hdmf-zarr/pull/118) @oruebel [#128](https://github.com/hdmf-dev/hdmf-zarr/pull/128)

### Dependencies
* Updated HDMF and PyNWB version to the most recent release @mavaylon1 [#120](https://github.com/hdmf-dev/hdmf-zarr/pull/120)
* Updated minimum Python version from 3.7 to 3.8 @mavaylon1 [#120](https://github.com/hdmf-dev/hdmf-zarr/pull/120)
* Added ``threadpoolctl`` dependency for parallel write. @CodyCBakerPhD [#118](https://github.com/hdmf-dev/hdmf-zarr/pull/118)
* Added support for specifying optional dependencies and add optional dependency for ``tqdm`` used in parallel write. @CodyCBakerPhD [#118](https://github.com/hdmf-dev/hdmf-zarr/pull/118)

### Bug fixes
* Fixed error in deploy workflow. @mavaylon1 [#109](https://github.com/hdmf-dev/hdmf-zarr/pull/109)
* Fixed build error for ReadtheDocs by degrading numpy for python 3.7 support. @mavaylon1 [#115](https://github.com/hdmf-dev/hdmf-zarr/pull/115)

### New Features
* Added parallel write support for the ``ZarrIO``. @CodyCBakerPhD [#118](https://github.com/hdmf-dev/hdmf-zarr/pull/118)
* Fixed bug in the resolution of references to Containers. Previously references were only resolved to Builders. @mavaylon1 [#120](https://github.com/hdmf-dev/hdmf-zarr/pull/120)


## 0.3.0 (July 21, 2023)
Expand Down
10 changes: 6 additions & 4 deletions docs/source/overview.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,18 +18,20 @@ Supported features
- Write/Read of basic data types, strings and compound data types
- Chunking
- Compression and I/O filters
- Links.
- Links
- Object references
- Writing/loading namespaces/specifications
- Iterative data write using :py:class:`~hdmf.data_utils.AbstractDataChunkIterator`
- Iterative data write using :py:class:`~hdmf.data_utils.AbstractDataChunkIterator`
- Parallel write with :py:class:`~hdmf.data_utils.GenericDataChunkIterator` (since v0.4)
- Lazy load of datasets
- Lazy load of datasets containing object refernces (since v0.4)

Known Limitations
-----------------

- Support for region references is not yet implemented. See also :ref:`sec-zarr-storage-references-region` for details.
- The Zarr backend is currently experimental and may still change.
- Attributes are stored as JSON documents in Zarr (using the DirectoryStore). As such, all attributes must be JSON serializable. The :py:class:`~hdmf_zarr.backend.ZarrIO` backend attempts to cast types to JSON serializable types as much as possible.
- Currently the :py:class:`~hdmf_zarr.backend.ZarrIO` backend uses Zarr's :py:class:`~zarr.storage.DirectoryStore` only. Other `Zarr stores <https://zarr.readthedocs.io/en/stable/api/storage.html>`_ could be added but will require proper treatment of links and references for those backends as links are not supported in Zarr (see `zarr-python issues #389 <https://github.com/zarr-developers/zarr-python/issues/389>`_.
- Currently the :py:class:`~hdmf_zarr.backend.ZarrIO` backend supports Zarr's directory-based stores :py:class:`~zarr.storage.DirectoryStore`, :py:class:`~zarr.storage.NestedDirectoryStore`, and :py:class:`~zarr.storage.TempStore`. Other `Zarr stores <https://zarr.readthedocs.io/en/stable/api/storage.html>`_ could be added but will require proper treatment of links and references for those backends as links are not supported in Zarr (see `zarr-python issues #389 <https://github.com/zarr-developers/zarr-python/issues/389>`_.
- Exporting of HDF5 files with external links is not yet fully implemented/tested. (see `hdmf-zarr issue #49 <https://github.com/hdmf-dev/hdmf-zarr/issues/49>`_.
- Object references are currently always resolved on read (as are links) rather than being loaded lazily (see `hdmf-zarr issue #50 <https://github.com/hdmf-dev/hdmf-zarr/issues/50>`_.
- Special characters (e.g., ``:``, ``<``, ``>``, ``"``, ``/``, ``\``, ``|``, ``?``, or ``*``) may not be supported by all file systems (e.g., on Windows) and as such should not be used as part of the names of Datasets or Groups as Zarr needs to create folders on the filesystem for these objects.
19 changes: 15 additions & 4 deletions docs/source/storage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,12 @@ as JSON. Each dict (i.e., element) in the list defines a link, with each dict co
links that point to object in another Zarr file, the value of source will be the path to
the other Zarr file relative to the root path of the Zarr file containing the link.
* ``path`` : Path to the linked object within the Zarr file idenfied by the ``source`` key
* ``object_id``: Object id of the reference object. May be None in case the referenced object
does not have an assigned object_id (e.g., in the case we reference a dataset with a fixed
name but without and assigned ``data_type`` (or ``neurodata_type`` in the case of NWB).
* ``source_object_id``: Object id of the source Zarr file indicated by the ``source`` key.
The ``source`` should always have an ``object_id`` (at least if the ``source`` file is
a valid HDMF formatted file).

For example:

Expand All @@ -183,8 +189,10 @@ For example:
"zarr_link": [
{
"name": "device",
"source": ".",
"path": "/general/devices/array",
"source": "."
"object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
"source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
}
]
Expand Down Expand Up @@ -256,8 +264,9 @@ Object references are stored in a attributes as dicts with the following keys:
* ``zarr_dtype`` : Indicating the data type for the attribute. For object references
``zarr_dtype`` is set to ``"object"`` (or ``"region"`` for :ref:`sec-zarr-storage-references-region`)
* ``value``: The value of the object references, i.e., here the py:class:`~hdmf_zarr.utils.ZarrReference`
dictionary with the ``source`` and ``path`` keys defining the object reference (again, ``source`` is
here the relative path to the target Zarr file, and ``path`` identifys the object within the source Zarr file).
dictionary with the ``source``, ``path``, ``object_id``, and ``source_object_id`` keys defining
the object reference, with the definition of the keys being the same as
for :ref:`sec-zarr-storage-links`.

For example in NWB, the attribute ``ElectricalSeries.electrodes.table`` would be defined as follows:

Expand All @@ -266,7 +275,9 @@ For example in NWB, the attribute ``ElectricalSeries.electrodes.table`` would be
"table": {
"value": {
"path": "/general/extracellular_ephys/electrodes",
"source": "."
"source": ".",
"object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
"source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
},
"zarr_dtype": "object"
}
Expand Down
25 changes: 24 additions & 1 deletion src/hdmf_zarr/backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -671,6 +671,24 @@ def __get_ref(self, ref_object, export_source=None):
# if isinstance(ref_object, RegionBuilder):
# region = ref_object.region

# get the object id if available
object_id = builder.get('object_id', None)

# determine the object_id of the source by following the parents of the builder until we find the root
# the root builder should be the same as the source file containing the reference
curr = builder
while curr is not None and curr.name != ROOT_NAME:
curr = curr.parent
if curr:
source_object_id = curr.get('object_id', None)
# We did not find ROOT_NAME as a parent. This should only happen if we have an invalid
# file as a source, e.g., if during testing we use an arbitrary builder. We check this
# anyways to avoid potential errors just in case
else:
source_object_id = None
warn_msg = "Could not determine source_object_id for builder with path: %s" % path
warnings.warn(warn_msg)

# by checking os.isdir makes sure we have a valid link path to a dir for Zarr. For conversion
# between backends a user should always use export which takes care of creating a clean set of builders.
source = (builder.source
Expand All @@ -687,7 +705,12 @@ def __get_ref(self, ref_object, export_source=None):
else:
source = os.path.relpath(os.path.abspath(source), start=self.abspath)
# Return the ZarrReference object
return ZarrReference(source, path)
ref = ZarrReference(
source=source,
path=path,
object_id=object_id,
source_object_id=source_object_id)
return ref

def __add_link__(self, parent, target_source, target_path, link_name):
"""
Expand Down
50 changes: 39 additions & 11 deletions src/hdmf_zarr/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -476,31 +476,59 @@ class ZarrReference(dict):

@docval({'name': 'source',
'type': str,
'doc': 'Source of referenced object',
'doc': 'Source of referenced object. Usually the relative path to the '
'Zarr file containing the referenced object',
'default': None},
{'name': 'path',
'type': str,
'doc': 'Path of referenced object',
'doc': 'Path of referenced object within the source',
'default': None},
{'name': 'object_id',
'type': str,
'doc': 'Object_id of the referenced object (if available)',
'default': None},
{'name': 'source_object_id',
'type': str,
'doc': 'Object_id of the source (should always be available)',
'default': None}
)
def __init__(self, **kwargs):
dest_source, dest_path = getargs('source', 'path', kwargs)
dest_source, dest_path, dest_object_id, dest_source_object_id = getargs(
'source', 'path', 'object_id', 'source_object_id', kwargs)
super(ZarrReference, self).__init__()
super(ZarrReference, self).__setitem__('source', dest_source)
super(ZarrReference, self).__setitem__('path', dest_path)
self.source = dest_source
self.path = dest_path
self.object_id = dest_object_id
self.source_object_id = dest_source_object_id

@property
def source(self):
def source(self) -> str:
return super(ZarrReference, self).__getitem__('source')

@property
def path(self):
def path(self) -> str:
return super(ZarrReference, self).__getitem__('path')

@property
def object_id(self) -> str:
return super(ZarrReference, self).__getitem__('object_id')

@property
def source_object_id(self) -> str:
return super(ZarrReference, self).__getitem__('source_object_id')

@source.setter
def source(self, s):
super(ZarrReference, self).__setitem__('source', s)
def source(self, source: str):
super(ZarrReference, self).__setitem__('source', source)

@path.setter
def path(self, p):
super(ZarrReference, self).__setitem__('path', p)
def path(self, path: str):
super(ZarrReference, self).__setitem__('path', path)

@object_id.setter
def object_id(self, object_id: str):
super(ZarrReference, self).__setitem__('object_id', object_id)

@source_object_id.setter
def source_object_id(self, object_id: str):
super(ZarrReference, self).__setitem__('source_object_id', object_id)
67 changes: 61 additions & 6 deletions tests/unit/base_tests_zarrio.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,8 @@
# Try to import Zarr and disable tests if Zarr is not available
import zarr
from hdmf_zarr.backend import ZarrIO
from hdmf_zarr.utils import ZarrDataIO
from hdmf_zarr.utils import ZarrDataIO, ZarrReference
from tests.unit.utils import (Baz, BazData, BazBucket, get_baz_buildmanager)

# Try to import numcodecs and disable compression tests if it is not available
try:
Expand Down Expand Up @@ -290,6 +291,41 @@ def test_write_reference(self):
writer.write_builder(builder)
writer.close()

def test_write_references_roundtrip(self):
# Setup a file container with references
num_bazs = 10
bazs = [] # set up dataset of references
for i in range(num_bazs):
bazs.append(Baz(name='baz%d' % i))
baz_data = BazData(name='baz_data', data=bazs)
container = BazBucket(bazs=bazs, baz_data=baz_data)
manager = get_baz_buildmanager()
# write to file
with ZarrIO(self.store, manager=manager, mode='w') as writer:
writer.write(container=container)
# read from file and validate references
with ZarrIO(self.store, manager=manager, mode='r') as reader:
read_container = reader.read()
for i in range(num_bazs):
baz_name = 'baz%d' % i
expected_container = read_container.bazs[baz_name]
expected_value = {'source': '.',
'path': '/bazs/' + baz_name,
'object_id': expected_container.object_id,
'source_object_id': read_container.object_id}
# Read the dict with the definition of the reference from the raw Zarr file and compare
# to also check that reference (included object id's) are defined correctly
self.assertDictEqual(reader.file['baz_data'][i], expected_value)
# Also test using the low-level reference functions
zarr_ref = ZarrReference(**expected_value)
# Check the ZarrReference first
self.assertEqual(zarr_ref.object_id, expected_value['object_id'])
self.assertEqual(zarr_ref.source_object_id, expected_value['source_object_id'])
# Check that the ZarReference is being resolved via the ZarrIO.resolve_ref
target_name, target_zarr_obj = reader.resolve_ref(zarr_ref)
self.assertEqual(target_name, baz_name)
self.assertEqual(target_zarr_obj.attrs['object_id'], expected_container.object_id)

def test_write_reference_compound(self):
builder = self.createReferenceCompoundBuilder()
writer = ZarrIO(self.store, manager=self.manager, mode='a')
Expand Down Expand Up @@ -578,8 +614,14 @@ def test_write_attributes_write_reference_to_datasetbuilder(self):
tempIO = ZarrIO(self.store, mode='w')
tempIO.open()
attr = {'attr1': dataset_1}
tempIO.write_attributes(obj=tempIO.file, attributes=attr)
expected_value = {'attr1': {'zarr_dtype': 'object', 'value': {'source': ".", 'path': '/dataset_1'}}}
with self.assertWarnsWith(UserWarning,
"Could not determine source_object_id for builder with path: /dataset_1"):
tempIO.write_attributes(obj=tempIO.file, attributes=attr)
expected_value = {'attr1': {'zarr_dtype': 'object',
'value': {'source': ".",
'path': '/dataset_1',
'object_id': None,
'source_object_id': None}}}
self.assertDictEqual(tempIO.file.attrs.asdict(), expected_value)
tempIO.close()

Expand All @@ -590,8 +632,16 @@ def test_write_attributes_write_reference_to_referencebuilder(self):
tempIO = ZarrIO(self.store, mode='w')
tempIO.open()
attr = {'attr1': ref1}
tempIO.write_attributes(obj=tempIO.file, attributes=attr)
expected_value = {'attr1': {'zarr_dtype': 'object', 'value': {'source': ".", 'path': '/dataset_1'}}}
with self.assertWarnsWith(UserWarning,
"Could not determine source_object_id for builder with path: /dataset_1"):
tempIO.write_attributes(obj=tempIO.file, attributes=attr)
expected_value = {'attr1': {'zarr_dtype': 'object',
'value': {'source': ".",
'path': '/dataset_1',
'source_object_id': None,
'object_id': None},
}
}
self.assertDictEqual(tempIO.file.attrs.asdict(), expected_value)
tempIO.close()

Expand Down Expand Up @@ -1027,9 +1077,14 @@ def test_soft_link_group(self):
foofile = FooFile(buckets=[foobucket], foo_link=foo1)
with ZarrIO(self.store[0], manager=get_foo_buildmanager(), mode='w') as write_io:
write_io.write(foofile)

# with open(self.paths[0]+"/.zattrs", 'r') as f:
# print(f.readlines())

with ZarrIO(self.store[0], manager=get_foo_buildmanager(), mode='r') as read_io:
with ZarrIO(self.store[1], mode='w') as export_io:
export_io.export(src_io=read_io, write_args=dict(link_data=False))
export_io.export(src_io=read_io,
write_args=dict(link_data=False))
with ZarrIO(self.store[1], manager=get_foo_buildmanager(), mode='r') as read_io:
read_foofile2 = read_io.read()
# make sure the linked group is within the same file
Expand Down
12 changes: 10 additions & 2 deletions tests/unit/test_parallel_write.py
Original file line number Diff line number Diff line change
Expand Up @@ -189,7 +189,12 @@ def test_simple_tqdm(tmpdir):
display_progress=True,
)
)
dynamic_table = DynamicTable(name="TestTable", description="", columns=[column])
dynamic_table = DynamicTable(
name="TestTable",
description="",
columns=[column],
id=list(range(3)) # must provide id's when all columns are iterators
)
io.write(container=dynamic_table, number_of_jobs=number_of_jobs)

assert expected_desc in tqdm_out.getvalue()
Expand Down Expand Up @@ -222,7 +227,10 @@ def test_compound_tqdm(tmpdir):
)
)
dynamic_table = DynamicTable(
name="TestTable", description="", columns=[pickleable_column, not_pickleable_column]
name="TestTable",
description="",
columns=[pickleable_column, not_pickleable_column],
id=list(range(3)) # must provide id's when all columns are iterators
)
io.write(container=dynamic_table, number_of_jobs=number_of_jobs)

Expand Down

0 comments on commit 7cf41ab

Please sign in to comment.