Add hdf5 response format #1292

Open — wants to merge 31 commits into base: main

Commits (31):
e2d8010
Added support for returning optimade data in the hdf5 format.
JPBergsma Jul 28, 2022
079bd71
Added extra docstrings to hdf5.py and made setting for enabling/disa…
JPBergsma Jul 29, 2022
0b71e9e
Added dependencies for hdf5 response to requirements.txt and setup.py.
JPBergsma Jul 29, 2022
9167351
Added enabled_response_formats to test config and disabled hdf5 tests…
JPBergsma Jul 29, 2022
7551132
Added enabled_response_formats to test config and disabled hdf5 tests…
JPBergsma Jul 29, 2022
e43297e
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Jul 29, 2022
d811457
merges changes from master.
JPBergsma Jul 29, 2022
7952092
checking whether the not installing of numpy on github server was cau…
JPBergsma Jul 29, 2022
694894f
added hdf5_deps to extras_require.
JPBergsma Jul 29, 2022
8d51f55
Added numpy and h5py to install_requirements in setup.py
JPBergsma Jul 29, 2022
12b79e0
Use a query that does not have an _exampl_ field to test response for…
JPBergsma Jul 29, 2022
9fe4dcc
Added extra test and the supported response formats are now listed at…
JPBergsma Aug 3, 2022
1981032
Made some changes to the docstrings and type definitions so it will h…
JPBergsma Aug 4, 2022
79b48d6
The test for the single entry point did not work. This is fixed now
JPBergsma Aug 4, 2022
687ea78
Added more thorough check to see whether the response content type is…
JPBergsma Aug 4, 2022
fbfe0f7
Remove numpy and h5py from 'install_requires'.
JPBergsma Aug 4, 2022
a55bd82
Revert "Remove numpy and h5py from 'install_requires'."
JPBergsma Aug 4, 2022
43e326f
Remove h5py_deps and put numpy and h5py back in install_requires.
JPBergsma Aug 4, 2022
1e7e3f9
Processed comments from code review.
JPBergsma Aug 9, 2022
50cacf0
Fixed test_response_format.py
JPBergsma Aug 9, 2022
82f2b31
Added extra test values, and added support for handling nested lists …
JPBergsma Aug 9, 2022
15770f9
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Aug 9, 2022
42864cb
Added extra test to check if response_format is in the enabled_respon…
JPBergsma Aug 10, 2022
7c6a562
Merge branch 'JPBergsma/add_HDF5_output_format' of https://github.com…
JPBergsma Aug 10, 2022
30af05a
Added filenames to the header.
JPBergsma Aug 15, 2022
47fa9ad
Changed the way the collection name is determined for the file name o…
JPBergsma Aug 16, 2022
9ef6b05
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Sep 15, 2022
4ada284
Update requirements.txt
JPBergsma Sep 15, 2022
f1c309d
updated version requirement numpy in requirements.txt
JPBergsma Sep 18, 2022
b32278f
Small fields are now stored as attributes rather than datasets.
JPBergsma Sep 21, 2022
9597cca
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Sep 21, 2022
3 changes: 3 additions & 0 deletions docs/api_reference/adapters/hdf5.md
@@ -0,0 +1,3 @@
# hdf5

::: optimade.adapters.hdf5
268 changes: 268 additions & 0 deletions optimade/adapters/hdf5.py
@@ -0,0 +1,268 @@
from io import BytesIO
from typing import Union, Any
from pydantic import AnyUrl
from datetime import datetime, timezone
from optimade.models import EntryResponseMany, EntryResponseOne
import h5py
from sys import getsizeof
import numpy as np


"""This adapter can be used to generate an HDF5 response instead of a JSON response, and to convert an HDF5 response back into a Python dictionary.
It can handle numeric data in a binary format compatible with numpy, and is therefore more efficient than the JSON format at returning large amounts of numeric data.
It does, however, have more overhead, resulting in a larger response for entries with little numeric data.
To enable support on your server, the parameter "enabled_response_formats" can be specified in the config file.
It is a list of the supported response formats. To support the HDF5 format it should be set to: ["json", "hdf5"]
(support for the JSON format is mandatory).

Unfortunately, h5py does not support storing objects of the numpy.object type.
It is therefore not possible to directly store a list of dictionaries in an HDF5 file with h5py.
As a workaround, the index of a value in a list is used as a dictionary key, so a list can be stored as a dictionary where necessary.

The code also assumes that all elements of a list, tuple or numpy array are of the same type.
"""


def generate_hdf5_file_content(
    response_object: Union[EntryResponseMany, EntryResponseOne, dict, list, tuple]
) -> bytes:
    """Generates the content of an HDF5 file from an EntryResponse object.
    It can also handle Python dictionaries, lists and tuples.

    Parameters:
        response_object: An OPTIMADE response object. This can be of any OPTIMADE entry type, such as structure, reference, etc.

    Returns:
        A binary object containing the contents of the HDF5 file.
    """

    temp_file = BytesIO()
    hdf5_file = h5py.File(temp_file, "w")
    if isinstance(response_object, (EntryResponseMany, EntryResponseOne)):
        response_object = response_object.dict(exclude_unset=True)
    store_hdf5_dict(hdf5_file, response_object)
    hdf5_file.close()
    file_content = temp_file.getvalue()
    temp_file.close()
    return file_content


def store_hdf5_dict(
    hdf5_file: h5py._hl.files.File, iterable: Union[dict, list, tuple], group: str = "/"
):
    """Stores a Python list, dictionary or tuple in an HDF5 file.
    The currently supported datatypes are str, int, float, list, dict, tuple, bool, AnyUrl,
    None, datetime, and any numpy type or numpy array.

    Unfortunately, h5py does not support storing objects of the numpy.object type.
    It is therefore not possible to directly store a list of dictionaries in an HDF5 file with h5py.
    As a workaround, the index of a value in a list is used as a dictionary key, so a list can be stored as a dictionary where necessary.

    Parameters:
        hdf5_file: An HDF5 file-like object.
        iterable: The object to be stored in the HDF5 file.
        group: The group in the HDF5 file to which the list, tuple or dictionary should be added.

    Raises:
        TypeError: If a value has a numpy type that h5py cannot convert to the HDF5 format.
        ValueError: If a value has any other type that cannot be stored in the HDF5 format.
    """
    if isinstance(iterable, (list, tuple)):
        iterable = enumerate(iterable)
    elif isinstance(iterable, dict):
        iterable = iterable.items()
    for x in iterable:
        key = str(x[0])
        value = x[1]
        if isinstance(value, (list, tuple)):
            # For now, assume that all values in the list have the same type.
            if len(value) < 1:  # Case: empty list.
                store_value_in_hdf5(key, value, group, hdf5_file)
                continue
            val_type = type(value[0])
            if isinstance(value[0], dict):
                hdf5_file.create_group(group + "/" + key)
                store_hdf5_dict(hdf5_file, value, group + "/" + key)
            elif val_type.__module__ == np.__name__:
                try:
                    store_value_in_hdf5(key, value, group, hdf5_file)
                except TypeError as hdf5_error:
                    raise TypeError(
                        f"Unfortunately, more complex numpy types like object cannot yet be stored in HDF5. Error from hdf5: {hdf5_error}"
                    )
            elif isinstance(value[0], (int, float)):
                store_value_in_hdf5(key, np.asarray(value), group, hdf5_file)
            elif isinstance(value[0], str):
                # A list of strings can be passed to h5py directly; it is stored as a numpy object.
                store_value_in_hdf5(key, value, group, hdf5_file)
            elif isinstance(value[0], (list, tuple)):
                list_type = get_recursive_type(value[0])
                if list_type in (int, float):
                    store_value_in_hdf5(key, np.asarray(value), group, hdf5_file)
                else:
                    hdf5_file.create_group(group + "/" + key)
                    store_hdf5_dict(hdf5_file, value, group + "/" + key)
            else:
                hdf5_file.create_group(group + "/" + key)
                store_hdf5_dict(hdf5_file, value, group + "/" + key)

        elif isinstance(value, dict):
            hdf5_file.create_group(group + "/" + key)
            store_hdf5_dict(hdf5_file, value, group + "/" + key)
        elif isinstance(value, bool):
            store_value_in_hdf5(key, np.bool_(value), group, hdf5_file)
        elif isinstance(value, AnyUrl):
            # This case has to be placed above the str case, as AnyUrl inherits from str but cannot be handled directly by h5py.
            store_value_in_hdf5(key, str(value), group, hdf5_file)
        elif isinstance(value, (int, float, str)):
            store_value_in_hdf5(key, value, group, hdf5_file)
        elif type(value).__module__ == np.__name__:
            try:
                store_value_in_hdf5(key, value, group, hdf5_file)
            except TypeError as hdf5_error:
                raise TypeError(
                    f"Unfortunately, more complex numpy types like object cannot yet be stored in HDF5. Error from hdf5: {hdf5_error}"
                )
        elif isinstance(value, datetime):
            store_value_in_hdf5(
                key,
                value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                group,
                hdf5_file,
            )
        elif value is None:
            store_value_in_hdf5(key, h5py.Empty("f"), group, hdf5_file)
        else:
            raise ValueError(
                f"Unable to store a value of type {type(value)} in HDF5 format."
            )


def store_value_in_hdf5(key, value, group, hdf5_file):
    compression_level = 1
    if getsizeof(value) < 4096:
        # Small properties can be stored as attributes. The threshold of 4096 is rather
        # arbitrary; the total size of all attributes of a group should, however, not
        # exceed 64 kb.
        if group:
            # If a group is already present, small properties can be stored as attributes.
            # (It seems that a 64 kb header is created for each group/dataset, which makes
            # files with many small datasets very large.)
            hdf5_file[group].attrs[key] = value
        else:
            hdf5_file[group + "/" + key] = value
    else:
        hdf5_file.create_dataset(
            group + "/" + key,
            data=value,
            compression="gzip",
            compression_opts=compression_level,
        )
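The attribute-versus-dataset branch above hinges only on `sys.getsizeof`. A standalone sketch of that decision (the helper name `choose_storage` and the reuse of the 4096-byte cutoff are illustrative assumptions, no h5py required):

```python
from sys import getsizeof

# Mirrors the (admittedly arbitrary) 4096-byte cutoff used above: values below
# it become HDF5 attributes, larger ones become gzip-compressed datasets.
ATTR_SIZE_LIMIT = 4096


def choose_storage(value) -> str:
    """Return 'attribute' for small values and 'dataset' for large ones."""
    return "attribute" if getsizeof(value) < ATTR_SIZE_LIMIT else "dataset"


print(choose_storage("Si"))                 # a short chemical-formula string
print(choose_storage(list(range(10_000))))  # a large coordinate-like list
```

The point of the branch is file size: each HDF5 group/dataset carries substantial header overhead, so packing many tiny fields into group attributes keeps responses small.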


def get_recursive_type(obj: Any) -> type:
    """If obj is a list or tuple, returns the type of the first object in the list/tuple
    that is not itself a list or tuple. If the list or tuple is empty, it returns None.
    Finally, if the object is not a list or tuple, it returns the type of the object.

    Parameters:
        obj: Any Python object.

    Returns:
        The type of the objects contained in the object, or the type of the object itself
        when it does not contain other objects.
    """

    if isinstance(obj, (list, tuple)):
        if len(obj) == 0:
            return None
        if isinstance(obj[0], (list, tuple)):
            return get_recursive_type(obj[0])
        return type(obj[0])
    return type(obj)


def generate_response_from_hdf5(hdf5_content: bytes) -> dict:
    """Generates a response_dict from an HDF5 file-like object.
    It is similar to the response_dict generated from the JSON response, except that the
    numerical data will have numpy types.

    Parameters:
        hdf5_content: The content of an HDF5 file.

    Returns:
        A dictionary containing the data of the HDF5 file.
    """

    temp_file = BytesIO(hdf5_content)
    hdf5_file = h5py.File(temp_file, "r")
    response_dict = generate_dict_from_hdf5(hdf5_file)
    return response_dict


def generate_dict_from_hdf5(
    hdf5_file: h5py._hl.files.File, group: str = "/"
) -> Union[dict, list]:
    """Returns the content of an HDF5 group.
    Because of the workaround described under the store_hdf5_dict function, groups that
    have numbers as keys are turned into lists (there is no guarantee that the order is
    the same as in the original list). Otherwise, the group is turned into a dict.

    Parameters:
        hdf5_file: An HDF5 object containing the data that should be converted to a dictionary or list.
        group: The HDF5 group for which the dictionary should be created. The default is "/", which returns all the data in the HDF5 object.

    Returns:
        A dict or list containing the content of the HDF5 group.
    """

    return_value = None
    for key, value in hdf5_file[group].items():
        return_value = inside_generate_dict_from_hdf5(
            key, value, return_value, group, hdf5_file
        )
    for key, value in hdf5_file[group].attrs.items():
        return_value = inside_generate_dict_from_hdf5(
            key, value, return_value, group, hdf5_file
        )
    return return_value


def inside_generate_dict_from_hdf5(key, value, return_value, group, hdf5_file):
    if key.isdigit():  # Case: list entry (see the index-as-key workaround).
        if return_value is None:
            return_value = []
        if isinstance(value, h5py._hl.group.Group):
            return_value.append(
                generate_dict_from_hdf5(hdf5_file, group=group + key + "/")
            )
        elif isinstance(value, h5py._hl.base.Empty):
            return_value.append(None)
        elif isinstance(value, str):
            return_value.append(value)
        elif isinstance(value[()], h5py._hl.base.Empty):
            return_value.append(None)
        elif isinstance(value[()], bytes):
            return_value.append(value[()].decode())
        else:
            return_value.append(value[()])

    else:  # Case: dictionary entry.
        if return_value is None:
            return_value = {}
        if isinstance(value, h5py._hl.group.Group):
            return_value[key] = generate_dict_from_hdf5(
                hdf5_file, group=group + key + "/"
            )
        elif isinstance(value, h5py._hl.base.Empty):
            return_value[key] = None
        elif isinstance(value, str):
            return_value[key] = value
        elif isinstance(value[()], h5py._hl.base.Empty):
            return_value[key] = None
        elif isinstance(value[()], bytes):
            return_value[key] = value[()].decode()
        else:
            return_value[key] = value[()]

    return return_value
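The index-as-key workaround and its inverse (the `key.isdigit()` check above) amount to the following round trip. A minimal pure-Python sketch with hypothetical helper names, leaving h5py out entirely; note that sorting by integer key here restores list order, a guarantee the adapter itself does not make:

```python
def flatten_list(values: list) -> dict:
    # Store each list element under its stringified index, since h5py
    # cannot store a list of dicts directly.
    return {str(i): v for i, v in enumerate(values)}


def rebuild(mapping: dict):
    # All-digit keys signal that the group was originally a list.
    if mapping and all(k.isdigit() for k in mapping):
        return [mapping[k] for k in sorted(mapping, key=int)]
    return mapping


species = [{"name": "Si"}, {"name": "O"}]
assert rebuild(flatten_list(species)) == species
```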
14 changes: 14 additions & 0 deletions optimade/models/jsonapi.py
@@ -8,6 +8,7 @@
    parse_obj_as,
    root_validator,
)
import numpy
from optimade.models.utils import StrictField


@@ -319,6 +320,13 @@ class Resource(BaseResource):
    )


def process_ndarray(arg):
    if arg.dtype == object:
        return arg.astype(str).tolist()
    else:
        return arg.tolist()


class Response(BaseModel):
    """A top-level response"""

@@ -365,4 +373,10 @@ class Config:
            datetime: lambda v: v.astimezone(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ"
            ),
            numpy.int32: lambda v: int(v),
            numpy.float32: lambda v: float(v),
            numpy.int64: lambda v: int(v),
            numpy.float64: lambda v: float(v),
            numpy.bool_: lambda v: bool(v),
            numpy.ndarray: process_ndarray,
        }
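The json_encoders added above map numpy scalars and arrays onto builtin types so the JSON response path keeps working when entries carry numpy data. A rough standalone equivalent using a plain `json.JSONEncoder` (assuming numpy is installed; the single `np.generic` branch is a simplification of the per-type lambdas):

```python
import json

import numpy as np


class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            # Mirrors process_ndarray: object arrays go through str first.
            return obj.astype(str).tolist() if obj.dtype == object else obj.tolist()
        if isinstance(obj, np.generic):
            # Covers np.int32/int64/float32/float64/bool_ in one branch.
            return obj.item()
        return super().default(obj)


doc = {"nsites": np.int64(8), "positions": np.array([[0.0, 0.5]])}
print(json.dumps(doc, cls=NumpyEncoder))  # {"nsites": 8, "positions": [[0.0, 0.5]]}
```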
19 changes: 19 additions & 0 deletions optimade/server/config.py
@@ -68,6 +68,18 @@ class SupportedBackend(Enum):
    MONGOMOCK = "mongomock"


class SupportedResponseFormats(Enum):
    """Enumeration of supported response formats.

    - 'JSON': [JSON](https://www.json.org/json-en.html)
    - 'HDF5': [HDF5](https://portal.hdfgroup.org/display/HDF5/HDF5)

    """

    HDF5 = "hdf5"
    JSON = "json"


def config_file_settings(settings: BaseSettings) -> Dict[str, Any]:
    """Configuration file settings source.

@@ -291,6 +303,10 @@ class ServerConfig(BaseSettings):
        True,
        description="If True, the server will check whether the query parameters given in the request are correct.",
    )
    enabled_response_formats: Optional[List[SupportedResponseFormats]] = Field(
        ["json"],
        description="""A list of the response formats that are supported by this server. Must include the "json" format.""",
    )

    @validator("implementation", pre=True)
    def set_implementation_version(cls, v):
@@ -318,6 +334,9 @@ def use_real_mongo_override(cls, values):

        return values

    def get_enabled_response_formats(self):
        return [e.value for e in self.enabled_response_formats]

    class Config:
        """
        This is a pydantic model Config object that modifies the behaviour of
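The enum-backed setting unwraps to plain strings via get_enabled_response_formats() for comparison against the raw query parameter. A stripped-down sketch of that pattern without pydantic (the bare `ServerConfig` stand-in is an assumption for illustration):

```python
from enum import Enum
from typing import List


class SupportedResponseFormats(Enum):
    HDF5 = "hdf5"
    JSON = "json"


class ServerConfig:
    def __init__(self, enabled: List[SupportedResponseFormats]):
        self.enabled_response_formats = enabled

    def get_enabled_response_formats(self) -> List[str]:
        # Unwrap the enum members to their string values for comparison
        # against the raw response_format query parameter.
        return [e.value for e in self.enabled_response_formats]


config = ServerConfig([SupportedResponseFormats.JSON, SupportedResponseFormats.HDF5])
print(config.get_enabled_response_formats())  # ['json', 'hdf5']
```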
4 changes: 2 additions & 2 deletions optimade/server/entry_collections/entry_collections.py
@@ -301,10 +301,10 @@ def handle_query_params(
        # response_format
        if (
            getattr(params, "response_format", False)
            and params.response_format != "json"
            and params.response_format not in CONFIG.get_enabled_response_formats()
        ):
            raise BadRequest(
                detail=f"Response format {params.response_format} is not supported, please use response_format='json'"
                detail=f"Response format {params.response_format} is not supported, please use one of the supported response_formats: {','.join(CONFIG.get_enabled_response_formats())}"
            )

        # page_limit
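The updated check rejects any response_format outside the configured list rather than hard-coding "json". Roughly, as a self-contained sketch (`BadRequest` here is a plain stand-in for optimade's own exception, and `check_response_format` is a hypothetical helper name):

```python
class BadRequest(Exception):
    pass


def check_response_format(response_format: str, enabled: list) -> None:
    # Mirrors the handle_query_params logic: anything outside
    # enabled_response_formats is rejected with an explanatory message.
    if response_format and response_format not in enabled:
        raise BadRequest(
            f"Response format {response_format} is not supported, "
            f"please use one of: {','.join(enabled)}"
        )


check_response_format("hdf5", ["json", "hdf5"])  # accepted, returns None
try:
    check_response_format("yaml", ["json", "hdf5"])
except BadRequest as exc:
    print(exc)
```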
8 changes: 7 additions & 1 deletion optimade/server/middleware.py
@@ -445,7 +445,13 @@ async def dispatch(self, request: Request, call_next):
                if not isinstance(chunk, bytes):
                    chunk = chunk.encode(charset)
                body += chunk
            body = body.decode(charset)
            for i in range(len(response.raw_headers)):
                if (
                    response.raw_headers[i][0] == b"content-type"
                    and response.raw_headers[i][1] == b"application/vnd.api+json"
                ):
                    body = body.decode(charset)
                    break

            if self._warnings:
                response = json.loads(body)
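Because an HDF5 body is binary, the middleware change above only decodes the body for JSON:API responses. The header scan reduces to the following standalone sketch (`decode_if_json` is a hypothetical name; `raw_headers` is shaped like Starlette's list of byte-string pairs, and the `application/x-hdf5` value in the example is an illustrative assumption):

```python
def decode_if_json(body: bytes, raw_headers: list, charset: str = "utf-8"):
    # Only JSON:API responses are text; HDF5 bodies must stay as raw bytes.
    for name, value in raw_headers:
        if name == b"content-type" and value == b"application/vnd.api+json":
            return body.decode(charset)
    return body


json_headers = [(b"content-type", b"application/vnd.api+json")]
print(decode_if_json(b'{"data": []}', json_headers))  # {"data": []}

hdf5_headers = [(b"content-type", b"application/x-hdf5")]
print(type(decode_if_json(b"\x89HDF", hdf5_headers)))  # <class 'bytes'>
```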