A collection of type annotations for Python types common in the fields of scientific computing, data science and machine learning. Most types come with a pair of JSON serializer/deserializer functions, which can be used both to archive and to transfer data. In contrast to pickling, the resulting JSON data is both future-safe and human-readable, which makes it especially useful for reproducible research.
Some supported data types (see the documentation for a full list):
- range
- NumPy arrays[^1]
- NumPy random number generators
- SciPy statistical distributions
- Pure functions
- PyTorch tensors
- Values with units (Pint)
More importantly, scityping is built upon a simple yet flexible mechanism for defining custom JSON serializers and associating them to preexisting types. This means that any Python type which should be serializable in principle can also be made serializable in practice. So you can take an existing analysis pipeline and make the input or output of any intermediate step exportable to JSON – without ever touching the pipeline itself!
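For example, the output of an existing pipeline step can be dumped to JSON with the encoder provided in `scityping.json` (introduced below); in this sketch, `run_step` and `inputs` are placeholders standing in for your own pipeline code:

```python
import json
from scityping.json import scityping_encoder

result = run_step(inputs)   # placeholder for an existing pipeline step
                            # returning e.g. NumPy arrays
with open("step_output.json", "w") as f:
    json.dump(result, f, default=scityping_encoder)   # pipeline code itself is untouched
```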
Types were designed and tested with Pydantic in mind, so they can be used as types in Pydantic `BaseModel`s. To support serialization, use

```python
from scityping.pydantic import BaseModel
```

This is just a small wrapper class around `pydantic.BaseModel` which adds a hook for our extensible type machinery.
Example usage:

```python
from scityping.numpy import Array
from scityping.pydantic import BaseModel

class MyModel(BaseModel):
    data: Array[float, 1]   # 1D array of floats

model = MyModel(data=[1, 2, 3])
print(model.data)
# [1. 2. 3.]

json_data = model.json()
print(json_data)
# {"data": ["scityping.numpy._ArrayType", {"data": {"data": [1.0, 2.0, 3.0], "dtype": ["scityping.numpy.DType", {"desc": "float64"}]}}]}

model2 = MyModel.parse_raw(json_data)
print(model2.data)
# [1. 2. 3.]
```
Serializers are class methods, so they can also be used by calling them directly:
```python
import json
import numpy as np
from scityping import Serializable
from scityping.json import scityping_encoder

x = np.array([1, 1, 2, 3, 5])

reduced_x = Serializable.reduce(x)
print(reduced_x)
# ('scityping.numpy._ArrayType', _ArrayType.Data(data=ListArrayData(data=[1, 1, 2, 3, 5], dtype=dtype('int64'))))

json_x = json.dumps(x, default=scityping_encoder)
print(json_x)
# ["scityping.numpy._ArrayType", {"data": {"data": [1, 1, 2, 3, 5], "dtype": ["scityping.numpy.DType", {"desc": "int64"}]}}]

x2 = Serializable.validate(reduced_x)
x2 = Serializable.validate(json.loads(json_x))  # Equivalent
print(x2)
# [1 1 2 3 5]
```
Note here that we use the `.reduce()`, `.deep_reduce()` and `.validate()` methods attached to the base class: it is not necessary to know beforehand the type of `x`.
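Building on the imports above, this makes it easy to write generic helpers; the `roundtrip` function below is illustrative and not part of scityping:

```python
def roundtrip(obj):
    """Serialize `obj` to JSON and back, without knowing its concrete type."""
    s = json.dumps(obj, default=scityping_encoder)
    return Serializable.validate(json.loads(s))

y = roundtrip(np.arange(4))
print(type(y))
# <class 'numpy.ndarray'>
```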
This package is my fifth iteration, over three years, of a simple system for defining Python types that package their own serialization/deserialization routines. The motivating use case is to support fully self-contained “units of work” for research computations; these are used both to ensure reproducibility by archiving work, and to transfer units of work between processes or machines. (For example between simulation and analysis scripts, or between laptop and server.) This use case imposed the following requirements:
- The serialized format must be portable and suitable for long-term storage (which precludes solutions like pickle or dill).
- The format should be plain text and, to the extent possible, human-readable. This further increases long-term reliability, since in the worst case, the file can be opened in a text editor and interpreted by a human.
- It must be possible to associate a pair of serialization/deserialization routines to a type.
- These routines must be recognized by Pydantic, since this is the serialization library I currently use.
- It should be possible for imported libraries to use custom serializers, without worrying about the order of imports. This was almost impossible to do with Pydantic v1.[^2]
- It should be possible to associate custom serializers to already existing types. (E.g. add a serializer for `complex`.)[^3]
- It should be possible to use generic base classes in a model definition, while subclasses are preserved after deserialization. Since it is often the case that the same computational pipeline can be applied to many different types, the ability to define classes in terms of only generic base types is extremely useful. For example, in the snippet below this is not the case:

```python
from pydantic import BaseModel

class Signal: ...
class AMSignal(Signal): ...
class FMSignal(Signal): ...

class MyModel(BaseModel):
    signal: Signal

sig = AMSignal(...)
model = MyModel(signal=sig)
type(model.signal)    # AMSignal

model2 = MyModel.parse_raw(model.json())
type(model2.signal)   # Signal -> Serialization round trip does not preserve type.
```
The solution provided by this package satisfies all of these requirements.
The provided `Serializable` class can be used to make almost any class serializable. A standardized pattern is used to specify what needs to be serialized, and scityping takes care of the rest. Note that types defined with `Serializable` take precedence over those already supported by Pydantic; this makes it possible to override the builtin Pydantic serializers and deserializers for specific types.
Example: Define a serializable `Signal` class.
```python
# mysignal.py
from scityping import Serializable
from scityping.numpy import Array
from scityping.pydantic import dataclass

class Signal(Serializable):
    @dataclass
    class Data:                   # The `Data` class stores the minimal
        values: Array[float, 2]   # set of information required to
        times : Array[int, 1]     # recreate the `Signal` class.
        freq  : float
        @staticmethod
        def encode(signal: "Signal"):                         # The `encode` method defines how to extract
            return (signal.values, signal.times, signal.freq) # values for the fields in `Data`

    def __init__(self, values, times, freq):
        ...

    ## Analysis methods (mean, window, fft, etc.) go here ##
```
Now the class can be used as a type hint in a Pydantic model:
```python
from typing import List
from scityping.pydantic import BaseModel
from mysignal import Signal   # Type defined in the code block above

class Recorder(BaseModel):
    signals: List[Signal]
```
Moreover, subclasses of `Signal` can also be used and will be recognized correctly. They only need to define their own `Data` container (which may reuse the one from the base class, as we do below).
```python
class AMSignal(Signal):
    class Data(Signal.Data):
        pass

signal = Signal(...)
amsignal = AMSignal(...)

rec = Recorder(signals=[signal, amsignal])
rec2 = Recorder.parse_raw(rec.json())

assert type(rec2.signals[0]) is Signal    # Succeeds
assert type(rec2.signals[1]) is AMSignal  # Also succeeds
```
(With standard Pydantic types, in the example above, values would all be coerced to the `Signal` type after deserialization.)
The default decoder will pass the attributes of the `Data` object to the class’ `__init__` method as keyword arguments. For cases where this is unsatisfactory, one can define a custom `decode` method. The example below uses this to return a true `range` object, rather than the `Range` wrapper:
```python
from typing import Optional
from scityping import Serializable
from scityping.pydantic import dataclass

class Range(Serializable, range):
    @dataclass
    class Data:
        start: int
        stop : Optional[int]=None
        step : Optional[int]=None
        def encode(r): return r.start, r.stop, r.step
        def decode(data): return range(data.start, data.stop, data.step)
```
Note that while the `decode` mechanism can help solve corner cases, in many situations the benefits are cosmetic.
Finally, extending an existing type with a serializer is just a matter of defining a subclass. For example, the `Complex` type provided by scityping is implemented as follows:
```python
class Complex(Serializable, complex):
    @dataclass
    class Data:
        real: float
        imag: float
        def encode(z): return z.real, z.imag
        def decode(data): return complex(data.real, data.imag)
```
(When the subclass uses the same name as the wrapped type (case-insensitive), a bit of magic is applied to associate it to the base type. Otherwise an extra `Complex.register(complex)` is needed.)
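As a sketch, this is what the explicit registration would look like for a wrapper whose name differs from the wrapped type (`ComplexWrapper` is a made-up name; in practice you would simply use the provided `Complex`):

```python
from scityping import Serializable
from scityping.pydantic import dataclass

class ComplexWrapper(Serializable, complex):   # name does not match `complex`,
    @dataclass                                 # so no automatic association
    class Data:
        real: float
        imag: float
        def encode(z): return z.real, z.imag
        def decode(data): return complex(data.real, data.imag)

ComplexWrapper.register(complex)   # explicitly associate the wrapper with `complex`
```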
The `Complex` type can then be used to allow serializing all complex numbers (not just instances of `Complex`):
```python
from scityping.pydantic import BaseModel

class Foo(BaseModel):
    z: Complex

# Serializes even the original data type
Foo(z=3+4j).json()
# '{"z": ["__main__.Complex", {"real": 3.0, "imag": 4.0}]}'
```
Note that this works even if `Complex` is defined after `Foo`, without any patching of `Foo`’s list of `__json_encoders__`.[^4]
**Hint:** It is recommended that the nested `Data` provide type-aware deserialization logic. The easiest way to do this is to use either `scityping.pydantic.BaseModel` or `scityping.pydantic.dataclass`. We do add basic deserialization support for the builtin `dataclasses.dataclass`, but this is intentionally limited to a few basic types (`int`, `float`, `str`) and `Serializable` subclasses. If your data are very simple, this may be sufficient – see the docstring of `scityping.Dataclass` for what exactly is supported.
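For illustration, here is what a `Data` container based on `scityping.pydantic.BaseModel` might look like (the `Trace` class and its fields are hypothetical, following the same pattern as `Signal` above):

```python
from scityping import Serializable
from scityping.numpy import Array
from scityping.pydantic import BaseModel

class Trace(Serializable):                # hypothetical class, for illustration
    class Data(BaseModel):                # BaseModel instead of dataclass:
        samples: Array[float, 1]          # nested fields such as this NumPy array
        label  : str                      # get type-aware deserialization
        @staticmethod
        def encode(trace: "Trace"):
            return trace.samples, trace.label

    def __init__(self, samples, label):
        self.samples = samples
        self.label = label
```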
For types serialized with Pydantic, the scityping machinery adds a single `isinstance` check to each serialization call, which should be negligible.[^5] Serialization with `Serializable` is not as aggressively tested with regard to performance and is expected to be slower. Deserialization of `Serializable` types in general will be slower, although this should only be noticeable if deserialization is a big part of your application. In some cases it may even be faster: Pydantic fields with `Union` types may fall back to a try-catch for each possible type, keeping the first successful result.[^6] Since `Serializable` includes the target type in the serialized data, the correct data type is generally attempted first.
In general, within scientific applications these performance considerations should not matter: data is typically serialized/deserialized only at the beginning/end of a computation, which is not where the speed bottlenecks usually are.
The tests for xarray objects exercise the functionality for writing annex files in the NetCDF4 format, which requires Python 3.11 (because we use `hashlib.file_digest` to compute a unique filename). These tests will only succeed with Python 3.11 or higher. Xarray objects can still be used with lower versions of Python, but they will only export to the NetCDF3 format.
To avoid unnecessarily downloading the massive CUDA dependencies just to run the PyTorch tests, consider installing the test dependencies with the following:
```bash
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install -e .[test]
```
(The CPU-only torch package is 200MB instead of 620MB.)
[^1]: Large arrays are compressed, but saved alongside a human-readable snippet to allow inspection.

[^2]: This is because Pydantic sets the list of serializers at class creation time. A possible workaround was to define a global dictionary of JSON encoders, but the behaviour still depends on the order of imports, and is thus fragile. I have not tested with v2 yet, but it should at least now provide the hooks needed to make this work.

[^3]: Pydantic v2 now offers similar functionality.

[^4]: The corollary of this is that it makes it easier for modules to arbitrarily modify how types are serialized by already imported modules. Thus by adding a new serializer, a new package may break a previously working one. Also, while the hooks for extending type support don't increase the attack surface vis-à-vis a malicious actor (imported Python modules are already allowed to inject code wherever they please), they might make things easier for them.

[^5]: Unless a lot of builtin types are extended with serializers.

[^6]: In Pydantic v1, this try-catch approach to `Union` types was always used. I believe in v2 there is an attempt to identify the correct type (see “smart mode”), but due to Pydantic’s data model (which doesn’t include the type in the serialized data), this may still lead to multiple deserialization calls.