Logical types #110

Open
josephglanville opened this issue Dec 2, 2019 · 3 comments

Comments

@josephglanville

Implementing logical types that provide automatic serialisation and deserialisation of higher level types into/from Avro primitive types is a highly desirable feature and is implemented by https://github.com/linkedin/goavro.

Logical types are defined and described in the spec thus:

A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.

A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute. Language implementations may choose to represent logical types with an appropriate native type, although this is not required.

Language implementations must ignore unknown logical types when reading, and should use the underlying Avro type. If a logical type is invalid, for example a decimal with scale greater than its precision, then implementations should ignore the logical type and use the underlying Avro type.

The last paragraph is somewhat ambiguous. As written, it implies that only the validity of the type declaration matters. That isn't a problem for writing, and it shouldn't be a problem for reading either, as the generated code always represents the reader schema.

However if you instead interpret it to mean that it should fall back to the underlying type when logical decoding fails then it's significantly less elegant, as the generated types would then also need a fallback field in the struct to hold the underlying value when logical decoding fails.
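To make the cost of that second reading concrete, here is a rough sketch of what a generated field might have to look like for a `uuid`. Every name in it (`UUIDField`, `parseUUID`, `decodeUUIDField`) is made up for illustration; none of this is existing gogen-avro output.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// UUIDField is a hypothetical generated field type under the
// "fall back when logical decoding fails" reading. Only one of
// Logical and Raw is meaningful, selected by IsLogical.
type UUIDField struct {
	IsLogical bool
	Logical   [16]byte // parsed UUID, valid only when IsLogical is true
	Raw       string   // underlying Avro string, kept when parsing fails
}

// parseUUID parses the canonical 8-4-4-4-12 hex form into 16 bytes.
func parseUUID(s string) ([16]byte, error) {
	var id [16]byte
	if len(s) != 36 || s[8] != '-' || s[13] != '-' || s[18] != '-' || s[23] != '-' {
		return id, fmt.Errorf("malformed uuid %q", s)
	}
	b, err := hex.DecodeString(strings.ReplaceAll(s, "-", ""))
	if err != nil {
		return id, fmt.Errorf("malformed uuid %q: %w", s, err)
	}
	if len(b) != 16 {
		return id, fmt.Errorf("malformed uuid %q", s)
	}
	copy(id[:], b)
	return id, nil
}

// decodeUUIDField never fails: on a parse error it stores the raw
// string in the fallback field instead of returning an error.
func decodeUUIDField(raw string) UUIDField {
	id, err := parseUUID(raw)
	if err != nil {
		return UUIDField{Raw: raw}
	}
	return UUIDField{IsLogical: true, Logical: id}
}
```

Every consumer of such a field would have to check `IsLogical` before touching the value, which is the inelegance I mean.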

The Python library fastavro takes the approach of simply throwing a runtime error on invalid data. I will examine some of the other libraries to work out if there is a consensus on what the spec means here.

@rogpeppe
Contributor

However if you instead interpret it to mean that it should fall back to the underlying type when logical decoding fails

There's only one case that I can see in the spec where decoding the underlying value for a logical type might fail, and that's when decoding a uuid type, which could fail to conform to the expected regular expression (aside: I wonder why they didn't choose a fixed type of size 16 for UUID, given that an RFC 4122 UUID is always 128 bits; then this issue wouldn't arise). For other logical types, ISTM that every possible bit representation is valid.

For UUID mismatch, I think it would be fine to throw a runtime error.

In any case, the error there isn't that the logical type is unknown (it's clearly the known logical type uuid), but that the value isn't well formed for the logical type, which is a different issue.
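A minimal sketch of what that runtime check could look like, assuming the uuid is carried as an Avro string in the canonical 8-4-4-4-12 hex form. The names `ErrMalformedLogicalValue` and `checkUUID` are hypothetical, and the regular expression is my own reading of the RFC 4122 text form, not something taken from gogen-avro.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// ErrMalformedLogicalValue (hypothetical name) marks data that does not
// fit a known logical type. It is deliberately separate from "unknown
// logical type", which is a schema-level condition, not a data-level one.
var ErrMalformedLogicalValue = errors.New("malformed value for logical type")

// uuidRE matches the canonical textual form of a UUID.
var uuidRE = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// checkUUID returns an error wrapping ErrMalformedLogicalValue when s
// is not a well-formed UUID string, and nil otherwise.
func checkUUID(s string) error {
	if !uuidRE.MatchString(s) {
		return fmt.Errorf("uuid %q: %w", s, ErrMalformedLogicalValue)
	}
	return nil
}
```

Callers that care about the difference could test for it with `errors.Is(err, ErrMalformedLogicalValue)`.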

For the other cases it mentions (e.g. an invalid decimal type), I think the right approach would be to discard the logical type information when parsing the schema.
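Something along these lines, as a sketch only. `schemaNode`, `validDecimal` and `parseNode` are hypothetical and much simpler than a real schema parser; the validity rules are the ones the spec gives for decimal (precision greater than zero, scale between zero and precision, underlying type bytes or fixed).

```go
package main

import "encoding/json"

// schemaNode is a hypothetical, minimal view of one Avro schema node.
type schemaNode struct {
	Type        string `json:"type"`
	LogicalType string `json:"logicalType,omitempty"`
	Precision   int    `json:"precision,omitempty"`
	Scale       int    `json:"scale,omitempty"`
}

// validDecimal applies the spec's constraints on a decimal declaration.
func validDecimal(n schemaNode) bool {
	underlyingOK := n.Type == "bytes" || n.Type == "fixed"
	return underlyingOK && n.Precision > 0 && n.Scale >= 0 && n.Scale <= n.Precision
}

// parseNode decodes a schema node and discards the logical type
// annotation when the decimal declaration is invalid, leaving only the
// underlying Avro type.
func parseNode(data []byte) (schemaNode, error) {
	var n schemaNode
	if err := json.Unmarshal(data, &n); err != nil {
		return n, err
	}
	if n.LogicalType == "decimal" && !validDecimal(n) {
		n.LogicalType, n.Precision, n.Scale = "", 0, 0
	}
	return n, nil
}
```

With this, a declaration with precision 2 and scale 4 comes back as a plain `bytes` node, and the rest of the code never sees the bad annotation.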

I've realised that there is actually an ambiguity in the spec. It says "Language implementations must ignore unknown logical types when reading" but doesn't make it clear if that applies to unknown logical types in the writer's schema or in the reader's schema. In the spirit of the spec, I think they probably meant the former, making it possible for a reader to read the underlying value even when it was written with a schema with an unknown logical type.

For errors in the reader's schema, I think it might be good for gogen-avro to at least have an option to give an error on unknown or malformed logical types, even if that's not the default, because otherwise it would be easy for mistakes to be inappropriately ignored.
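Roughly what I have in mind for that option, as a sketch. The `strict` flag and `resolveLogicalType` are hypothetical (gogen-avro has no such option today as far as this thread goes), and the set of known names below is illustrative rather than a statement of what would be supported.

```go
package main

import "fmt"

// knownLogicalTypes is an illustrative set of logical type names
// taken from the Avro spec.
var knownLogicalTypes = map[string]bool{
	"decimal":          true,
	"uuid":             true,
	"date":             true,
	"time-millis":      true,
	"time-micros":      true,
	"timestamp-millis": true,
	"timestamp-micros": true,
	"duration":         true,
}

// resolveLogicalType decides what to do with a logicalType attribute
// found in the reader's schema. In lenient mode an unknown name is
// dropped (empty result, no error); in strict mode it is an error.
func resolveLogicalType(name string, strict bool) (string, error) {
	if name == "" || knownLogicalTypes[name] {
		return name, nil
	}
	if strict {
		return "", fmt.Errorf("unknown logical type %q in reader schema", name)
	}
	return "", nil
}
```

The lenient default keeps to the letter of the spec, while strict mode would catch a typo such as `timestamp-milis` at generation time.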

One other thing about logical types: does anyone know of a decent Go package that implements fixed-precision decimal support in the style specified by Avro? There's a multi-precision decimal package, but that seems like it might be a bit heavyweight.

@josephglanville
Author

So I think that is probably a reasonable compromise: ignore invalid logical type definitions, but raise errors for valid ones when the input is malformed (e.g. UUIDs).

I haven't had time to dig into other implementations, as things were very busy at the end of the year and now it's the holiday break. When I am back on the tools I will try to dig into this and establish what the consensus is.

Regarding the decimal library, I don't think there are any libraries that implement a fixed-precision decimal quite in the style of the Avro spec. We are using github.com/shopspring/decimal and some conversion code I gisted here: https://gist.github.com/josephglanville/d1453fcf8a249721950026c0e376810a.
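For anyone who just wants to see the shape of the conversion, here is a standard-library-only sketch. It is not the code from the gist and it uses `math/big` instead of shopspring/decimal; `decimalFromBytes` is a hypothetical name. It relies on the spec's encoding: the bytes hold the two's-complement, big-endian unscaled integer, and the scale comes from the schema.

```go
package main

import "math/big"

// decimalFromBytes interprets b as a big-endian two's-complement
// unscaled integer and returns unscaled / 10^scale as an exact rational.
// It assumes scale >= 0, as the Avro spec requires.
func decimalFromBytes(b []byte, scale int) *big.Rat {
	// SetBytes reads b as an unsigned big-endian integer.
	unscaled := new(big.Int).SetBytes(b)
	if len(b) > 0 && b[0]&0x80 != 0 {
		// Sign bit set: subtract 2^(8*len(b)) to get the negative value.
		modulus := new(big.Int).Lsh(big.NewInt(1), uint(8*len(b)))
		unscaled.Sub(unscaled, modulus)
	}
	denom := new(big.Int).Exp(big.NewInt(10), big.NewInt(int64(scale)), nil)
	return new(big.Rat).SetFrac(unscaled, denom)
}
```

For example, the bytes `0x04 0xD2` with scale 2 decode to 1234/100, and `FloatString(2)` on the result renders that as a fixed-point string.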

@AtakanColak

@josephglanville Maybe there is a way to represent LogicalTypes with the schema package? I tried to add it as part of definitions but when I do a union over it, it breaks.
