
Issue 40: Serializers Implementation for Schema Registry #41

Merged · 100 commits · Jul 21, 2020

Conversation

@shiveshr shiveshr (Contributor) commented Jun 25, 2020

Change log description
Serializers implementation for schema registry.
Refer to https://github.com/pravega/schema-registry/wiki/Sample-Usage to see what usage looks like for Pravega applications.

Purpose of the change
Fixes #40

What the code does
SerializerFactory.java implements several serializers for use in Pravega applications.
It provides both generic and specific deserialization for Avro, Protobuf, and JSON.
It also provides custom serialization support.

Config:
The serializers require a SerializerConfig. At a minimum, the config takes the schema registry client config and a group id.
Users can also build the SerializerConfig with "create group" and "register schema" options, which automatically create the group and register the schema before use.
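A minimal sketch of what building such a config might look like (the builder method names here are assumptions based on this description, not verified against the actual API):

SerializerConfig config = SerializerConfig.builder()
        .groupId("myGroup")                                  // required: group to resolve schemas against
        .registryConfig(SchemaRegistryClientConfig.builder()
                .schemaRegistryUri(URI.create("http://localhost:9092"))
                .build())                                    // required: how to reach the registry service
        .createGroup(SerializationFormat.Avro)               // optional: create the group if it does not exist
        .registerSchema(true)                                // optional: register the schema before use
        .build();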

Serializers:
Each registry-aware serializer or deserializer derives from AbstractSerializer or AbstractDeserializer respectively.
These base classes implement the core logic of talking to the registry service.
All format-specific serializers and deserializers derive from these classes.
We provide specific (typed) and generic deserializers for Protobuf, JSON, and Avro.
JSON additionally has a deserializer that deserializes into a String.
We also provide multiplexed serializers and deserializers for all of the above formats.
There is a provision for users to supply their own custom serializers and deserializers too.
There are also multi-format deserializers, which are capable of deserializing objects serialized with different formats into format-specific generic objects.
Finally, there is a multi-format JSON String deserializer, which deserializes Protobuf, JSON, and Avro into JSON strings.
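For illustration, a rough sketch of using the factory (the method and class names here, including MyEvent, are assumptions based on this description):

// Typed (specific) Avro serializer for MyEvent objects:
Serializer<MyEvent> serializer =
        SerializerFactory.avroSerializer(config, AvroSchema.of(MyEvent.class));

// Generic Avro deserializer: yields a format-specific generic object
// (an Avro GenericRecord) without needing the compiled MyEvent class:
Serializer<GenericRecord> deserializer =
        SerializerFactory.avroGenericDeserializer(config, null);

These would then be handed to Pravega's event writers and readers like any other Serializer.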

Cache:
The serializer factory also supplies an encoding cache to each serializer/deserializer it creates, so that they don't have to talk to the registry service for an encoding id they have already seen.
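Conceptually, the cache is a map from encoding id to resolved encoding info that is consulted before calling the service (a simplified sketch of the idea, not the actual implementation):

private final Map<EncodingId, EncodingInfo> cache = new ConcurrentHashMap<>();

EncodingInfo getEncodingInfo(EncodingId id) {
    // Hit the registry service only on a cache miss.
    return cache.computeIfAbsent(id, key -> registryClient.getEncodingInfo(groupId, key));
}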

Codecs:
A CodecFactory class is included, which provides implementations for Snappy and GZIP codecs.
The serializer config can optionally be built with encoders for serializers and decoders for deserializers.
If an encoder is supplied, every event is encoded with it after serialization.
Zero or more decoders can be supplied to the deserializer. Before creating the deserializer, the serializer factory talks to the registry service to get the list of codec types registered with the service. If the supplied codec types do not match the registered codec types, the "fail on codec type mismatch" config setting determines whether deserializer initialization fails or succeeds.
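A sketch of the wiring described above (builder method names such as encoder, decoder, and failOnCodecMismatch are assumptions, not verified API):

SerializerConfig config = SerializerConfig.builder()
        .groupId("myGroup")
        .registryConfig(registryConfig)
        .encoder(CodecFactory.gzip())       // serializer side: encode every event after serializing
        .decoder(CodecFactory.gzip())       // deserializer side: zero or more decoders
        .decoder(CodecFactory.snappy())
        .failOnCodecMismatch(true)          // governs init when supplied and registered codec types differ
        .build();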

How to verify it
Unit tests added.

shiveshr added 30 commits June 7, 2020 19:23, each signed off by Shivesh Ranjan <[email protected]>
@fpj fpj (Contributor) left a comment

It looks very nice, Shivesh. I have left a few minor comments. The only real concern I have is the split of the serializer into different artifacts.

dependencies {
compile project(':common')
compile project(':client')
compile group: 'org.apache.avro', name: 'avro', version: avroVersion
Contributor

Will it require a change to SerializerFactory? I'm concerned that this could be an API change, and even though we are making this beta, this is an anticipated change, not something that we learned later. If we can do it now, I'd say it is better.

@shrids shrids (Contributor) left a comment

+1 Completed one more pass of the PR. No new comments from my end.


@Override
public void encode(ByteBuffer data, ByteArrayOutputStream bos) throws IOException {
    try (GZIPOutputStream gzipOS = new GZIPOutputStream(bos)) {
Member

This is going to call close on the output stream. I am not sure we want that.
Should these codecs be "stackable"? I.e., could there be a Base64 codec which wraps any one of these compression codecs?
If so, then closing the output stream should be left to the caller.

@shiveshr shiveshr (Contributor, Author) commented Jul 15, 2020

This is going to call close on the output stream. I am not sure we want that. Should these codecs be "stackable"?

The codecs are not stackable.
However, the close is inconsequential here for two reasons:

  1. Encode is the last operation we perform on the output stream.
  2. We use ByteArrayOutputStream, whose close method is a no-op.

So calling close on bos is inconsequential, and the Serializer does not impose any restriction. However, for the sake of the contract, I will update the Javadoc to state explicitly that it is inconsequential if users close the output stream.

Member

Does this mean we are precluding support for Base64?
For Pravega, maybe it never makes sense, because we decode before handing the data to user code (assuming both ends use this client). But is this intended to be used in any other contexts?

@shiveshr shiveshr (Contributor, Author) commented Jul 16, 2020

Does this mean we are precluding support for Base64?

No.
Theoretically, a user could create a codec for encoding and decoding with Base64.

But is this intended to be used in any other contexts?

These serializers are specific to Pravega.

The service currently generates an encoding id for a single codec + schema pair.
If users want to stack multiple encodings, presently they would create a new codec, register its name, and then perform the multiple encodings inside that codec's implementation.

for example:

compressAndBase64Encoder ==>

void encode(byte[] bytes, OutputStream out) throws IOException {
    byte[] compressed = compress(bytes);                  // compress first
    out.write(Base64.getEncoder().encode(compressed));    // then Base64-encode
}

byte[] decode(byte[] bytes) {
    byte[] decoded = Base64.getDecoder().decode(bytes);   // undo Base64 first
    return uncompress(decoded);                           // then decompress
}

If we want, we could change this in the future to allow an ordered list of "codecs" for generating an encoding id. Presently that is not the case, though.
I have created issue #56 to explore the possibilities.

@Ranganaths8 Ranganaths8 requested a review from tkaitchuck July 15, 2020 04:40
dependencies {
compile project(':common')
compile project(':client')
compile group: 'org.apache.avro', name: 'avro', version: avroVersion
Contributor

If applications are supposed to use the specific modules, why do we need the base module? If I understand correctly, the base module is a level of indirection to the specific modules.


shiveshr added 5 commits July 15, 2020 21:34, signed off by Shivesh Ranjan <[email protected]>
shiveshr added 2 commits July 17, 2020 01:06, signed off by Shivesh Ranjan <[email protected]>
@@ -53,6 +53,13 @@ public CodecType getCodecType() {

@Override
public void encode(ByteBuffer data, ByteArrayOutputStream bos) {
    if (data.hasArray()) {
Member

This pattern is repeated in a few places; it may be worth making a utility function.

Contributor (Author)

Created issue #70.

if (this.encodeHeader) {
    SchemaInfo writerSchema = null;
    ByteBuffer decoded;
    if (skipHeaders) {
Member

I am not clear on why one would set encodeHeader to true and skipHeaders to true, as opposed to setting encodeHeader to false. It looks like in both cases writerSchema is null.

Contributor (Author)

skipHeaders is only an optimization flag used during deserialization.
It tells the deserializer to skip the encoded header and not talk to the service to resolve it to the write-time schema; instead, the supplied read-time schema is used to read into.

This is relevant for the Protobuf and JSON typed deserializers, where the user supplies a read-time schema to read into.
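To make the two paths concrete, here is a rough sketch of the branch quoted above (HEADER_SIZE and the helper methods are hypothetical, for illustration only):

if (this.encodeHeader) {
    SchemaInfo writerSchema = null;
    ByteBuffer decoded;
    if (skipHeaders) {
        // Optimization: jump past the header and deserialize directly into the
        // supplied read-time schema, without resolving the write-time schema.
        data.position(data.position() + HEADER_SIZE);
        decoded = data;
    } else {
        // Resolve the encoding id in the header (via the encoding cache or the
        // registry service) to the write-time schema, then decode the payload.
        EncodingInfo encodingInfo = resolveEncoding(data);
        writerSchema = encodingInfo.getSchemaInfo();
        decoded = decode(encodingInfo.getCodecType(), data);
    }
    return deserialize(decoded, writerSchema, readerSchema);
}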
