Add MessagePack item serializer and validator #628

jarmuszz · 2024-08-18T19:54:28Z

This pull request brings serialization and validation to the msgpack.low module.

Serialization is exposed for external use through the toBinary: Pipe[F, MsgpackItem, Byte] and toNonValidatedBinary: Pipe[F, MsgpackItem, Byte] functions. As with the cbor.low serialization, serializing without validation can potentially produce malformed data.

Validation API is also exposed through validated: Pipe[F, MsgpackItem, MsgpackItem].


The parser mapping ByteVector to MsgpackItem is not injective: many ByteVectors map to the same MsgpackItem. Because of this, we cannot guarantee that `serialize(parse(bs)) == bs` holds for an arbitrary `bs`. However, the currently implemented serializers *are* injective (if we exclude the Timestamp format family, as it can be represented with Extension types), so we can guarantee `serialize(parse(bs)) == bs` whenever `bs` belongs to the subset of ByteVectors that a serializer emits. In other words, the following comparison is true for any `bs`, provided `serialize` is injective and we ignore the Timestamp format family:

```
val first = serialize(parse(bs))
val second = serialize(parse(first))
first == second
```

This test makes sure that the above holds.
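The property above can be illustrated with a toy codec. This is only a sketch under stated assumptions: `RoundTripSketch`, `Item`, `parse` and `serialize` are hypothetical names, bytes are modelled as `Int`s in the range 0 to 255, and only two MessagePack formats are covered (positive fixint and uint 8). It is not the code from this pull request.

```scala
object RoundTripSketch {
  // Toy item model: a single unsigned integer below 256 (hypothetical, not MsgpackItem).
  final case class Item(value: Int)

  // Non-injective parser: both the positive fixint form (one byte, 0x00 to 0x7f)
  // and the uint 8 form (0xcc followed by the value) are accepted.
  def parse(bs: Vector[Int]): Item = bs match {
    case Vector(b) if b >= 0x00 && b <= 0x7f    => Item(b)
    case Vector(0xcc, b) if b >= 0 && b <= 0xff => Item(b)
    case other => throw new IllegalArgumentException(s"unsupported input: $other")
  }

  // Injective serializer: every item has exactly one (the shortest) encoding.
  def serialize(item: Item): Vector[Int] =
    if (item.value <= 0x7f) Vector(item.value) else Vector(0xcc, item.value)
}
```

Here `Vector(0xcc, 0x05)` and `Vector(0x05)` both parse to `Item(5)`, so `serialize(parse(bs))` differs from `bs` for the first input. Once the bytes have passed through the serializer, a second round trip no longer changes them.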


jarmuszz · 2024-09-07T11:04:32Z

The initial concept of having two serializers was dropped as they ran in nearly identical time 😃.

These are the current benchmarks on an i7-8550U:

Benchmark                                       Mode  Cnt     Score    Error  Units
MsgPackItemSerializerBenchmarks.serialize       avgt   10  3337.518 ± 39.333  us/op
MsgPackItemSerializerBenchmarks.withValidation  avgt   10  5107.575 ± 65.148  us/op

Validation adds roughly 50% to the average time (5107.575 vs 3337.518 us/op), and I think it may be possible to make it a little faster.


satabin · 2024-09-07T12:04:11Z

msgpack/src/main/scala/fs2/data/msgpack/low/internal/ItemSerializer.scala

+    *
+    * @param contents buffered [[Chunk]]
+    */
+  private class Out[F[_]](contents: Chunk[Byte]) {


I like the approach. However, this creates a new Out instance for each incoming item. If you run the benchmark with allocation and CPU flame graphs, does it appear to be a bottleneck?
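For context, the general idea of collecting serializer output into 4 KiB segments before emitting it can be sketched with plain Scala collections. This is a minimal illustration, not the `Out` class from the diff (which wraps an fs2 `Chunk[Byte]`): `BufferedOutSketch`, `SegmentSize` and `buffered` are hypothetical names, and the sketch reuses a single mutable buffer rather than allocating a new wrapper per item.

```scala
import scala.collection.mutable.ArrayBuffer

object BufferedOutSketch {
  // Segment size assumed from the "4KiB segments" mentioned in the commit message.
  val SegmentSize: Int = 4096

  // Coalesces many small per-item byte arrays into segments of SegmentSize bytes.
  // Only the last segment may be shorter.
  def buffered(items: Iterator[Array[Byte]]): Iterator[Array[Byte]] =
    new Iterator[Array[Byte]] {
      private val buf = new ArrayBuffer[Byte](SegmentSize)

      // Pulls items until a full segment is available or the input is exhausted.
      private def fill(): Unit =
        while (buf.length < SegmentSize && items.hasNext) buf ++= items.next()

      def hasNext: Boolean = { fill(); buf.nonEmpty }

      def next(): Array[Byte] = {
        if (!hasNext) throw new NoSuchElementException("no more segments")
        val n = math.min(buf.length, SegmentSize)
        val segment = buf.take(n).toArray
        buf.remove(0, n) // keep any overflow for the next segment
        segment
      }
    }
}
```

Whether per-item allocation matters in the real serializer is exactly what the flame graphs would show. The sketch only demonstrates the coalescing behaviour.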


satabin · 2024-09-07T12:08:25Z

Thanks a ton for this new contribution. I had a first look at it and left a few comments. But it looks great already. 👏


ybasket

Looks good overall 👏, I left only some smaller comments.

ybasket · 2024-09-11T09:14:11Z

benchmarks/src/main/scala/fs2/data/benchmarks/MsgPackItemSerializerBenchmarks.scala

+        .compile
+        .string


If I'm not mistaken, this destroys all chunking: you'll work with one gigantic Chunk, which is not what real-life code should usually do. I think you can simply combine those two streams roughly along the lines of:

    fs2.io
      .readClassLoaderResource[SyncIO]("twitter_msgpack.txt", 4096)
      .through(fs2.text.utf8.decode)
      .map(str => Chunk.byteVector(ByteVector.fromHex(str).get))
      .unchunks
      .through(fs2.data.msgpack.low.items[SyncIO])
      .compile
      .toList
      .unsafeRunSync()

ybasket · 2024-09-11T09:16:11Z

msgpack/src/main/scala/fs2/data/msgpack/low/internal/ItemSerializer.scala

+  class MalformedItemError extends Error("item exceeds the maximum size of it's format")
+  class MalformedStringError extends MalformedItemError
+  class MalformedBinError extends MalformedItemError
+  class MalformedIntError extends MalformedItemError
+  class MalformedUintError extends MalformedItemError


Are those mapped somewhere? Since they're inside a private[low] scope, users won't have access to them and can't distinguish what failed. Maybe that's something we'd want to allow? The CSV module has public exception types.

jarmuszz · 2024-09-23

I have decided that having a separate exception for each type is a bit too granular. Instead, there has been a general MsgpackMalformedItemException class since 482bf9e.


ybasket · 2024-09-11T09:37:55Z

msgpack/src/test/scala/fs2/data/msgpack/ValidationSpec.scala

+          .map(_ => failure(s"Expected error for item ${lhs}"))
+          .handleError(expect.same(_, rhs))


Minor: This bit is a little tricky to understand, as it's unusual to expect the exception. I think redeem could make things clearer:

Suggested change

-          .map(_ => failure(s"Expected error for item ${lhs}"))
-          .handleError(expect.same(_, rhs))
+          .redeem(expect.same(_, rhs), _ => failure(s"Expected error for item ${lhs}"))
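The shape of this suggestion can be shown with the standard library alone. The sketch below uses `scala.util.Try` as a stand-in for the effect type, with `fold` playing the role that `redeem` plays in the suggestion. `RedeemSketch`, `Verdict`, `Pass`, `Fail` and both method names are hypothetical and are not part of the test suite in this pull request.

```scala
import scala.util.Try

object RedeemSketch {
  // Hypothetical stand-in for a test verdict.
  sealed trait Verdict
  case object Pass extends Verdict
  final case class Fail(reason: String) extends Verdict

  // Compares the raised error against the expected message.
  private def check(e: Throwable, expected: String): Verdict =
    if (e.getMessage == expected) Pass
    else Fail(s"unexpected error: ${e.getMessage}")

  // Two steps: a success is mapped to a failed verdict, then the error is recovered.
  def mapThenRecover(result: Try[Int], expected: String): Verdict =
    result
      .map[Verdict](_ => Fail("expected an error"))
      .recover { case e => check(e, expected) }
      .get

  // One step: both outcomes are handled side by side, error handler first.
  def folded(result: Try[Int], expected: String): Verdict =
    result.fold(e => check(e, expected), _ => Fail("expected an error"))
}
```

Both functions return the same verdict for the same input. The single-call form simply makes it visible at a glance that the error branch is the expected one.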


jarmuszz and others added 12 commits August 11, 2024 15:58

Add msgpack item serializer and validator

8395a35

Merge branch 'gnieh:main' into main

f051c31

Fix one additional argument being passed

52b48d7

Add tests to cover all fmts for msgpack serializer

7b4246e

And fix issues found during testing

Remove debug code in serializer test

9bc1452

Add validation test cases

eed74a4

Make msgpack item serializer omit leading zeros

54d7d55

This change applies only to types in which leading zeros are insignificant.

Make Extension tests use ByteVector.fill

1ed1f73

Remove redundant padLefts when size is known

671e693

Reformat ValidationSpec.scala

b732583

Remove scaladoc from an embedded function

3f30e33

jarmuszz mentioned this pull request Aug 18, 2024

Add MessagePack support #603

Open

10 tasks

jarmuszz added 5 commits August 24, 2024 15:39

Add benchmars for msgpack item serializer

aa8658a

Merge msgpack serializers

cd9782e

- There was very little performance difference between serializers so the `fast` serializer was entirely scrapped.
- The current serializer buffers the output in 4KiB segments before emitting it. This change brought a significant speedup.

Make SerializerSpec no longer extend Checkers

cdc4894

Make msgpack.low API similar to cbor.low API

309569e

Update msgpack serializer spec documentation

3d717a3

jarmuszz marked this pull request as ready for review September 7, 2024 11:05

jarmuszz requested a review from a team as a code owner September 7, 2024 11:05

satabin reviewed Sep 7, 2024

msgpack/src/main/scala/fs2/data/msgpack/low/internal/ItemSerializer.scala (outdated, resolved)

satabin reviewed Sep 7, 2024

msgpack/src/main/scala/fs2/data/msgpack/low/internal/ItemValidator.scala (outdated, resolved)

satabin reviewed Sep 7, 2024

msgpack/src/main/scala/fs2/data/msgpack/low/package.scala (outdated, resolved)

jarmuszz added 4 commits September 10, 2024 20:08

Change msgpack.low.toBinary scaladoc

deede3f

Reflects changes made in 309569e

Fix msgpack doc generation

698f727

Add doc for msgpack.low public methods

248fbc6

Run prePR

4760221

ybasket reviewed Sep 11, 2024


jarmuszz added 6 commits September 14, 2024 12:07

Extract literals into constants

041e135

Fix msgpack serialization test of negative fixint

2be8831

The serializer itself was corrected in 041e135

Make msgpack Array and Map use Long for sizes

fd845e8

MessagePack Arrays and Maps can hold up to 2^32 - 1 items which is more than the `Int` type can represent without negative values.

Make msgpack exceptions public

482bf9e

Move Pull.pure(None) into a constant

8d67768

Use bit shifts instead of Math.pow(2, n)

05c4c1c

Also drop the `fitsIn` function as we now use `Long`s instead of `Int`s and so we don't need to compare unsigned values.
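The last two commit messages come down to 32-bit arithmetic on the JVM, which a short sketch can make concrete. The array 32 and map 32 formats store their element count as a 32-bit big-endian unsigned integer, so the largest count is 2^32 - 1. The names `SizeLimitsSketch`, `MaxLen32`, `readLength32` and `readLength32AsInt` are hypothetical and only illustrate the reasoning; they are not taken from the pull request.

```scala
object SizeLimitsSketch {
  // Largest element count a 32-bit length header can declare: 2^32 - 1.
  // The shift is done on a Long; on an Int, `1 << 32` evaluates to 1 because
  // the JVM only uses the low five bits of the shift distance.
  val MaxLen32: Long = (1L << 32) - 1

  // Reads a big-endian 4-byte length header into a non-negative Long.
  def readLength32(b0: Byte, b1: Byte, b2: Byte, b3: Byte): Long =
    ((b0 & 0xffL) << 24) | ((b1 & 0xffL) << 16) | ((b2 & 0xffL) << 8) | (b3 & 0xffL)

  // The same four bytes packed into an Int: values above Int.MaxValue wrap
  // around to negative numbers, which is why Int sizes need unsigned comparisons.
  def readLength32AsInt(b0: Byte, b1: Byte, b2: Byte, b3: Byte): Int =
    ((b0 & 0xff) << 24) | ((b1 & 0xff) << 16) | ((b2 & 0xff) << 8) | (b3 & 0xff)
}
```

With `Long` sizes, a plain `<=` against `MaxLen32` is enough to check the bound, and the shift produces the same value as `Math.pow(2, 32).toLong` without going through floating point.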
