Binary input and output use different formats #74

WJCIV · 2016-04-06T16:00:15Z

Binary input requires a control byte and batch size information before a set of tuples, whereas the output just contains the tuples themselves. This means that the output cannot be used as input without first putting it through some sort of intermediate program to insert the batch information.

The solution is probably a flag in the WorkerConfig class which indicates whether batch information is present in an input file. We will need to alter the code in InputBuffer.readFrom to handle the case where this flag is set to false.
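A rough sketch of how such a flag could drive the read path. Everything here is an assumption made for illustration: `BatchAwareReader` is a hypothetical stand-in, not the real `InputBuffer.readFrom`, and the header layout (one control byte followed by a 4-byte length) is only a guess at the native format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical reader; the boolean mirrors the flag proposed for WorkerConfig.
final class BatchAwareReader {
    private final boolean batchInfoPresent;

    BatchAwareReader(boolean batchInfoPresent) {
        this.batchInfoPresent = batchInfoPresent;
    }

    // Assumed layout when the header is present:
    //   [1 control byte][4-byte big-endian payload length][payload]
    // When the header is absent, the whole stream is treated as tuples.
    byte[] readTuples(InputStream in) throws IOException {
        if (!batchInfoPresent) {
            return readAll(in);
        }
        DataInputStream data = new DataInputStream(in);
        data.readByte();              // control byte, ignored in this sketch
        int length = data.readInt();  // assumed to be the payload size in bytes
        if (length < 0) {
            throw new IOException("Negative batch length: " + length);
        }
        byte[] payload = new byte[length];
        data.readFully(payload);
        return payload;
    }

    // Drains the stream into a byte array.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

With the flag set to false, a file produced by the binary output could be fed straight back in as input, since no header is expected.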

raulcf · 2016-04-06T23:57:29Z

Thanks for clarifying this. My thoughts follow.

When reading binary data we need to know its serialization format. If it's serialized with Kryo, then Kryo knows how to read it, and we should have a reader that knows this. The same goes for Thrift. If it is serialized with SEEP, we know the format: it consists of [record_size][control_byte]...

So the idea is that when reading binary data, we need to provide the serialization format. Right now, we always assume that the binary data is in SEEP native format.

So the first thing to solve is knowing which format the binary data is serialized in. This information should be contained in DataStore. There is a DataStore constructor that receives a Properties object. When creating a FileSource, for example, it is necessary to give it a Properties object that is in turn passed to the DataStore constructor. This means that DataStore knows the serialization format (specified in the Properties object through the serde.type parameter of FileConfig).

We'll need a factory class that creates an appropriate serde according to the type, reads the data, and transforms it into SEEP native format, that is, adding the control byte, batch size, etc.
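To make the factory idea concrete, here is a minimal sketch. Only the `serde.type` key comes from the discussion above; `BinaryReader`, `ReaderFactory`, the type values `"native"` and `"raw"`, and the header layout (one control byte plus a 4-byte length) are hypothetical. Kryo and Thrift readers are left out, since they would depend on those libraries.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical reader abstraction: converts input bytes to the assumed native framing.
interface BinaryReader {
    byte[] toNative(byte[] input) throws IOException;
}

final class ReaderFactory {
    static final String SERDE_TYPE = "serde.type";

    // Picks a reader from the serde.type property; defaults to native,
    // matching the current behaviour of always assuming native format.
    static BinaryReader create(Properties props) {
        String type = props.getProperty(SERDE_TYPE, "native");
        switch (type) {
            case "native":
                return input -> input;           // already framed, pass through
            case "raw":
                return ReaderFactory::addHeader; // bare tuples, add batch info
            default:
                throw new IllegalArgumentException("Unsupported serde.type: " + type);
        }
    }

    // Prepends the assumed header: [1 control byte][4-byte payload length].
    private static byte[] addHeader(byte[] tuples) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(0);             // placeholder control byte
        out.writeInt(tuples.length);  // batch size, assumed to be in bytes
        out.write(tuples);
        out.flush();
        return bytes.toByteArray();
    }
}
```

A caller would set `serde.type` in the Properties object handed to the source and ask the factory for a reader, so supporting a new format means adding one more case rather than touching the read path.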

We can revisit this issue and go into the details once someone hits the problem.
