Binary input and output use different formats #74

Open
WJCIV opened this issue Apr 6, 2016 · 1 comment
Comments

@WJCIV
Collaborator

WJCIV commented Apr 6, 2016

Binary input requires a control byte and batch size information before a set of tuples, whereas the output contains only the tuples themselves. As a result, the output cannot be used as input without first passing it through an intermediate program that inserts the batch information.

The solution is probably a flag in the WorkerConfig class which indicates whether batch information is present in an input file. We will need to alter the code in InputBuffer.readFrom to handle the case where this flag is set to false.
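
To illustrate what such a flag could control, here is a minimal sketch. It is not SEEP's actual `InputBuffer` code or wire layout: the class `BatchFraming`, its methods, and the header layout (one control byte followed by a 4-byte payload length) are all assumptions made for the example.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: not SEEP's actual InputBuffer or wire format.
final class BatchFraming {
    // Assumed header for illustration only: 1 control byte + 4-byte payload length.
    static final int HEADER_BYTES = 1 + Integer.BYTES;

    /** Prepends the assumed header to a payload of raw tuples. */
    static byte[] addBatchHeader(byte controlByte, byte[] rawTuples) {
        ByteBuffer out = ByteBuffer.allocate(HEADER_BYTES + rawTuples.length);
        out.put(controlByte);
        out.putInt(rawTuples.length);
        out.put(rawTuples);
        return out.array();
    }

    /**
     * Returns the tuple payload. The hasBatchInfo parameter plays the role of
     * the flag proposed for WorkerConfig (the parameter name is an assumption).
     */
    static byte[] readTuples(byte[] input, boolean hasBatchInfo) {
        if (!hasBatchInfo) {
            // No header present: the whole input is tuple data.
            return input.clone();
        }
        ByteBuffer in = ByteBuffer.wrap(input);
        in.get();                      // skip the control byte
        int length = in.getInt();      // payload length from the assumed header
        byte[] tuples = new byte[length];
        in.get(tuples);
        return tuples;
    }
}
```

With the flag set to false, a file produced by the binary output could be read back directly, with no header to parse.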

@raulcf
Owner

raulcf commented Apr 6, 2016

Thanks for clarifying this. My thoughts are below.

When reading binary data, we need to know its serialization format. If it's serialized with Kryo, then Kryo knows how to read it, and we should have a reader that knows this. If it's serialized with Thrift, the same applies. If it is serialized with SEEP, we know the format: it consists of [record_size][control_byte]...

So the idea is that when reading binary data, we need to provide the serialization format. Right now, we always assume that the binary data is in SEEP native format.

So the first thing to solve is knowing in which format the binary data is serialized. This information should be contained in DataStore. There is a DataStore constructor that receives a Properties object. When creating a FileSource, for example, it is necessary to give it a Properties object that is in turn passed to the DataStore constructor. This means that DataStore knows the serialization format (specified in the Properties object through the serde.type parameter of FileConfig).
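
A rough sketch of reading that setting from a `Properties` object follows. Only the key `serde.type` comes from the discussion above. The helper class `SerdeTypeLookup`, the default value `seep`, and the example value `Kryo` are assumptions for illustration and are not taken from the SEEP code.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper: not part of SEEP's DataStore or FileConfig.
final class SerdeTypeLookup {
    // Key name as mentioned in the discussion.
    static final String SERDE_TYPE_KEY = "serde.type";
    // Assumed default: treat data as SEEP native when nothing is specified.
    static final String DEFAULT_SERDE_TYPE = "seep";

    /** Returns the normalized (trimmed, lower-case) serde type from the properties. */
    static String serdeTypeOf(Properties props) {
        return props.getProperty(SERDE_TYPE_KEY, DEFAULT_SERDE_TYPE)
                    .trim()
                    .toLowerCase(Locale.ROOT);
    }
}
```

Falling back to the native format when the key is missing would keep the current behavior for existing configurations.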

We'll need a factory class that creates an appropriate serde according to the type, reads the data, and transforms it into SEEP native format, that is, adding the control byte, batch size, and so on.
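
One possible shape for such a factory is sketched below. Everything here is hypothetical: the class `SerdeFactory`, the use of `Function<byte[], byte[]>` as the converter type, and the type name `seep` are assumptions, and no real Kryo or Thrift calls are shown. Converters for those formats would be registered by whoever needs them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a serde factory; not an existing SEEP class.
final class SerdeFactory {
    // Each converter turns bytes in some external format into SEEP-native bytes.
    private final Map<String, Function<byte[], byte[]>> converters = new HashMap<>();

    SerdeFactory() {
        // Data that is already in native format needs no conversion.
        converters.put("seep", bytes -> bytes);
    }

    /** Registers a converter for a serde type, e.g. one backed by Kryo or Thrift. */
    void register(String serdeType, Function<byte[], byte[]> converter) {
        converters.put(serdeType, converter);
    }

    /** Looks up the converter for the given serde type, failing fast if unknown. */
    Function<byte[], byte[]> create(String serdeType) {
        Function<byte[], byte[]> converter = converters.get(serdeType);
        if (converter == null) {
            throw new IllegalArgumentException("Unknown serde.type: " + serdeType);
        }
        return converter;
    }
}
```

Keeping the registry open means a new format can be supported without changing the factory itself.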

We can revisit this issue and go into the details once someone hits the problem.
