Scheduled Source [proposed Label] feature #60
Comments
Basically you want to make an input reader which pushes data instead of waiting for data to be polled? I'm not sure that's a new type of source. A new component between the source and the consumer, which polls the source and notifies the consumer, should accomplish that.
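A minimal sketch of such a bridging component, assuming hypothetical Source and Consumer interfaces (these names are illustrative, not SEEP's actual API):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical interfaces for illustration only.
interface Source { byte[] poll(); }                // returns null when nothing is pending
interface Consumer { void onData(byte[] record); }

// Bridges a polled source to a push-style consumer: a dedicated thread polls
// the source and notifies the consumer whenever a record arrives.
class PushBridge implements Runnable {
    private final Source source;
    private final Consumer consumer;
    private volatile boolean running = true;

    PushBridge(Source source, Consumer consumer) {
        this.source = source;
        this.consumer = consumer;
    }

    @Override
    public void run() {
        while (running) {
            byte[] record = source.poll();
            if (record != null) {
                consumer.onData(record);              // push, instead of waiting to be polled
            } else {
                try {
                    TimeUnit.MILLISECONDS.sleep(10);  // back off briefly when the source is empty
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    void stop() { running = false; }
}
```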
We just need to have a source that is not synthetic. For that we have the FileSource (which reads from a file), we now also have a TextFileSource, and there will soon be an HDFS source. That should cover 60% of the needs for a scheduled job. Let me know if it's otherwise.
@WJCIV I don't think we need to change the way data are being handled. My main concerns are:
Both FileSource and TextFileSource are markers: they really only provide information to the next stage on how it should read data. Even if you write a custom source, the assumption right now is that each stage runs to completion.
In scheduled mode, we assume data is distributed across all nodes. I think you are facing a more involved use case. It may be worth having a chat to clarify exactly what it is that you need, so that we can create more specific issues. In the meantime, this thread is very useful.
Currently the size of a Dataset will just grow until we run out of memory. Maybe we should add a maximum size and rate-limit the reader so it waits for something to be consumed to free up space?
That's a problem with the current code, and slightly less of a problem with the code in the branch I am editing right now. With the current code, the reading buffer in the Dataset catches the writing buffer and calls flip on that buffer; the writer must then be considered obsolete to avoid potentially overwriting something. The code I am working on detects that the writer has been caught and allocates a new buffer for the writer, so the writer always stays one step ahead. However, if there is nothing to be consumed, the read still returns null (since there is no other record pending). In theory you could, as a consumer, know that you haven't actually reached the end of the input and keep polling if you get a null return. Then we have to worry about how you know you've actually reached the end of the stream.
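A rough sketch of the caught-writer case described above, using plain java.nio.ByteBuffer; the class and method names are illustrative, not the actual Dataset code:

```java
import java.nio.ByteBuffer;

class DoubleBufferedDataset {
    private static final int BUFFER_SIZE = 4096;
    private ByteBuffer writeBuffer = ByteBuffer.allocate(BUFFER_SIZE);

    // Invoked when the reader catches the writer: flip the caught buffer so it
    // can be read, and hand the writer a fresh buffer so it always stays one
    // step ahead instead of being invalidated.
    synchronized ByteBuffer catchWriter() {
        ByteBuffer caught = writeBuffer;
        caught.flip();                                   // switch from writing mode to reading mode
        writeBuffer = ByteBuffer.allocate(BUFFER_SIZE);  // writer continues on a new buffer
        return caught;
    }

    synchronized void write(byte[] record) {
        writeBuffer.put(record);                         // overflow handling omitted for brevity
    }

    // As discussed above, a read when nothing is pending still returns null,
    // so null alone cannot distinguish "no record yet" from end-of-stream.
}
```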
One natural way of dealing with this problem is to schedule the stage in rounds, rather than a single time, waiting until completion. Can someone check what Spark's strategy is in 1.6? I think that's a good model to follow in this case.
I don't understand this point.
We need a SEEP-Source implementation that is always available, meaning that it will never be scheduled. The Source receives data from the network or reads from a file, and keeps everything in memory. It then sends the in-memory data to the next dependent stage whenever that stage is available (being scheduled); see the sketch below.
For the basic implementation we need:
Request for comments: @raulcf @WJCIV
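For illustration, a minimal sketch of the always-available source described above, assuming a hypothetical API in which the scheduler drains the source whenever the dependent stage runs (names are illustrative, not SEEP's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Always available: this source is never scheduled itself; it keeps receiving
// data (from the network or from a file) and buffers everything in memory.
class AlwaysOnSource {
    private final BlockingQueue<byte[]> inMemory = new LinkedBlockingQueue<>();

    // Called by the network/file reader thread as data arrives.
    void receive(byte[] record) {
        inMemory.offer(record);
    }

    // Called whenever the dependent stage is scheduled: hand over whatever has
    // accumulated in memory since the last round.
    List<byte[]> drainForDependentStage() {
        List<byte[]> batch = new ArrayList<>();
        inMemory.drainTo(batch);
        return batch;
    }
}
```

Note that without a maximum size this buffer has the same unbounded-growth problem raised earlier in the thread; a bounded queue (e.g. new LinkedBlockingQueue<>(capacity)) would apply back-pressure to the receiving thread.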