
Scheduled Source [proposed Label] feature #60

Open · pgaref opened this issue Feb 11, 2016 · 6 comments
Comments

@pgaref (Contributor) commented Feb 11, 2016

We need a SEEP Source implementation that is always available, meaning that it is never scheduled. The Source receives data from the network or reads from a file and keeps everything in memory. It then sends the in-memory data to the next dependent stage whenever that stage is available (i.e., scheduled).

For the basic implementation we need:

  • A network Source (keeps data in memory until the next stage is available)
  • A file Source, which could also read from HDFS
  • Modifications to the schedule description, which needs to be aware of the Sources
  • Probably changes to the DataReferenceManager class

Request for comments: @raulcf @WJCIV
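A rough sketch of what such an always-available, buffering Source could look like (the class name `BufferingNetworkSource` and its methods are hypothetical, not the actual SEEP API):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a source that runs continuously (is never scheduled),
// buffers incoming records in memory, and hands them to the next stage only
// when that stage is scheduled and asks for data.
public class BufferingNetworkSource {

    private final BlockingQueue<byte[]> buffer = new LinkedBlockingQueue<>();

    // Called by the network receive loop / file reader; runs all the time.
    public void onRecordReceived(byte[] record) throws InterruptedException {
        buffer.put(record); // unbounded for now; see the memory discussion below
    }

    // Called by the next dependent stage when it gets scheduled.
    public byte[] poll() {
        return buffer.poll(); // null when nothing is buffered
    }
}
```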

@pgaref pgaref changed the title Scheduled Source Scheduled Source label:feature Feb 11, 2016
@pgaref pgaref changed the title Scheduled Source label:feature Scheduled Source Feb 11, 2016
@pgaref pgaref changed the title Scheduled Source Scheduled Source [proposed Label] feature Feb 11, 2016
@WJCIV (Collaborator) commented Feb 16, 2016

Basically you want to make an input reader which pushes data instead of waiting for data to be polled? I'm not sure that requires a new type of source. A new component between the source and the consumer, which polls the source and notifies the consumer, should accomplish that.
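A quick sketch of that intermediate component, with `Source` and `Consumer` as illustrative interfaces rather than real SEEP types:

```java
// Hypothetical poll-and-notify bridge between a pull-based source and a
// push-style consumer; runs on its own thread.
public class SourceNotifier implements Runnable {

    public interface Source { byte[] poll(); }                 // null when empty
    public interface Consumer { void onData(byte[] record); }

    private final Source source;
    private final Consumer consumer;

    public SourceNotifier(Source source, Consumer consumer) {
        this.source = source;
        this.consumer = consumer;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] record = source.poll();
                if (record != null) {
                    consumer.onData(record); // notify the consumer as data arrives
                } else {
                    Thread.sleep(1);         // back off briefly when nothing is pending
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit cleanly on interrupt
        }
    }
}
```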

@raulcf (Owner) commented Feb 16, 2016

We just need a source that is not synthetic. For that we have the FileSource (which reads from a file); we now also have a TextFileSource, and there will soon be an HDFS source. That should cover 60% of the needs of a scheduled job. Let me know if that's not the case.
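For reference, the core of an HDFS-backed source would presumably look something like the following, using the standard Hadoop FileSystem API (the namenode address and path are placeholders; this is not the actual SEEP source code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative HDFS read loop: open a file through the Hadoop FileSystem
// API and stream it line by line, handing each line to the next stage.
public class HdfsReadSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // replace with a call into the next stage
            }
        }
    }
}
```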

@pgaref (Contributor, Author) commented Feb 16, 2016

@WJCIV I don't think we need to change the way data is handled. My main concerns are:

  • Sources should not be part of the schedule description. Does this mean we are going to have two types of workers at the same time, some running in scheduled mode while the ones running the Sources are materialised? This decision, though, is made by the seep-master before loading the query. @raulcf Am I missing something here?
  • We need to be able to specify which specific worker(s) will act as a source (this is a more generic SEEP feature: being able to statically specify where an operator will be deployed). The stage DataReferences should then be updated with the correct data-source address.
  • Of course we already have some Source implementations (I think we will definitely need Network, File and HDFS), but we need to integrate this logic into the scheduled version, where we will face issues like: what happens when the data is not consumed fast enough? Or, the other way round, when the sources cannot produce enough load? Again, correct me if I am missing something.

@raulcf (Owner) commented Feb 16, 2016

Sources should not be part of the schedule description

Both FileSource and TextFileSource are markers. They really only provide information to the next stage on how it should read data. Even if you write a custom source, the assumption right now is that each stage runs till completion.

We need to be able to specify which specific worker(s) will act as a source (this is a more generic SEEP feature: being able to statically specify where an operator will be deployed)

In scheduled mode, we assume data is distributed across all nodes.

I think you are facing a more involved use case. It may be worth having a chat to clarify exactly what you need; that way we can create more specific issues. In the meantime, this thread is very useful.

@WJCIV (Collaborator) commented Feb 16, 2016

What happens when the data is not consumed fast enough?

Currently the size of a Dataset will just grow until we run out of memory. Maybe we should add a maximum size and rate-limit the reader, so that it waits for something to be consumed to free up space?
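As a sketch, a size bound like this could be enforced with a blocking queue, where the input reader blocks on write until the consumer frees space (the bound of 10,000 records is an arbitrary placeholder; this is not the actual Dataset code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a size-bounded Dataset buffer: the input reader blocks when the
// buffer is full, which effectively rate-limits it until space is freed.
public class BoundedDatasetBuffer {

    private static final int MAX_RECORDS = 10_000; // arbitrary placeholder bound

    private final BlockingQueue<byte[]> records = new ArrayBlockingQueue<>(MAX_RECORDS);

    // Writer side (the input reader): blocks until the consumer frees space.
    public void write(byte[] record) throws InterruptedException {
        records.put(record);
    }

    // Reader side (the consumer): taking a record frees space for the writer.
    public byte[] read() throws InterruptedException {
        return records.take();
    }
}
```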

Or, the other way round, when the sources cannot produce enough load?

That's a problem with the current code, and slightly less of a problem with the code in the branch I am editing right now. With the current code the reading buffer in the Dataset catches the writing buffer and calls flip on that buffer. Then the writer must be considered obsolete to avoid potentially overwriting something. The code I am working on detects that the writer has been caught and allocates a new buffer for the writer, so the writer always stays one step ahead. However, if there is nothing to be consumed the read still returns null (since there is not another record pending). In theory you could, as a consumer, know that you haven't actually reached the end of the input and keep polling if you get a null return. Then we have to worry about how you know you've actually reached the end of the stream.
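A much-simplified illustration of the writer-stays-ahead idea, using plain `ByteBuffer`s (here the hand-off happens when a buffer fills rather than when the reader catches the writer, and records are assumed smaller than one buffer; this is a sketch, not the actual Dataset implementation):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: when the current write buffer is exhausted, it is flipped for
// reading and the writer moves on to a freshly allocated buffer, so the
// writer always stays one step ahead of the reader.
public class StayAheadBuffer {

    private static final int BUFFER_SIZE = 4096; // illustrative size

    private final Deque<ByteBuffer> readable = new ArrayDeque<>();
    private ByteBuffer writeBuffer = ByteBuffer.allocate(BUFFER_SIZE);

    public synchronized void write(byte[] record) {
        if (writeBuffer.remaining() < record.length) {
            writeBuffer.flip();                             // make written bytes readable
            readable.addLast(writeBuffer);                  // hand it to the reader side
            writeBuffer = ByteBuffer.allocate(BUFFER_SIZE); // fresh buffer for the writer
        }
        writeBuffer.put(record);
    }

    // Returns null when nothing is pending, mirroring the behaviour above:
    // a null read does not necessarily mean end-of-stream.
    public synchronized ByteBuffer read() {
        return readable.pollFirst();
    }
}
```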

@raulcf (Owner) commented Feb 16, 2016

Currently the size of a Dataset will just grow until we run out of memory. Maybe we should add a maximum size and rate-limit the reader, so that it waits for something to be consumed to free up space?

One natural way of dealing with this problem is to schedule the stage in rounds, instead of scheduling it a single time and waiting until completion. Can someone check what Spark's strategy is in 1.6? I think that's a good model to follow in this case.
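A sketch of what scheduling in rounds might mean at the master, with hypothetical `Stage`/`runRound` names (not the actual SEEP scheduler):

```java
// Hypothetical round-based scheduling loop: instead of running a stage once
// to completion, the master schedules it repeatedly over bounded slices of
// input, keeping memory usage proportional to one round's data.
public class RoundScheduler {

    public interface Stage {
        boolean runRound(int maxRecords); // false once the input is exhausted
    }

    private static final int RECORDS_PER_ROUND = 1_000; // arbitrary bound

    public void schedule(Stage stage) {
        while (stage.runRound(RECORDS_PER_ROUND)) {
            // each iteration is one scheduling round
        }
    }
}
```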

Or, the other way round, when the sources cannot produce enough load?

I don't understand this point.
