Composing generators: good idea or great idea? #77
Generally 👍. I have built things somewhat in this direction for work already, but this is slightly more elegant. I definitely agree that there are a number of …
This sounds slick. I'm not 100% sure I understand your example though: is the idea to make a decorator that can wrap simple functions as generators? Or as streamers?
So here's a few thoughts... After mulling over other issues (namely #75), I think transforms would process arbitrary data streams, i.e. an "online" pipeline. So, yeah, this decorator would make a function iterable. The decorator is one possible way of abstracting iteration from processing; otherwise every transform is going to have the same boilerplate iteration loop.

This could also help solve some statistics problems I've had in the past, e.g. counting class occurrence in the data stream for likelihood scaling.

A note on ordering: I think the party line on architecting streams is …
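The class-counting idea above might be sketched as a stream transform along these lines. This is a hypothetical helper, not pescador API; the `'Y'` label key and plain `Counter` tallying are assumptions for illustration:

```python
from collections import Counter

def count_labels(stream, counts):
    """Pass samples through unchanged while tallying class occurrences.

    `counts` is a Counter the caller can inspect afterwards, e.g. for
    likelihood scaling. The 'Y' key is an assumed label field.
    """
    for data in stream:
        counts[int(data['Y'])] += 1
        yield data

# Usage sketch: drain the stream, then inspect the tallies.
counts = Counter()
samples = [{'X': [1], 'Y': 0}, {'X': [2], 'Y': 1}, {'X': [3], 'Y': 0}]
for data in count_labels(iter(samples), counts):
    pass
print(counts)  # Counter({0: 2, 1: 1})
```

Because the transform yields each sample untouched, it composes freely with the other transforms discussed in this thread.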
This basically sounds like … What I'm not quite getting is how this would shake out in practice. Where do the inputs to the function/streamer come from? Maybe it would help if you could mock up an example in the syntax that you have in mind?
Note: motivated by this PR comment here, so that it'll be easier to find in the future (rather than bury it in a PR thread). Definition: a … One piece of functionality I think is ripe for a stream transform is …
But good news! We could implement:

```python
def tuples(stream, keys):
    for data in stream:
        yield tuple(data[key] for key in keys)
```

Which could be used as follows:

```python
def data_generator():
    yield {'X': np.array([1]), 'Y': np.array([5])}

stream = pescador.Streamer(data_generator)

for data in tuples(stream(max_iter=20), keys=['Y', 'X']):
    print(data)  # Displays (array([5]), array([1]))
```

Added bonus: this resolves the kwarg parsing (and possible conflicts) currently handled by the …

Starting to be clearer, @bmcfee?
PEDANTRY ALERT @ejhumphrey: the usual terminology for this kind of operation is …

I don't understand what the problem of keeping the … OTOH, I guess I'm okay with your proposed example. The question mark for me has been how tuples would interact with buffering. But, I guess:

```python
raw_stream = pescador.Streamer(data_generator)
stream = pescador.Streamer(buffer, raw_stream, n=3)

for data in tuples(stream, keys=['X', 'Y']):
    print(data)  # Displays (array([[1],[1],[1]]), array([[5],[5],[5]]))
```
Yes! It's a …

Now, to the matter at hand, working backwards from your two comments:

re: OTOH...

```python
def data_generator():
    yield {'X': np.array([1]), 'Y': np.array([5])}

def tuples(stream, keys):
    for data in stream:
        yield tuple(data[key] for key in keys)

def buffer_data(stream, buffer_size):
    buff = []
    for data in stream:
        buff.append(data)
        if len(buff) == buffer_size:
            yield {key: np.array([x[key] for x in buff])
                   for key in data}
            buff = []

stream = pescador.Streamer(data_generator)
batches = buffer_data(stream, buffer_size=3)

for data in tuples(batches, keys=['Y', 'X']):
    print(data)  # Displays (array([[5],[5],[5]]), array([[1],[1],[1]]))
```

Reflecting for a second, I think the reason that I've advocated for a different idiom for …
Interestingly, the … This is elegant / makes sense, so I 👍.

re: the problem of keeping tuples in Streamer. Consider the following:

```python
values = [0, 1, 2]
streamer = pescador.Streamer(values)

for x in streamer.tuples(0):
    print(x)
```

Which would raise something like …

That said, I'm willing to concede that this is more LBYL than EAFP, and if we leave …
Countering this, I could imagine situations in which you'd want to multiplex over batches. But you could easily do that if …
... yeah, I'm not buying it. 0 is a valid dictionary key, so there's no reason it would know to fail until you go to access the data dict.
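To make the point above concrete: integer keys are perfectly legal in Python dicts, so a `keys` argument of `0` can only fail lazily, when a sample lacking that key is actually accessed. A small standalone illustration (not pescador code):

```python
# A dict with an integer key is perfectly valid...
data = {0: 'a', 'X': [1]}
print(data[0])  # 'a'

# ...so tuples(stream, keys=[0]) has no way to know the key is "wrong"
# at construction time. The failure only surfaces when a sample that
# lacks the key is accessed:
sample = {'X': [1], 'Y': [5]}
try:
    sample[0]
except KeyError as exc:
    print('KeyError raised lazily:', exc)
```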
Just because an object is constructed well does not mean that any method call is fair game. You can always throw garbage into any method that accepts parameters.

To be clear: I think I'm on board with moving … Actually, I'd like to expand …
After this deprecation PR is out the door, I suppose I'll find out 😄
Fair enough.
I'm not sure what you mean at present, but I look forward to learning about it later.
We could figure this out now. My main concern here is that I want buffering to play nicely with the ZMQStreamer, given that it's a high-latency operation.

```python
raw_stream = [some nonsense]
buf_stream = Streamer(buffer_samples, 32, raw_stream)
parallel_stream = ZMQStreamer(buf_stream)

for batch in parallel_stream:
    # do some stuff
```

That doesn't seem too bad.
Oh, basically just that if you have multiple inputs or outputs, they get packed in the tuple as …

```python
train_gen = tuples(streamer, inputs=['X1', 'X2'], outputs=['Y1', 'Y2', 'Y3'])
```
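One way the multi-input/output packing could shake out is a nested-tuple convention: all inputs in one tuple, all outputs in another. This is a sketch of the idea, not settled pescador API; the packing order is an assumption:

```python
def tuples(stream, inputs, outputs):
    """Yield ((inputs...), (outputs...)) pairs from a stream of dicts.

    Hypothetical multi-I/O variant of the `tuples` transform discussed
    above; the exact nesting convention is an assumption.
    """
    for data in stream:
        yield (tuple(data[key] for key in inputs),
               tuple(data[key] for key in outputs))

# Usage sketch with scalar placeholders instead of arrays:
samples = iter([{'X1': 1, 'X2': 2, 'Y1': 3, 'Y2': 4, 'Y3': 5}])
pair = next(tuples(samples, inputs=['X1', 'X2'], outputs=['Y1', 'Y2', 'Y3']))
print(pair)  # ((1, 2), (3, 4, 5))
```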
Three quick things:
- I agree, that code block seems fine.
- I guess I was proposing that we modify / implement …
- Multiple I/O -- I get it.
fwiw, I suspect buffering is high-latency at the moment because of how …
Eh, I think it's easiest to leave the current stuff in place, deprecate it later, and replace it with a similarly named function that behaves the way you want.
I'm not sure. There's a lot of weird control flow in there, but I think the biggest hit comes from all the memory copies.
😳 what if I've already done it...?
Closed with #88.
Part of what I dig about pescador (and leveraging generators for data sampling) is that it's easy to build pipelines of well-encapsulated transforms. They could be used for anything from data augmentation, e.g. additive noise, to glue layers, e.g. "I need one-hot encoded vectors from integer indexes". While pumpp aims to achieve this, I imagine the pescador audience will be broader, and some simple tools would be good to help outline (what we think are) good design practices.

For example, I've been thinking for some time about having a pescador.transforms submodule which provides some decorators to turn sample-wise functions into iterable transforms. This could allow for some easy fixes, like for #76. I'd want to disable the decorator for testing, but I'm not sure that's possible (and it's not the end of the world).

This would also allow for some other reshaping / renaming that goes on regularly, such as transforming dicts to tuples.

Thoughts? Is this a good idea?
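A minimal sketch of what such a decorator might look like. The name `streamable` and the example function are placeholders, not pescador API:

```python
import functools

def streamable(func):
    """Lift a sample-wise function into a generator transform.

    The wrapped function receives one sample at a time; the wrapper
    applies it across an entire stream of samples.
    """
    @functools.wraps(func)
    def transform(stream, *args, **kwargs):
        for data in stream:
            yield func(data, *args, **kwargs)
    return transform

@streamable
def shift(data, offset=1):
    # A stand-in for sample-wise augmentation (e.g. additive noise);
    # deterministic here so the output is easy to check.
    return {key: value + offset for key, value in data.items()}

# Usage sketch: the decorated function now consumes an iterable.
samples = iter([{'X': 1}, {'X': 2}])
out = list(shift(samples, offset=10))
print(out)  # [{'X': 11}, {'X': 12}]
```

Because the decorated function is just a generator over dicts, it would slot into the same pipelines as `tuples` and `buffer_data` above.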