Composing generators: good idea or great idea? #77

Closed · ejhumphrey opened this issue Apr 3, 2017 · 14 comments

@ejhumphrey (Collaborator) commented Apr 3, 2017

Part of what I dig about pescador (and leveraging generators for data sampling) is that it's easy to build pipelines of well-encapsulated transforms. They could be used for anything from data augmentation, e.g. additive noise, to glue layers, e.g. "I need one-hot encoded vectors from integer indexes". While pumpp aims to achieve this, I imagine the pescador audience will be broader, and some simple tools would help outline (what we think are) good design practices.

For example, I've been thinking for some time about having a pescador.transforms submodule which provides some decorators to turn sample-wise functions into iterable transforms, like...

def stateless(func):
    def wrapped(gen, *args, **kwargs):
        for samp in gen:
            yield func(samp, *args, **kwargs)
    return wrapped

This could allow for some easy fixes, like for #76 ...

@stateless
def newaxis(sample, *keys):
    for key in keys:
        sample[key] = sample.pop(key)[np.newaxis, ...]
    return sample

# Given a sample generator `stream`...
pescador.buffer_batch(newaxis(stream, 'x_in', 'y_true'), 10)

I'd want to disable the decorator for testing, but I'm not sure that's possible (and it's not the end of the world). This would also allow for some other reshaping / renaming that goes on regularly, such as transforming dicts to tuples:

@stateless
def values(sample, *keys):
    return [sample[key] for key in keys]
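
Re: disabling the decorator for testing, one simple workaround (just a sketch, names hypothetical) is to keep the sample-wise function un-decorated, test it directly, and only wrap it when building a pipeline:

def _newaxis_sample(sample, *keys):
    # plain, directly testable sample-wise function
    for key in keys:
        sample[key] = sample.pop(key)[np.newaxis, ...]
    return sample

# iterable version for use in pipelines
newaxis = stateless(_newaxis_sample)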

thoughts? Is this a good idea?

@cjacoby (Collaborator) commented Apr 3, 2017

Generally 👍 . I have built things somewhat in this direction for work already, but this is slightly more elegant.

I definitely agree that there are a number of transforms that are likely to be common use cases and useful.

@bmcfee (Collaborator) commented Apr 4, 2017

> thoughts? Is this a good idea?

This sounds slick. I'm not 100% sure I understand your example though: is the idea to make a decorator that can wrap simple functions as generators? Or as streamers?

@ejhumphrey (Collaborator, Author)

so here are a few thoughts...

after mulling over other issues (namely #75), I think transforms would process arbitrary data streams, i.e. an "online" pipeline. So, yea, this decorator would make a function iterable.

The decorator is one possible way of abstracting iteration from processing ... otherwise every transform is going to have the same for x in stream: yield fx(x) block. This could also be achieved with a class for stateful processing, or if you needed to have a transform upstream of a mux.

This could also help solve some statistics problems I've had in the past, e.g. count class occurrence in the data stream for likelihood scaling.
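
For instance, a stateful transform could be a small class whose __call__ is a generator; here's a minimal sketch (names hypothetical, not an existing pescador API) that counts label occurrences while passing samples through unchanged:

class ClassCounter(object):
    """Counts class occurrences in a stream, e.g. for likelihood scaling."""
    def __init__(self, key='y_true'):
        self.key = key
        self.counts = {}

    def __call__(self, stream):
        for sample in stream:
            label = int(sample[self.key])
            self.counts[label] = self.counts.get(label, 0) + 1
            yield sample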

A note on ordering: I think the party line on architecting streams is as follows (rough sketch after the list):

  1. Define a data sampler
  2. Build Streamer seeds over samplers
  3. Multiplex Streamer seeds
  4. Apply transforms to mux'ed stream
  5. Stick it behind a ZMQ interface
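
A rough sketch of that ordering (treating the pescador names — Streamer, Mux, buffer_batch, ZMQStreamer — as illustrative, and sampler, npz_paths, newaxis as placeholders):

def sampler(npz_path):
    # 1. a data sampler over one source
    data = np.load(npz_path)
    while True:
        i = np.random.randint(len(data['X']))
        yield {'X': data['X'][i], 'Y': data['Y'][i]}

# 2. Streamer seeds over samplers
seeds = [pescador.Streamer(sampler, path) for path in npz_paths]

# 3. multiplex the seeds
mux = pescador.Mux(seeds, k=10)

# 4. apply transforms to the mux'ed stream
batches = pescador.buffer_batch(newaxis(mux(max_iter=1000), 'X', 'Y'), 32)

# 5. stick the whole thing behind a ZMQStreamer for parallelism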

@bmcfee (Collaborator) commented Apr 5, 2017

This basically sounds like map (mapcar if you're nasty), except without the target iterable of the map.

What I'm not quite getting is how this would shake out in practice. Where do the inputs to the function/streamer come from?

Maybe it would help if you could mock up an example in the syntax that you have in mind?

ejhumphrey self-assigned this Apr 9, 2017

@ejhumphrey (Collaborator, Author) commented Apr 9, 2017

note: motivated by this PR comment here, so that it'll be easier to find in the future (rather than bury it in a PR thread).

Definition: A transform is an iterable function that transforms a data stream on an item-by-item basis.

One piece of functionality I think is ripe for a stream transform is the tuples functionality of a Streamer. The rationale for this is two-fold:

  • I've previously argued that Streamers should transparently yield data streams (as convention) to help avoid situations where data is implicitly (and / or unexpectedly) modified, i.e. idempotence is nice.
  • The tuples functionality can only work if the data stream owned by the Streamer consists of data, i.e. dicts of key: np.ndarray pairs. To keep the tuples method in the interface of Streamer, we would need to implement dynamic type checking or throw errors on bad data streams. This feels like an anti-pattern.

But good news! We could implement tuples as a transform:

def tuples(stream, keys):
    for data in stream:
        yield tuple(data[key] for key in keys)

Which could be used as follows.

def data_generator():
    yield {'X': np.array([1]), 'Y': np.array([5])}

stream = pescador.Streamer(data_generator)
for data in tuples(stream(max_iter=20), keys=['Y', 'X']):
    print(data)  # Displays (array([5]), array([1]))

Added bonus, this resolves the kwarg parsing (and possible conflicts) currently handled by the Streamer.tuples method.

starting to be clearer @bmcfee ?

@bmcfee (Collaborator) commented Apr 9, 2017

PEDANTRY ALERT @ejhumphrey: the usual terminology for this kind of operation is map ... maybe things would be clearer to me if we tried to stick to common functional programming practices.

I don't understand what the problem is with keeping the tuples method on Streamer, though. It would only ever be called by the user; none of the internal methods would ever use it, for exactly the reasons you describe above.

OTOH, I guess I'm okay with your proposed example. The question-mark for me has been how tuples would interact with buffering. But, I guess buffer can easily become a streamer map (just with slightly different semantics than other maps that generate data), so you could have something like

raw_stream = pescador.Streamer(data_generator)
stream = pescador.Streamer(buffer, raw_stream, n=3)

for data in tuples(stream, keys=['X', 'Y']):
    print(data)    # Displays (array([[1],[1],[1]]), array([[5],[5],[5]]))

@ejhumphrey (Collaborator, Author)

yes! it's a map! My experience with truly functional languages is academic (aware of but haven't used), and my knowledge falls somewhere between rusty and spotty. Consider it successful pedantry in the spirit of (re)education, and I thank you for your patience.

Now, to the matter at hand, working backwards from your two comments:

re: OTOH...
yes, I'm proposing that tuples and buffer be map operations (similar to enumerate and zip). The way I was imagining this is like the following:

def data_generator():
    yield {'X': np.array([1]), 'Y': np.array([5])}

def tuples(stream, keys):
    for data in stream:
        yield tuple(data[key] for key in keys)

def buffer_data(stream, buffer_size):
    buff = []
    for data in stream:
        buff.append(data)
        if len(buff) == buffer_size:
            yield {key: np.array([x[key] for x in buff])
                   for key in data}
            buff = []
            
stream = pescador.Streamer(data_generator)
batches = buffer_data(stream, buffer_size=3)

for data in tuples(batches, keys=['Y', 'X']):
    print(data)    # Displays (array([[5],[5],[5]]), array([[1],[1],[1]]))

Reflecting for a second, I think the reason that I've advocated for a different idiom for map operations is two-fold:

  • I don't think that those functions (e.g. tuples, buffer_data) need the core functionality that a Streamer provides (rebooting generator functions / iterables). I say this in the spirit of "do one thing well" / encapsulation.
  • A map is fundamentally a generator function: it should only take iterables, with any args / kwargs controlling its behavior.

Interestingly, the Streamer(buffer, streamer, buffer_size=3) example you give satisfies both points: (a) buffer doesn't need to know about Streamer functionality, but you can get that functionality if you need it by combining them, and (b) Maps, being generator functions, are compatible with Streamers by design.

This is elegant / makes sense, so I 👍 .

re: the problem of keeping tuples in Streamer

Consider the following:

values = [0, 1, 2]
streamer = pescador.Streamer(values)
for x in streamer.tuples(0):
    print(x)

Which would raise something like IndexError, int has no getitem method or whatever. Yes, this is a dumb example; but I use it to demonstrate that this method is invalid for an otherwise perfectly fine streamer. This feels wrong to me from an OOP / class design perspective, since the user can put the class instance in a potential failure mode without knowing it. I couldn't quickly find a reference to support my claim, but would expect there to be some guideline like:

A successfully initialized instance should be able to call all of its methods.

That said, I'm willing to concede that this is more LBYL than EAFP, and if we leave Streamer.tuples I'd want a more informative error about what "data" objects are and how the user effed up.
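
For what it's worth, if Streamer.tuples stays, the more informative error could look something like this (a sketch only, assuming a generate() method on Streamer; not the current implementation):

def tuples(self, *keys):
    # fail with a descriptive message rather than a bare KeyError / TypeError
    for data in self.generate():
        if not isinstance(data, dict):
            raise TypeError(
                'Streamer.tuples expects the stream to yield dicts of '
                'key: np.ndarray pairs, but got {}'.format(type(data)))
        yield tuple(data[key] for key in keys)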

@bmcfee (Collaborator) commented Apr 10, 2017

> I don't think that those functions (e.g. tuples, buffer_data) need the core functionality that a Streamer provides (rebooting generator functions / iterables). I say this in the spirit of "do one thing well" / encapsulation.

Countering this, I could imagine situations in which you'd want to multiplex over batches. But you could easily do that if buffer is a generator that can be wrapped in a Streamer. It's just a question of how cumbersome it is to pull that off.
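
For concreteness, muxing over batches might look something like this (a sketch; source_streamers is hypothetical, buffer_data is the generator sketched above, and the Mux call is illustrative):

buffered_seeds = [pescador.Streamer(buffer_data, s, 32)
                  for s in source_streamers]
batch_mux = pescador.Mux(buffered_seeds, k=4)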

> Which would raise something like IndexError, int has no getitem method or whatever. Yes, this is a dumb example;

... yeah, I'm not buying it. 0 is a valid dictionary key, so there's no reason it would know to fail until you go to access the data dict.

> This feels wrong to me from an OOP / class design perspective, since the user can put the class instance in a potential failure mode without knowing it.

Just because an object is constructed well does not mean that any method call is fair game. You can always throw garbage into any method that accepts parameters.

To be clear: I think I'm on board with moving tuples out to its own function, but this isn't a great argument.

Actually, I'd like to expand tuples a little to support tuples of lists for keras's multi-input/output mode. This is easy though.

@ejhumphrey (Collaborator, Author)

> (re: Streamer(buffer...)) It's just a question of how cumbersome it is to pull that off.

After this deprecation PR is out the door, I suppose I'll find out 😄

> To be clear: I think I'm on board with moving tuples out to its own function, but this isn't a great argument.

Fair enough.

> Actually, I'd like to expand tuples a little to support tuples of lists for keras's multi-input/output mode. This is easy though.

I'm not sure what you mean at present, but I look forward to learning about it later.

@bmcfee (Collaborator) commented Apr 11, 2017

> After this deprecation PR is out the door, I suppose I'll find out

We could figure this out now. My main concern here is that I want buffering to play nicely with the zmqstreamer, given that it's a high-latency operation.

raw_stream = [some nonsense]
buf_stream = Streamer(buffer_samples, 32, raw_stream)
parallel_stream = ZMQStreamer(buf_stream)

for batch in parallel_stream:
    do some stuff

That doesn't seem too bad.

> I'm not sure what you mean at present, but I look forward to learning about it later.

Oh, basically just that if you have multiple inputs or outputs, they get packed in the tuple as ([input1, input2, ...], [output1, output2, ...]). I agree that this is "hella dum", but that's how it works. The tuples function can support this pretty easily though, eg:

train_gen = tuples(streamer, inputs=['X1', 'X2'], outputs=['Y1', 'Y2', 'Y3'])
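
A possible sketch of that extension (not a committed API; keys are illustrative):

def tuples(stream, inputs, outputs):
    for data in stream:
        yield ([data[key] for key in inputs],
               [data[key] for key in outputs])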

@ejhumphrey (Collaborator, Author) commented Apr 11, 2017

Three quick things:

I agree, that code-block seems fine. I guess I was proposing that we modify / implement buffer_data (or whatever) in a subsequent PR; however, since that change would deprecate some functionality, sure, I can do it now.

Multiple I/O -- I get it.

> given that it's a high-latency operation.

fwiw, I suspect buffering is high-latency at the moment because of how __split_batch is implemented, not the inherent act of buffering. That is mostly speculation though, and I look forward to benchmarking it.

@bmcfee (Collaborator) commented Apr 11, 2017

> I guess I was proposing that we modify / implement buffer_data (or whatever) in a subsequent PR

Eh, I think it's easiest to leave the current stuff in place, deprecate it later, and replace it with a similarly named function that behaves the way you want.

> I suspect buffering is high-latency at the moment because of how __split_batch is implemented

I'm not sure. There's a lot of weird control flow in there, but I think the biggest hit comes from all the memory copies.

@ejhumphrey (Collaborator, Author)

😳 what if I've already done it...?

@ejhumphrey (Collaborator, Author)

closed with #88
