Non-stochastic sampling and literature #104

Closed · stefan-balke opened this issue Oct 20, 2017 · 18 comments

@stefan-balke

Hi pescadores,

first of all: thanks for this nice package!

I have been using it for some DNNs now and it seems to work. However, two questions remain:

  • Can you point me to some literature where this stochastic sampling is explained or formalized, or is it more of an engineering thing?
  • Is it possible to configure the muxer so that it guarantees to present all samples exactly once within a number of epochs? For instance, I have 1000 samples in 10 files, and 5 samples constitute a mini-batch. An epoch ends after 10 mini-batches, i.e., after 50 samples. This means I want to have seen all samples after 20 epochs.

Maybe I haven't fully understood the muxing behaviour...but I want to! :)

Thanks
Stefan

@bmcfee (Collaborator) commented Oct 20, 2017

Can you point me to some literature where this stochastic sampling is explained or formalized, or is it more of an engineering thing?

Not really -- at one point, I had started on a write-up analyzing the Poisson scheduler, but it got nasty quite quickly. (It might be easier with a negative binomial distribution instead of Poisson; left as future work.)

Is it possible to configure the muxer so that it guarantees to present all samples exactly once within a number of epochs?

Not currently, but it is being implemented as part of #96.

@stefan-balke (Author)

Okay, that PR is closed. Is it superseded by #103?

@cjacoby (Collaborator) commented Oct 20, 2017 via email

@stefan-balke (Author)

@cjacoby thanks for the effort.

Can you explain how the above-mentioned case is handled in your PR? I think ShuffledMux is the closest fit. In my head it works like this:

  • We have 500 files, with 100 streams open at a time.
  • ShuffledMux samples until streams are empty, without reopening the empty ones.
  • This continues until all files have been opened as streams and are now empty.
  • Then, somehow, a global reset happens and everything starts over.

Is this covered?

@cjacoby (Collaborator) commented Oct 28, 2017

Okay, let's see.

So, the behavior you describe would be (I think) best approximated by the following:

Use a PoissonMux (ShuffledMux can only sample N streams, where N == n_streams; PoissonMux allows you to choose N) in 'exhaustive' mode, i.e.:

import pescador

def file_slicer(x):
    """Generate samples from a single file.

    Parameters
    ----------
    x : str
        Filename

    Yields
    ------
    sample
    """
    # TODO: your own loading/slicing logic goes here; load the file
    # and yield one sample (e.g., a dict of arrays) at a time.

streams = [pescador.Streamer(file_slicer, x) for x in your_files]
file_mux = pescador.mux.PoissonMux(streams, k=100, rate=None, mode="exhaustive")

# Now, you would do this:
while True:
    for sample in file_mux.iterate():
        pass  # do your thing here
    # the loop above ends when all streams are exhausted,
    # but when you loop back around, all streams will be reset.

But another way you could run it is like this:

for sample in file_mux.cycle():
    pass  # do your thing

cycle() is a Streamer method that just causes infinite generation, even when a StopIteration is raised. However, if you want some special behavior (like counting epochs or something), you currently have to do that yourself. We will probably add some kind of callback functionality in the future that would let you do this directly with the Streamers, but that's out of scope for this PR.
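
For example, a minimal (untested) sketch of doing that epoch bookkeeping yourself; n_per_epoch and train_step here are placeholders, not pescador API:

n_per_epoch = 1000  # assumed total number of samples in your dataset
epoch = 0

# cycle() generates samples forever; we count them by hand and declare
# an epoch every time we've drawn a dataset's worth of samples.
for i, sample in enumerate(file_mux.cycle(), start=1):
    train_step(sample)  # placeholder for your training update
    if i % n_per_epoch == 0:
        epoch += 1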

Does that help / make sense?

BTW, you can do the same thing right now in the 1.1 release (before #103 is complete) with the following Mux settings:

pescador.Mux(streams, k=100, rate=None, with_replacement=False, revive=False)

Note: if you don't use rate=None, PoissonMux (and the old Mux) defaults rate to 256, meaning a stream may not be completely used up before the mux switches to a new one. At rate=256, the mux draws rng.poisson(lam=256) samples from an active streamer and then does not use it again. With rate=None, it uses up all the samples from a streamer before choosing a new one.
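
To make the rate behaviour concrete, here is a tiny (untested) illustration using numpy directly; the printed budgets are just for intuition and are not drawn from pescador itself:

import numpy as np

# Each newly-activated streamer gets a sample budget of roughly
# rng.poisson(lam=rate); after spending it, the streamer is retired.
rng = np.random.RandomState(20171028)
budgets = rng.poisson(lam=256, size=5)
print(budgets)  # five hypothetical per-streamer budgets, each near 256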

@stefan-balke (Author)

Hey,

thanks for the explanations.

I think doing

pescador.Mux(streams, k=100, rate=None, with_replacement=False, revive=False)

is perfect for computing the validation error.

In training, I think doing

pescador.Mux(streams, k=100, rate=None, with_replacement=False, revive=True)

could be a thing.

The only "missing" feature now is that someone has to do the bookkeeping to check which streams are left in the current epoch (an epoch now being defined as one cycle through the whole dataset). What happens at the moment is that a stream might become empty, get put back in line, and by chance be activated again as the next stream to open (though this is very unlikely).
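
For now I could do that bookkeeping by hand, e.g. with a rough (untested) sketch like this, assuming my file_slicer adds a 'source' key to each sample (not something pescador provides):

mux = pescador.Mux(streams, k=100, rate=None,
                   with_replacement=False, revive=True)

seen_sources = set()
for sample in mux.iterate():
    seen_sources.add(sample['source'])  # assumed key added by file_slicer
    if len(seen_sources) == len(streams):
        # every file has contributed at least one sample; treat this as
        # an (approximate) epoch boundary and reset the bookkeeping
        seen_sources.clear()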

Why am I so pedantic about these things? I think the behaviour of the muxer should be as transparent as possible. Considering DNN research, publications are already using this package, and I fear that these different sampling schemes may introduce new effects. This in turn means we may end up discussing results that only arise from the sampling scheme. It is another kind of "hyperparameter" we somehow have to address, or at least be aware of.

However, I like the package and I'm using it because it gives me control over my data sampling; e.g., if I use the Keras sampler, I lose that control (@faroit pointed out that it does not reshuffle the mini-batches in each epoch...).

Thanks a lot.

P.S.: The rate=None behaviour could be documented more clearly. At the moment it says:

If None, sample infinitely from each stream.

To me, "infinitely" suggests that a stream is reset after it is empty, but that should be controlled via revive=True. Maybe the docs could be clearer about that. I can do a PR for that if wanted.

@ejhumphrey (Collaborator) commented Oct 29, 2017

Considering DNN research, publications are already using this package, and I fear that these different sampling schemes may introduce new effects. This in turn means we may end up discussing results that only arise from the sampling scheme. It is another kind of "hyperparameter" we somehow have to address, or at least be aware of.

💯 agreed! Team pescadores definitely supports such pedantry, e.g. RNG seeding or having good mechanisms for keeping an audit trail for sample presentation order (see #85).

I'd also point out (lament?) that, despite every meaningful neural network result in the last 20+ years leveraging stochastic gradient descent, only a small fraction of that research acknowledges the role that sample ordering has on the resulting models (and a smaller fraction still does something about it, i.e., curriculum learning). I think more emphasis should be placed on how data are sampled, regardless of what tooling one uses for training. This probably gets a little muddy when doing things like asynchronous SGD / pooling gradients / other distributed madness ... but one problem at a time 😄.

Bonus thought: personally, I'm not necessarily convinced that an "epoch" is a meaningful measurement of progress during training, but that's maybe a different discussion.

@cjacoby (Collaborator) commented Oct 30, 2017 via email

@bmcfee (Collaborator) commented Dec 15, 2017

Circling back on this after merging #103 --

I think what you actually need for deterministic (fixed-sample) validation is either a roundrobin or a chain mux in cycle mode. As long as the individual streamers are deterministic, this will ensure that you get the same data each time through the cycle. You will need to be careful to enforce three things:

  • Each streamer is deterministic, so that samples come out in the same order.
  • If using a chain mux, the streamers have to be finite, with k[i] samples generated by streamer i.
  • The number of samples in the validation sweep is known ahead of time. For a chain mux, this should be K = sum_i k[i]. For roundrobin, you'd be okay with K as any integer multiple of n_streamers, though it won't be exactly reproducible unless you use the chain mux formula.

If you time this correctly, then the mux will be back at the beginning position when the next validation call happens, and you'll get the same sequence out.
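
As a rough (untested) sketch of the chain-mux variant, assuming a ChainMux with a 'cycle' mode as discussed in #103, deterministic streamers, and placeholder evaluate / validation_files names:

val_streams = [pescador.Streamer(file_slicer, f) for f in validation_files]
val_mux = pescador.mux.ChainMux(val_streams, mode='cycle')

K = 500  # total validation samples: K = sum_i k[i], known ahead of time
val_gen = val_mux.iterate()

def validate(model):
    # draw exactly K samples, so the mux is back at its starting position
    # when the next validation call happens
    scores = [evaluate(model, next(val_gen)) for _ in range(K)]
    return sum(scores) / K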

We should add these examples to the documentation gallery.

@stefan-balke (Author)

Thanks for the explanations. I will look into this soon. Maybe I can contribute first drafts for some docs, since I'll have to read up on a few things anyway, I guess :)

@stefan-balke (Author)

Maybe we can also take a simple classification example such as MNIST in a Jupyter notebook and check the influence of different sampling schemes on the accuracy.

@bmcfee (Collaborator) commented Dec 15, 2017

Maybe we can also take a simple classification example such as MNIST in a Jupyter notebook and check the influence of different sampling schemes on the accuracy.

Sure, although I think it's simpler and cleaner to look directly at the statistics of the samples, like we're doing in a few of the unit tests now. Imagine a list of streamers [1, 2, ..., n] where each streamer i just generates its index number i. For validation purposes, the sequence order within an epoch shouldn't matter, since you're averaging the results anyway and not updating the model sequentially. So what you should care about is whether the distribution of samples seen under some sampling strategy X matches what you would expect under an ideal scheme that generates all the data at once.

If it does, that's enough to imply that any function (e.g., loss) you compute on that sample will match its expectation under the ideal scheme. That's a stronger argument than showing that it works for MNIST, where just about anything will work.
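
For concreteness, a rough (untested) sketch of that check; the indexer streamers and the particular PoissonMux parameters are illustrative assumptions:

import collections
import pescador

def indexer(i, n_samples=100):
    # streamer i just generates its own index number i
    for _ in range(n_samples):
        yield i

n = 10
streamers = [pescador.Streamer(indexer, i) for i in range(n)]
mux = pescador.mux.PoissonMux(streamers, k=3, rate=16, mode="exhaustive")

# if the mux mixes well, even a short prefix of the output should be
# close to uniform across the n indices
counts = collections.Counter(mux.iterate(max_iter=200))
print(counts)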

BUT, we should definitely include a concrete example of how to do it properly, and MNIST is as good a demo as any.

@bmcfee (Collaborator) commented Jan 19, 2018

I've added a couple of advanced examples in #114 (rtd) that should help with this issue.

More generally, returning to the two questions that sparked this thread:

Can you point me to some literature where this stochastic sampling is explained or formalized, or is it more of an engineering thing?

Once 2.0 is out, I'd like to write a paper formalizing what we did in #32 and providing some empirical measurements of the PoissonMux's output distribution under different regimes (active vs. rate vs. n_streams). I think this is out of scope for the core pescador project / documentation, but we'll certainly link back to it once it's out.

Is it possible to configure the muxer so that it guarantees to present all samples exactly once within a number of epochs? For instance, I have 1000 samples in 10 files, and 5 samples constitute a mini-batch. An epoch ends after 10 mini-batches, i.e., after 50 samples. This means I want to have seen all samples after 20 epochs.

I think this example covers the validation case using ChainMux. Training is a little trickier due to non-determinism, but you can use either PoissonMux or ShuffledMux with the cycle iterator to get this effect. Many caveats apply with respect to the rate parameter, though, as allowing different numbers of samples per streamer will affect the period of the sampler. If you just use rate=None and self-limit each streamer's generator, then this ought to work fine, as sketched below.
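
A minimal (untested) sketch of that self-limiting idea; n_per_file, train_files, and the bare ShuffledMux call are illustrative assumptions:

import itertools
import pescador

def limited_slicer(filename, n_per_file=100):
    # cap each file's generator at a fixed number of samples, so every
    # streamer contributes the same amount per cycle
    yield from itertools.islice(file_slicer(filename), n_per_file)

streams = [pescador.Streamer(limited_slicer, f) for f in train_files]
train_mux = pescador.mux.ShuffledMux(streams)

for sample in train_mux.cycle():
    pass  # training step goes here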

After I write up the training epoch example, will this issue be sufficiently addressed?

@bmcfee (Collaborator) commented Jan 19, 2018

Side note: I hacked up a quick and dirty experiment measuring the PoissonMux's properties as a function of (n, k, rate). The notebook is here, and here are the result plots:

[figure: entropy convergence plots, one panel per (n, k, rate) configuration]

(x-axis is iterations, y-axis is entropy of the sample stream indices over time)

Take-aways:

  • n=k behaves like a full shuffled mux (I'm using mode=single_active), and provides a bound on the convergence of smaller active sets, which can only be slower.
  • Any curves left of the vertical iter=n line are burn-in noise and should be ignored.
  • Smaller rates are better (as expected) but more expensive due to context switches.
  • The rate gap narrows as k increases (as expected).
  • These results are only for a single run per configuration, so take them with a grain of salt.
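
For reference, a rough (untested) sketch of the entropy measurement itself, assuming index-yielding streamers as in the earlier sketch; this is not the actual notebook code:

import numpy as np
import scipy.stats

def index_entropy(mux, n_streams, n_iter=5000):
    # running entropy of the empirical distribution of sampled stream indices
    counts = np.zeros(n_streams)
    entropies = []
    for idx in mux.iterate(max_iter=n_iter):
        counts[idx] += 1
        entropies.append(scipy.stats.entropy(counts / counts.sum()))
    return entropies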

@bmcfee (Collaborator) commented Jan 21, 2018

Here's an updated plot, running each configuration for 25 trials and showing the mean ± std across trials.

[figure: updated entropy convergence plots, mean ± std over 25 trials per configuration]

Gist notebook has been updated as well.

@stefan-balke (Author)

Super cool, thanks for these experiments. Makes it much easier to get an intuition!

@bmcfee (Collaborator) commented Jan 24, 2018

@stefan-balke do you think this issue is now sufficiently resolved?

@stefan-balke (Author)

Yes, thanks, closing this out!
