Non-stochastic sampling and literature #104

Closed · stefan-balke opened this issue Oct 20, 2017 · 18 comments

@stefan-balke

Hi pescadores,

first of all: thanks for this nice package!

I have been using it for some DNNs now and it seems to work. However, two questions remain:

  • Can you point me to some literature where this stochastic sampling is explained or formalized, or is it more of an engineering thing?
  • Is it possible to configure the muxer so that it guarantees to present all samples exactly once within a number of epochs? For instance, I have 1000 samples in 10 files, and 5 samples constitute a mini-batch. An epoch ends after 10 mini-batches, i.e., after 50 samples. This means I want to have seen all samples after 20 epochs.

Maybe I haven't fully understood the muxing behaviour...but I want to! :)

Thanks
Stefan

@bmcfee (Collaborator) commented Oct 20, 2017

Can you point me to some literature where this stochastic sampling is explained or formalized, or is it more of an engineering thing?

Not really -- at one point, I had started on a write-up analyzing the Poisson scheduler, but it got nasty quite quickly. (It might be easier with a negative binomial distribution instead of Poisson; left as future work.)

Is it possible to configure the muxer so that it guarantees to present all samples exactly once within a number of epochs?

Not currently, but it is being implemented as part of #96.

@stefan-balke (Author)

Okay, that PR is closed. Is it superseded by #103?

@cjacoby (Collaborator) commented Oct 20, 2017 via email

@stefan-balke (Author)

@cjacoby thanks for the effort.

Can you explain how the above-mentioned case is handled in your PR? I think ShuffledMux is the closest fit. In my head it works like this:

  • We have 500 files, with 100 streams open at a time.
  • ShuffledMux samples until streams are empty, without reopening the empty ones.
  • This continues until all files have been opened as streams and are now empty.
  • Then, somehow, a global reset happens and everything starts over.

Is this covered?

@cjacoby (Collaborator) commented Oct 28, 2017

Okay, let's see.

So, the behavior you describe would be (I think) best approximated by the following:

Use a PoissonMux (ShuffledMux can only sample N streams, where N == n_streams; PoissonMux allows you to choose N) in 'exhaustive' mode, i.e.:

import pescador

def file_slicer(x):
    """Generate samples from a single file.

    Parameters
    ----------
    x : str
        Filename

    Yields
    ------
    sample
    """
    # TODO: your own loading/slicing logic goes here; load the file
    # and yield one sample (e.g., a dict of arrays) at a time.

streams = [pescador.Streamer(file_slicer, x) for x in your_files]
file_mux = pescador.mux.PoissonMux(streams, k=100, rate=None, mode="exhaustive")

# Now, you would do this:
while True:
    for sample in file_mux.iterate():
        pass  # do your thing here
    # the loop above ends when all streams are exhausted,
    # but when you loop back around, all streams will be reset.

But another way you could run it is like this:

for sample in file_mux.cycle():
    pass  # do your thing

cycle() is a Streamer method that just causes infinite generation, even when a StopIteration is raised. However, if you want some special behavior (like counting epochs or something), you currently have to do that yourself. We will probably add some kind of callback functionality in the future that would let you do this directly with the Streamers, but that's out of scope for this PR.
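
For example, a minimal (untested) sketch of doing that epoch bookkeeping yourself; n_per_epoch and train_step here are placeholders, not pescador API:

n_per_epoch = 1000  # assumed total number of samples in your dataset
epoch = 0

# cycle() generates samples forever; we count them by hand and declare
# an epoch every time we've drawn a dataset's worth of samples.
for i, sample in enumerate(file_mux.cycle(), start=1):
    train_step(sample)  # placeholder for your training update
    if i % n_per_epoch == 0:
        epoch += 1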

Does that help / make sense?

BTW, you can do the same thing right now in the 1.1 release (before #103 is complete) with the following Mux settings:

pescador.Mux(streams, k=100, rate=None, with_replacement=False, revive=False)

Note: if you don't use rate=None, PoissonMux (and the old Mux) defaults rate to 256, meaning a stream may not be completely used up before the mux switches to a new one. At rate=256, the mux draws rng.poisson(lam=256) samples from an active streamer and then does not use it again. With rate=None, it uses up all the samples from a streamer before choosing a new one.
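
To make the rate behaviour concrete, here is a tiny (untested) illustration using numpy directly; the printed budgets are just for intuition and are not drawn from pescador itself:

import numpy as np

# Each newly-activated streamer gets a sample budget of roughly
# rng.poisson(lam=rate); after spending it, the streamer is retired.
rng = np.random.RandomState(20171028)
budgets = rng.poisson(lam=256, size=5)
print(budgets)  # five hypothetical per-streamer budgets, each near 256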

@stefan-balke (Author)

Hey,

thanks for the explanations.

I think doing

pescador.Mux(streams, k=100, rate=None, with_replacement=False, revive=False)

is perfect for computing the validation error.

In training, I think doing

pescador.Mux(streams, k=100, rate=None, with_replacement=False, revive=True)

could be a thing.

The only "missing" feature now is that someone has to do the bookkeeping to check which streams are left in the current epoch (an epoch now being defined as one cycle through the whole dataset). What happens at the moment is that a stream might become empty, get put back in line, and by chance be activated again as the next stream to open (though this is very unlikely).
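
For now I could do that bookkeeping by hand, e.g. with a rough (untested) sketch like this, assuming my file_slicer adds a 'source' key to each sample (not something pescador provides):

mux = pescador.Mux(streams, k=100, rate=None,
                   with_replacement=False, revive=True)

seen_sources = set()
for sample in mux.iterate():
    seen_sources.add(sample['source'])  # assumed key added by file_slicer
    if len(seen_sources) == len(streams):
        # every file has contributed at least one sample; treat this as
        # an (approximate) epoch boundary and reset the bookkeeping
        seen_sources.clear()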

Why am I so pedantic about these things? I think the behaviour of the muxer should be as transparent as possible. Considering DNN research, publications are already using this package, and I fear that these different sampling schemes may introduce new effects. This in turn means we may end up discussing results that only arise from the sampling scheme. It is another kind of "hyperparameter" we somehow have to address, or at least be aware of.

However, I like the package and I'm using it because it gives me control over my data sampling; e.g., if I use the Keras sampler, I lose that control (@faroit pointed out that it does not reshuffle the mini-batches in each epoch...).

Thanks a lot.

P.S.: The rate=None behaviour could be documented more clearly. At the moment it says:

If None, sample infinitely from each stream.

To me, "infinitely" suggests that a stream is reset after it is empty, but that should be controlled via revive=True. Maybe the docs could be clearer about that. I can do a PR for that if wanted.

@ejhumphrey (Collaborator) commented Oct 29, 2017

Considering DNN research, publications are already using this package, and I fear that these different sampling schemes may introduce new effects. This in turn means we may end up discussing results that only arise from the sampling scheme. It is another kind of "hyperparameter" we somehow have to address, or at least be aware of.

💯 agreed! Team pescadores definitely supports such pedantry, e.g. RNG seeding or having good mechanisms for keeping an audit trail for sample presentation order (see #85).

I'd also point out (lament?) that, despite every meaningful neural network result in the last 20+ years leveraging stochastic gradient descent, only a small fraction of that research acknowledges the role that sample ordering has on the resulting models (and a smaller fraction still does something about it, i.e., curriculum learning). I think more emphasis should be placed on how data are sampled, regardless of what tooling one uses for training. This probably gets a little muddy when doing things like asynchronous SGD / pooling gradients / other distributed madness ... but one problem at a time 😄.

Bonus thought: personally, I'm not necessarily convinced that an "epoch" is a meaningful measurement of progress during training, but that's maybe a different discussion.

@cjacoby (Collaborator) commented Oct 30, 2017 via email

@bmcfee (Collaborator) commented Dec 15, 2017

Circling back on this after merging #103 --

I think what you actually need for deterministic (fixed-sample) validation is either a roundrobin or a chain mux in cycle mode. As long as the individual streamers are deterministic, this will ensure that you get the same data each time through the cycle. You will need to be careful to enforce three things:

  • Each streamer is deterministic, so that samples come out in the same order.
  • If using a chain mux, the streamers have to be finite, with k[i] samples generated by streamer i.
  • The number of samples in the validation sweep is known ahead of time. For a chain mux, this should be K = sum_i k[i]. For roundrobin, you'd be okay with K as any integer multiple of n_streamers, though it won't be exactly reproducible unless you use the chain mux formula.

If you time this correctly, then the mux will be back at the beginning position when the next validation call happens, and you'll get the same sequence out.
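
As a rough (untested) sketch of the chain-mux variant, assuming a ChainMux with a 'cycle' mode as discussed in #103, deterministic streamers, and placeholder evaluate / validation_files names:

val_streams = [pescador.Streamer(file_slicer, f) for f in validation_files]
val_mux = pescador.mux.ChainMux(val_streams, mode='cycle')

K = 500  # total validation samples: K = sum_i k[i], known ahead of time
val_gen = val_mux.iterate()

def validate(model):
    # draw exactly K samples, so the mux is back at its starting position
    # when the next validation call happens
    scores = [evaluate(model, next(val_gen)) for _ in range(K)]
    return sum(scores) / K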

We should add these examples to the documentation gallery.

@stefan-balke (Author)

Thanks for the explanations. I will look into this soon. Maybe I can contribute first drafts for some docs, since I'll have to read up on a few things anyway, I guess :)

@stefan-balke (Author)

Maybe we can also take a simple classification example such as MNIST in a Jupyter notebook and check the influence of different sampling schemes on the accuracy.

@bmcfee (Collaborator) commented Dec 15, 2017

Maybe we can also take a simple classification example such as MNIST in a Jupyter notebook and check the influence of different sampling schemes on the accuracy.

Sure, although I think it's simpler and cleaner to look directly at the statistics of the samples, like we're doing in a few of the unit tests now. Imagine a list of streamers [1, 2, ..., n] where each streamer i just generates its index number i. For validation purposes, the sequence order within an epoch shouldn't matter, since you're averaging the results anyway and not updating the model sequentially. So what you should care about is whether the distribution of samples seen under some sampling strategy X matches what you would expect under an ideal scheme that generates all the data at once.

If it does, that's enough to imply that any function (e.g., loss) you compute on that sample will match its expectation under the ideal scheme. That's a stronger argument than showing that it works for MNIST, where just about anything will work.
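
For concreteness, a rough (untested) sketch of that check; the indexer streamers and the particular PoissonMux parameters are illustrative assumptions:

import collections
import pescador

def indexer(i, n_samples=100):
    # streamer i just generates its own index number i
    for _ in range(n_samples):
        yield i

n = 10
streamers = [pescador.Streamer(indexer, i) for i in range(n)]
mux = pescador.mux.PoissonMux(streamers, k=3, rate=16, mode="exhaustive")

# if the mux mixes well, even a short prefix of the output should be
# close to uniform across the n indices
counts = collections.Counter(mux.iterate(max_iter=200))
print(counts)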

BUT, we should definitely include a concrete example of how to do it properly, and MNIST is as good a demo as any.

@bmcfee (Collaborator) commented Jan 19, 2018

I've added a couple of advanced examples in #114 (rtd) that should help with this issue.

More generally, returning to the two questions that sparked this thread:

Can you point me to some literature where this stochastic sampling is explained or formalized, or is it more of an engineering thing?

Once 2.0 is out, I'd like to write a paper formalizing what we did in #32 and providing some empirical measurements of the PoissonMux's output distribution under different regimes (active vs. rate vs. n_streams). I think this is out of scope for the core pescador project / documentation, but we'll certainly link back to it once it's out.

Is it possible to configure the muxer so that it guarantees to present all samples exactly once within a number of epochs? For instance, I have 1000 samples in 10 files, and 5 samples constitute a mini-batch. An epoch ends after 10 mini-batches, i.e., after 50 samples. This means I want to have seen all samples after 20 epochs.

I think this example covers the validation case using ChainMux. Training is a little trickier due to non-determinism, but you can use either PoissonMux or ShuffledMux with the cycle iterator to get this effect. Many caveats apply with respect to the rate parameter, though, as allowing different numbers of samples per streamer will affect the period of the sampler. If you just use rate=None and self-limit each streamer's generator, then this ought to work fine, as sketched below.
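
A minimal (untested) sketch of that self-limiting idea; n_per_file, train_files, and the bare ShuffledMux call are illustrative assumptions:

import itertools
import pescador

def limited_slicer(filename, n_per_file=100):
    # cap each file's generator at a fixed number of samples, so every
    # streamer contributes the same amount per cycle
    yield from itertools.islice(file_slicer(filename), n_per_file)

streams = [pescador.Streamer(limited_slicer, f) for f in train_files]
train_mux = pescador.mux.ShuffledMux(streams)

for sample in train_mux.cycle():
    pass  # training step goes here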

After I write up the training epoch example, will this issue be sufficiently addressed?

@bmcfee (Collaborator) commented Jan 19, 2018

Side note: I hacked up a quick and dirty experiment measuring the PoissonMux's properties as a function of (n, k, rate). The notebook is here, and here are the result plots:

[figure: entropy convergence plots, one panel per (n, k, rate) configuration]

(x-axis is iterations, y-axis is entropy of the sample stream indices over time)

Take-aways:

  • n=k behaves like a full shuffled mux (I'm using mode=single_active), and provides a bound on the convergence of smaller active sets, which can only be slower.
  • Any curves left of the vertical iter=n line are burn-in noise and should be ignored.
  • Smaller rates are better (as expected) but more expensive due to context switches.
  • The rate gap narrows as k increases (as expected).
  • These results are only for a single run per configuration, so take them with a grain of salt.
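
For reference, a rough (untested) sketch of the entropy measurement itself, assuming index-yielding streamers as in the earlier sketch; this is not the actual notebook code:

import numpy as np
import scipy.stats

def index_entropy(mux, n_streams, n_iter=5000):
    # running entropy of the empirical distribution of sampled stream indices
    counts = np.zeros(n_streams)
    entropies = []
    for idx in mux.iterate(max_iter=n_iter):
        counts[idx] += 1
        entropies.append(scipy.stats.entropy(counts / counts.sum()))
    return entropies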

@bmcfee (Collaborator) commented Jan 21, 2018

Here's an updated plot, running each configuration for 25 trials and showing the mean ± std across trials.

[figure: updated entropy convergence plots, mean ± std over 25 trials per configuration]

Gist notebook has been updated as well.

@stefan-balke (Author)

Super cool, thanks for these experiments. Makes it much easier to get an intuition!

@bmcfee (Collaborator) commented Jan 24, 2018

@stefan-balke do you think this issue is now sufficiently resolved?

@stefan-balke (Author)

Yes, thanks, closing this out!
