
Alternate distributions for StochasticMux #148

Closed
bmcfee opened this issue Jul 16, 2019 · 14 comments

Comments

@bmcfee
Collaborator

bmcfee commented Jul 16, 2019

I've been doing a bit of analysis of pescador's sampling behavior, and the Poisson distribution is a bit unwieldy. However, for a specific choice of parameters, drawing n_samples_to_stream from a binomial distribution leads to a clean solution in terms of total replacement rate of streamers.

More generally, I think it would be useful to extend StochasticMux to have a dist= parameter that works as follows:

  • constant: each streamer gets exactly (at most) rate samples.
  • binomial (proposed new default): each streamer gets a binomial draw from Binomial(rate, 1-1/n_active)
  • poisson: current behavior, n_samples ~ 1 + Poisson(rate)
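A minimal sketch of what these modes could look like, assuming uniform weights (the function name and rng handling here are hypothetical, not pescador's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

def n_samples_to_stream(rate, n_active, dist="binomial"):
    """Hypothetical sketch of the proposed dist= modes (not pescador's API)."""
    if dist == "constant":
        # Exactly (at most) `rate` samples per streamer.
        return rate
    if dist == "binomial":
        # Proposed new default: Binomial(rate, 1 - 1/n_active).
        return rng.binomial(rate, 1.0 - 1.0 / n_active)
    if dist == "poisson":
        # Current behavior: n_samples ~ 1 + Poisson(rate).
        return 1 + rng.poisson(rate)
    raise ValueError(f"Unknown dist: {dist}")
```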
@bmcfee bmcfee modified the milestones: 2.1.0, 3.0.0 Jul 16, 2019
@bmcfee
Collaborator Author

bmcfee commented Jul 17, 2019

Quick summary of the analysis: say we have uniform weights over n_active active streamers. Each streamer has a rate limit that's sampled from a binomial distribution

n_samples ~ Binomial(rate * n_active / (n_active - 1), (n_active - 1) / n_active).

  • The expectation of n_samples is rate
  • Pick any active streamer, and let M denote the (random) number of samples we need to draw from the mux before that streamer gets replaced. This has expected value E[M] = rate * n_active.
  • The variance is Var[M] = rate * n_active**2.

These expectations match exactly the case where the number of samples is deterministic: n_samples = rate. The benefit of randomizing n_samples is that you don't get a mass extinction event every rate + K samples; the replacements are less concentrated.
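These claims are easy to sanity-check by simulation. In the sketch below (illustrative only, not pescador code), the parameters are chosen so that the binomial count parameter comes out to an integer:

```python
import numpy as np

rng = np.random.default_rng(0)
rate, n_active = 9, 4
# Binomial(rate * n_active / (n_active - 1), (n_active - 1) / n_active):
# with rate=9, n_active=4 this is Binomial(12, 3/4), so E[n_samples] = 9 = rate.
n = rate * n_active // (n_active - 1)
p = (n_active - 1) / n_active
quota = rng.binomial(n, p, size=200_000)

# M = number of mux draws until a given streamer exhausts its quota; under
# uniform weights, each draw hits that streamer with probability 1/n_active.
# numpy's negative_binomial counts failures, so add the quota back in.
q1 = np.maximum(quota, 1)  # ignore the (measure-zero) quota=0 case
M = q1 + rng.negative_binomial(q1, 1.0 / n_active)

print(quota.mean())  # close to rate = 9
print(M.mean())      # close to rate * n_active = 36
print(M.var())       # close to rate * n_active**2 = 144
```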

In exhaustive mode, we want to know how long until we hit all the streamers. We can think of this as each streamer slot being replaced n_streams/n_active - 1 times, and the expected time cooks down to rate * (n_streams - n_active) / n_active (exactly what you'd hope for).

Finally, we'd want to know what the chance of a streamer not being replaced is. This can be calculated numerically using the negative binomial survival function (which is ugly AF), but we could also be lazy and use some concentration inequalities. If we think of rate * n_active as the average time of an 'episode', then we could ask what the chance of a streamer lasting some b>=1 episodes longer than average. This probability is upper-bounded by 1/(b**2 * rate); so with high probability, every streamer will be replaced quickly and the distribution should not become unbalanced. (Corollary: the approximation for time to exhaustion above should be pretty good.)

@lostanlen
Contributor

Cool!

So is pescador going to be renamed to pascalador then?*

*(https://en.wikipedia.org/wiki/Binomial_distribution#History)

@bmcfee
Collaborator Author

bmcfee commented Jul 17, 2019

So is pescador going to be renamed to pascalador then?*

Negative.

@cjacoby
Collaborator

cjacoby commented Jul 18, 2019

👍 💯

@bmcfee
Collaborator Author

bmcfee commented Jul 19, 2019

Quick update: I implemented this and compared poisson/binomial/constant using the notebook linked from #104. This basically computes the rate at which the muxed distribution approaches the full distribution. The idealized case with n_active = n_total is given in the right-most columns for context.

TL;DR: there's not much difference across the different schemes, but the scalloping effects (#132) are obscured by error regions in these plots. They'll be more pronounced for binomial and constant, and less so for poisson. The randomized offset idea mentioned at the end of #132 could be a nice solution to that, though it is not tested here.

Results:

[Three plots comparing convergence for the poisson, binomial, and constant schemes; the idealized n_active = n_total case appears in the right-most columns.]

@bmcfee
Collaborator Author

bmcfee commented Jan 8, 2020

Would anyone object to me implementing this and pushing out a 2.2 release, rather than waiting for 3.x?

@cjacoby
Collaborator

cjacoby commented Jan 8, 2020 via email

@bmcfee
Collaborator Author

bmcfee commented Jan 8, 2020

One irritating snag here: the binomial distribution explodes on us if n_active=1, because the rate parameter gets scaled up by n_active / (n_active - 1). This is a weird edge case that we never had to worry about before, since the poisson model didn't mix in the active set size.

For n_active > 1, the binomial model has the invariant that the expected number of samples per streamer is always rate, and the variance scales like rate / n_active. So large active sets mean low variance in the number of samples per streamer; small active sets should have high variance in the number of samples per streamer. When n_active=1, maintaining this would require a random variable with mean = variance = rate, eg a poisson variable.
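A quick numerical check of that variance scaling (parameters chosen so the binomial count stays an integer; this is illustrative, not pescador code):

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 12
for n_active in (2, 3, 4):
    # Binomial(rate * n_active / (n_active - 1), (n_active - 1) / n_active)
    n = rate * n_active // (n_active - 1)   # 24, 18, 16: integers by construction
    p = (n_active - 1) / n_active
    draws = rng.binomial(n, p, size=100_000)
    print(n_active, draws.var(), rate / n_active)  # empirical vs rate / n_active
# At n_active = 1, n = rate * 1 / 0 diverges: the binomial parameters are
# undefined, hence the need for a fallback (e.g. poisson) in that case.
```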

Does it make sense to fall back on poisson for this case? Or should we simply require n_active >= 2 for dist='binomial'? The latter side-steps the issue, but makes the API somewhat inelegant.

@cjacoby
Collaborator

cjacoby commented Jan 9, 2020 via email

@bmcfee
Collaborator Author

bmcfee commented Jan 9, 2020

My gut is that restricting n_active >= 2 is a worse user experience than falling back to poisson, but I'm not confident in that feeling right now.

I agree with that. I went and redid the analysis allowing for non-uniform streamer weights, and it came out a bit more nuanced. The end story is the same, though: convergence to poisson makes sense when we only have one active streamer and we still want a randomized refresh.

In implementing this, I ran into some snags with the unit tests. In the poisson rate model, we hacked a bias of +1 into the sampler to avoid streamers that start and stop without generating any data. This is mostly a desirable property in practice, but it messes with the analysis pretty badly, so I'd like to avoid it in the binomial model. However, this means that the binomial model does need to handle the n=0 case gracefully. Since our current implementation never had to handle that, it was never explicitly tested; unsurprisingly, it doesn't work. It will probably take a good amount of hacking to sort out.

@bmcfee
Collaborator Author

bmcfee commented Jan 14, 2020

Quick update: the problem had to do with pruning empty streams.

@bmcfee
Collaborator Author

bmcfee commented Jan 15, 2020

To clarify my earlier comment: the issue is that if we set n_samples_per_stream = 0 (randomly selected from binomial or poisson), then a stream can be activated, never produce samples, and then be deactivated and pruned from the mux, even though it could still produce new data.

What do you (all) think about removing the stream pruning behavior? Or at least, changing the default behavior to not prune. I think it's ultimately inconsistent with our design principles that streams should be restartable (that being a distinguishing point compared to iterables).

@cjacoby
Collaborator

cjacoby commented Jan 17, 2020 via email

@bmcfee
Collaborator Author

bmcfee commented Jan 31, 2024


Since I'm finally getting around to doing some cleanup here for a 3.0 release, I'll quickly summarize how I think this should all go:

  1. Modify the poisson mode so that we do 1 + poisson(λ-1) instead of 1 + poisson(λ). This way, the expected value is λ but we never terminate a streamer without trying to pull from it at least once.
  2. Add a constant mode that always returns (up to) λ samples. Streamers can self-terminate of course, but there would be no randomness from the mux side in this mode.
  3. Add a binomial mode where the number of samples is 1 + Bin((λ-1)/(1-q), 1-q) with q as the (normalized) weight attached to that streamer. This avoids the problem identified earlier (pruning a stream prematurely) while preserving the expected value of λ.
  4. With the above, it won't matter so much if we retain the default (True) for pruning empty streams.
  5. We still need to handle the q=1 case, i.e. where there is only one active streamer, in binomial mode. Under the definition above, the binomial would be undefined, but we can appeal to the Poisson limit theorem and replace it by a draw from Poisson(λ-1), the limiting distribution as 1-q→0 (and the number of trials →∞). (Hurray, binomial is still a little fishy, so the name isn't misleading!) I'm happy with this solution, as it behaves consistently and still provides the right kind of randomness in the edge cases.
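Putting the pieces together, a rough sketch of the proposed 3.0 sampling logic might look like the following (the function name, the rounding of the binomial count, and the q >= 1 guard are my assumptions, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def n_samples_to_stream(lam, q, dist="binomial"):
    """Sketch of the proposed 3.0 modes. lam is the rate; q is the
    streamer's normalized weight. Every mode returns at least 1 for lam >= 1,
    so a streamer is never activated and pruned without being pulled."""
    if dist == "constant":
        return lam                       # up to lam samples, no mux randomness
    if dist == "poisson":
        return 1 + rng.poisson(lam - 1)  # mean lam, never zero
    if dist == "binomial":
        if q >= 1.0:
            # Single active streamer: the binomial is undefined, so fall
            # back to the Poisson limit 1 + Poisson(lam - 1).
            return 1 + rng.poisson(lam - 1)
        p = 1.0 - q
        n = int(round((lam - 1) / p))    # rounding is an assumption here
        return 1 + rng.binomial(n, p)    # mean ~ 1 + (lam - 1) = lam
    raise ValueError(f"Unknown dist: {dist}")
```

For example, with lam = 8 and q = 0.3 the binomial branch draws 1 + Binomial(10, 0.7), which has mean exactly 8; and because no mode can return 0, the premature-pruning problem above doesn't arise.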

bmcfee added a commit that referenced this issue Feb 1, 2024
@bmcfee bmcfee closed this as completed Mar 12, 2024