uniform stratified sampling of alts #66

mxndrwgrdnr · 2019-08-23T19:41:31Z

Implemented a new feature of the MergedChoiceTable class for performing uniform stratified sampling of alternatives. For use cases that require heavy sampling of alternatives to make the problem tractable (e.g. location choice models), we want to make sure that individual choice sets are still representative. Two new arguments/attributes of the MergedChoiceTable class, sampling_regime and strata, allow the user to trigger stratified sampling and to specify which column from the table of alternatives defines the strata groupings. Because we want each observation to have a representative choice set of alternatives, we have to iterate over each observation and generate choice sets one at a time. As a result, this new sampling method is just as slow as the various sampling-without-replacement methods, even though we are sampling with replacement at a macro level. I think there may be a better way of doing this, whereby the choice sets for all observations can be constructed in one go without iterating over observations, but I'll leave that for the time being.

coveralls · 2019-08-23T19:43:25Z

Coverage increased (+0.04%) to 76.111% when pulling 035d9a4 on stratified_sampling into 54c936d on master.


mxndrwgrdnr · 2019-08-24T15:17:50Z

I figured out how to perform stratified sampling while generating the entire universe of alternatives for all observations at once, and included the fix in the latest commit here. Just had to reorder the observation ids to repeat in sequence (e.g. [1,2,3,1,2,3,1,2,3]) instead of in order (e.g. [1,1,1,2,2,2,3,3,3]).
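A minimal sketch of why the ordering matters, assuming the draws are made stratum by stratum and then concatenated. The function and column names (`stratified_choice_sets`, `obs_id`, `alt_id`, `zone_type`) are hypothetical, and this is not the code from the commit.

```python
import numpy as np
import pandas as pd


def stratified_choice_sets(obs_ids, alts, strata_col, sample_size, seed=0):
    """Draw equal-sized samples from every stratum for all observations at once."""
    rng = np.random.default_rng(seed)
    strata = alts.groupby(strata_col).groups      # stratum label -> Index of alt ids
    per_stratum = sample_size // len(strata)
    n_obs = len(obs_ids)

    # One block of draws per stratum, each n_obs * per_stratum long.
    draws = [
        rng.choice(alt_ids.to_numpy(), size=n_obs * per_stratum, replace=True)
        for alt_ids in strata.values()
    ]
    alt_col = np.concatenate(draws)

    # Tiled ids ([1,2,3,1,2,3,...]) hand every observation a share of each block.
    # Repeated ids ([1,1,1,2,2,2,...]) would give whole blocks to single observations.
    obs_col = np.tile(obs_ids, per_stratum * len(strata))
    return pd.DataFrame({"obs_id": obs_col, "alt_id": alt_col})


alts = pd.DataFrame(
    {"zone_type": ["a", "a", "b", "b", "c", "c"]},
    index=pd.Index([10, 11, 20, 21, 30, 31], name="alt_id"),
)
obs_ids = np.array([1, 2, 3])
mct = stratified_choice_sets(obs_ids, alts, "zone_type", sample_size=6)

# Count how many sampled alternatives each observation received per stratum.
per_obs = mct.join(alts, on="alt_id").groupby(["obs_id", "zone_type"]).size()
```

Because each stratum's block length is a multiple of the number of observations, the tiled ids line up with the block boundaries, so every observation ends up with the same number of alternatives from every stratum.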

Eh2406

I really like this idea. I think it is an easy and powerful way to paper over problems where the distribution of alternatives is changing over time or changing between fitting the model and running it.

Eh2406 · 2019-08-25T16:26:18Z

choicemodels/tools/mergedchoicetable.py

+        If stratified sampling is specified as the sampling regime, then a column name
+        from the alternatives table must be provided upon which stratification will be
+        based. Because equal numbers of samples will be drawn from within each strata,
+        the strata should be distributed roughly evenly across the population, e.g. if


As long as I have used the same sampling strategy to fit the model, why does it need to be distributed roughly evenly across the population?

Only because I wrote the stratification code such that it defines one value for samp_size_per_strata (L388) and uses this number to sample from each stratum. A more generalized version of the code would compute strata proportions from the universe of alternatives and then sample from each stratum accordingly, so that the strata proportions in the sample match those in the universe, but I didn't code that up.
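The generalized version described here could look roughly like the sketch below. It is an illustration only: `proportional_allocation` is a hypothetical helper, not part of choicemodels, and the largest-remainder rounding is one assumed way of making the per-stratum sizes add up to the requested sample size.

```python
import pandas as pd


def proportional_allocation(strata, sample_size):
    """Split sample_size across strata in proportion to their population counts."""
    counts = strata.value_counts().sort_index()
    total = counts.sum()

    # Integer arithmetic avoids floating-point surprises when rounding down.
    scaled = counts * sample_size
    sizes = scaled // total
    shortfall = int(sample_size - sizes.sum())

    if shortfall:
        # Largest-remainder rule: leftover draws go to the strata that lost
        # the most when their share was rounded down.
        remainders = (scaled % total).sort_values(ascending=False)
        sizes.loc[remainders.index[:shortfall]] += 1
    return sizes


strata = pd.Series(["urban"] * 6 + ["suburban"] * 3 + ["rural"] * 1)
alloc_10 = proportional_allocation(strata, 10)
alloc_5 = proportional_allocation(strata, 5)
```

The resulting sizes would replace the single samp_size_per_strata value, one entry per stratum.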

Eh2406 · 2019-08-25T16:29:43Z

choicemodels/tools/mergedchoicetable.py

-
+
+        # STRATIFIED SAMPLING OF ALTS: for now we are only supporting stratified sampling
+        # with replacement and no sampling weights


Why is this restriction needed? How hard is it to support the other configs?

Not hard, just wasn't needed for my use case so I didn't spend the time to write the extra code.

smmaurer · 2019-08-26T18:15:28Z

This looks great, thanks Max!

I feel like the most common use case for other folks is going to be using strata to overweight or underweight particular categories of alternatives. So we should make sure we can add support for that later without too much trouble (i'm mostly thinking about the API here, since that's harder to change than the implementation details).

Looks like a nice way to support that would be to add a parameter called strata_weights that takes a dict or Series mapping each strata id to a proportional value. And the sampling_regime and strata parameters would be unchanged. Does this sound right? That seems completely compatible with what you've implemented here, which is perfect.
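As a rough illustration of the proposed strata_weights idea, the sketch below turns a mapping of stratum ids to proportional values into per-alternative sampling probabilities. Everything here is an assumption: `alt_sampling_probs` is a hypothetical helper, the parameter does not exist in the PR as written, and whether estimates from such a sample need a correction term is a separate question.

```python
import numpy as np
import pandas as pd


def alt_sampling_probs(alts, strata_col, strata_weights):
    """Per-alternative draw probabilities implied by proportional strata weights."""
    weights = pd.Series(strata_weights, dtype=float)
    shares = weights / weights.sum()              # target share of draws per stratum

    stratum_of_alt = alts[strata_col]
    if not stratum_of_alt.isin(shares.index).all():
        raise ValueError("every stratum in the alternatives table needs a weight")

    # Spread each stratum's share evenly over the alternatives it contains.
    stratum_sizes = stratum_of_alt.map(stratum_of_alt.value_counts())
    return stratum_of_alt.map(shares) / stratum_sizes


alts = pd.DataFrame({"zone_type": ["urban"] * 6 + ["suburban"] * 3 + ["rural"]})
probs = alt_sampling_probs(alts, "zone_type", {"urban": 1, "suburban": 1, "rural": 2})

# The probabilities can be passed straight to NumPy's weighted sampler.
rng = np.random.default_rng(0)
sampled = rng.choice(alts.index.to_numpy(), size=5, replace=True, p=probs.to_numpy())
```

With these weights, half of the expected draws come from the single rural alternative even though it makes up a tenth of the universe.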

To do before merging

- bump the version number -- looks like 0.2.2.dev1, and we can go ahead and release it on pypi/conda-forge soon
  - setup.py
  - choicemodels/__init__.py
  - docs/source/index.rst
- add an entry to CHANGELOG.md

mxndrwgrdnr · 2019-08-26T18:42:20Z

@smmaurer so that def prob is the most common use case for folks. However, I'm not sure we necessarily want to give them that ability until we also implement functionality to add a correction term to the probabilities, right? Unless we do that, we'd only be giving them the ability to generate bad estimates, no? What we can do is loosen the restriction on strata being evenly distributed by simply calculating the strata population proportions on the fly and then sampling from the strata accordingly, which would obviate the need for an additional strata_weights parameter.

Eh2406 · 2019-08-26T18:57:39Z

Unless we do that, we'd only be giving them the ability to generate bad estimates, no?

AFAICT no. If the same bias was used in estimating the coefficients, then it should work correctly.

smmaurer · 2019-08-26T19:58:01Z

I feel like it would be fine to allow stratified sampling of alternatives before we implement the correction for MNL estimation.

Honestly, I'm still not convinced that a correction will have any practical effect in most cases, or even be appropriate.

- If you're using strata to generate more realistic choice sets, these should actually be the baseline. It's the random sample from the full universe of conceivable alternatives that would produce biased estimates. (I think this is related to Jacob's point -- we're generally sampling to assert something about the choice sets, not just for our own convenience)
- Everything i've seen indicates that, in addition to making conceptual sense, over-sampling the higher utility alternatives also performs well statistically -- e.g. Lemp & Kockelman 2012
- And even when a sampling correction is necessary, it generally doesn't matter once the number of alternatives is more than a few dozen -- e.g. Frejinger 2007, Jarvis 2018

mxndrwgrdnr · 2019-08-29T16:18:33Z

Responding to your points, @smmaurer:

I thought that McFadden's positive conditioning theory showed explicitly that random sampling of alts from the full universe produces unbiased estimates, no?
Unless I'm mistaken, I believe the authors of this paper do apply a correction/modification to their likelihood function as described in their methodology section:

In the first iteration of this strategic process, SRS of alternatives is used. In each iteration thereafter, alternative inclusion probabilities are set equal to the MNL choice probabilities derived from the previous iteration’s parameter estimates. The likelihood function in the second and any later iterations is updated to include the probability of choice set formation (using weights on alternatives that are proportional to the prior iteration’s choice probability estimates).
Again, unless I'm mistaken, I believe the Jarvis paper is specifically addressing the use of a correction factor in the case of "simple random choice set sampling", which is not what we're talking about with regards to strategic oversampling/undersampling. I can't say I read the whole Frejinger paper either but this sentence from the abstract stood out to me:

The results show that models including a sampling correction are remarkably better than the ones that do not

The reason why I'm convinced we need a correction factor is that McFadden's derivation of the MNL basically says you need one UNLESS the positive conditioning property and uniform conditioning property are met, which really only holds true under simple random sampling of alts.
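For reference, the correction under discussion is usually written as below. The notation is an assumption introduced here, not taken from the thread: $V_j$ is the systematic utility of alternative $j$, $D$ is the sampled choice set, and $\pi(D \mid j)$ is the probability of drawing $D$ given that $j$ is the chosen alternative.

```latex
% MNL probability conditional on a sampled choice set D,
% with the sampling correction term \ln \pi(D \mid j):
P(i \mid D) =
  \frac{\exp\bigl(V_i + \ln \pi(D \mid i)\bigr)}
       {\sum_{j \in D} \exp\bigl(V_j + \ln \pi(D \mid j)\bigr)}

% Positive conditioning: \pi(D \mid j) > 0 for every j \in D.
% Uniform conditioning:  \pi(D \mid i) = \pi(D \mid j) for all i, j \in D,
% in which case the correction terms cancel and the ordinary MNL form remains:
P(i \mid D) = \frac{\exp(V_i)}{\sum_{j \in D} \exp(V_j)}
```

Simple random sampling of alternatives satisfies the uniform conditioning property, which is why no correction appears in that case. Sampling schemes that favor some alternatives over others generally do not satisfy it.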

smmaurer · 2019-08-29T20:38:46Z

Max and i just chatted about this in person, and i agree that it's important to implement the sampling correction, but i also suspect the effect will often be negligible.

I thought that McFadden's positive conditioning theory showed explicitly that random sampling of alts from the full universe produces unbiased estimates, no?

Ideally we want to know someone's true choice set, i think, which in housing/transport situations will have budget and accessibility constraints. What McFadden shows is that random sampling is unbiased compared to the full universe, which is kind of a separate issue. If we're using weights to indicate which alternatives are more or less likely to be in the choice set, that seems ok to me. We're still sampling randomly from our best guess of the full choice set.

Unless I'm mistaken, I believe the authors of this paper do apply a correction/modification to their likelihood function as described in their methodology section

Max is right! It's on page 4, and might help us implement the sampling correction.

The results show that models including a sampling correction are remarkably better than the ones that do not

My reading of the figures is that they show the sampling correction becoming much less important after there are at least a few dozen alternatives. But presumably this is pretty situation-dependent, so including the correction does seem safer.

mxndrwgrdnr · 2019-08-29T21:01:20Z

So, as long as this ticket is still open, I can probably get the proportional strata implemented, to accommodate strata that are not evenly distributed but still sampled randomly and proportionally? Then we can open an issue to implement importance sampling?

mxndrwgrdnr added 2 commits August 23, 2019 12:04

uniform stratified sampling of alts

e3b0796

Cleaned up docs to reflect available sampling regimes

88dbfc6

mxndrwgrdnr requested a review from smmaurer August 23, 2019 19:45

fixed stratified sampling so that it no longer needs to iterate over …

1fe664d

…each observation

mxndrwgrdnr added 2 commits August 24, 2019 11:37

fixed a bug in size of the mct after sampling

8633dee

cleaned up branch

035d9a4

Eh2406 approved these changes Aug 25, 2019


smmaurer approved these changes Aug 26, 2019


mxndrwgrdnr closed this Aug 26, 2019

mxndrwgrdnr reopened this Aug 26, 2019

smmaurer changed the base branch from master to dev February 16, 2021 23:54
