uniform stratified sampling of alts #66

Open: wants to merge 5 commits into base: dev

Conversation

mxndrwgrdnr
Member

Implemented a new feature of the MergedChoiceTable class for performing uniform stratified sampling of alternatives. For use cases that require heavy sampling of alternatives to make the problem tractable (e.g. location choice models), we want to make sure that individual choice sets are still representative. Two new arguments/attributes of the MergedChoiceTable class, sampling_regime and strata, allow the user to trigger stratified sampling and to specify which column from the table of alternatives defines the strata groupings.

Because we want each observation to have a representative choice set of alternatives, we have to iterate over each observation and generate choice sets one at a time. As a result, this new sampling method is just as slow as the various sampling-without-replacement methods, even though we are sampling with replacement at a macro level. I think there may be a better way of doing this, whereby the choice sets for all observations can be constructed in one go without iterating over observations, but I'll leave that for the time being.
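To make the idea concrete, here is a minimal sketch of uniform stratified sampling for a single observation. It is not the code in this PR: the helper name `sample_alts_uniform_strata`, the `zone` column, and the toy alternatives table are all made up for illustration.

```python
import numpy as np
import pandas as pd


def sample_alts_uniform_strata(alts, strata_col, sample_size, rng):
    """Draw an equal number of alternatives from each stratum, with replacement.

    Hypothetical helper for illustration only; not the implementation in this PR.
    """
    strata_ids = alts[strata_col].unique()
    n_per_stratum = sample_size // len(strata_ids)
    picks = []
    for stratum in strata_ids:
        # Candidate alternative ids that belong to this stratum.
        pool = alts.index[alts[strata_col] == stratum].to_numpy()
        picks.append(rng.choice(pool, size=n_per_stratum, replace=True))
    return np.concatenate(picks)


# Toy alternatives table: stratum "a" is common, stratum "c" is rare.
alts = pd.DataFrame(
    {"zone": ["a", "a", "a", "a", "b", "b", "c"]},
    index=pd.Index(range(100, 107), name="alt_id"),
)
rng = np.random.default_rng(0)
choice_set = sample_alts_uniform_strata(alts, "zone", sample_size=6, rng=rng)
```

Every stratum contributes the same number of draws regardless of its size, which is why the rare stratum is over-represented relative to the universe of alternatives.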

@coveralls
coveralls commented Aug 23, 2019

Coverage increased (+0.04%) to 76.111% when pulling 035d9a4 on stratified_sampling into 54c936d on master.

@mxndrwgrdnr mxndrwgrdnr requested a review from smmaurer August 23, 2019 19:45
@mxndrwgrdnr
Member Author

I figured out how to perform stratified sampling while generating the entire universe of alternatives for all observations at once, and included the fix in the latest commit here. Just had to reorder the observation ids to repeat in sequence (e.g. [1,2,3,1,2,3,1,2,3]) instead of in order (e.g. [1,1,1,2,2,2,3,3,3]).
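A rough sketch of why the ordering matters, assuming the draws are concatenated one stratum block at a time (the variable names and toy data here are illustrative, not taken from the commit):

```python
import numpy as np
import pandas as pd

obs_ids = np.array([1, 2, 3])
n_per_stratum = 2  # draws each observation should get from every stratum

alts = pd.DataFrame(
    {"zone": ["a", "a", "a", "b", "b", "c"]},
    index=pd.Index([10, 11, 12, 20, 21, 30], name="alt_id"),
)
rng = np.random.default_rng(0)

# One block of draws per stratum, each long enough to serve every observation.
blocks = [
    rng.choice(group.index.to_numpy(), size=len(obs_ids) * n_per_stratum, replace=True)
    for _, group in alts.groupby("zone")
]
sampled_alts = np.concatenate(blocks)
reps = len(sampled_alts) // len(obs_ids)

# Tiling cycles through the observations inside every stratum block ...
tiled = np.tile(obs_ids, reps)
# ... whereas repeating hands whole stratum blocks to single observations.
repeated = np.repeat(obs_ids, reps)

long_table = pd.DataFrame(
    {
        "obs_id": tiled,
        "alt_id": sampled_alts,
        "zone": alts.loc[sampled_alts, "zone"].to_numpy(),
    }
)
counts = long_table.groupby(["obs_id", "zone"]).size()
```

With the tiled ids, every observation ends up with `n_per_stratum` draws from each stratum; with the repeated ids, each observation would receive draws from only one stratum.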

@Eh2406 Eh2406 left a comment

I really like this idea. I think it is an easy and powerful way to paper over problems where the distribution of alternatives is changing over time or changing between fitting the model and running it.

If stratified sampling is specified as the sampling regime, then a column name
from the alternatives table must be provided upon which stratification will be
based. Because equal numbers of samples will be drawn from within each strata,
the strata should be distributed roughly evenly across the population, e.g. if

As long as I have used the same sampling strategy to fit the model, why does it need to be distributed roughly evenly across the population?

Member Author

Only because I wrote the stratification code such that it defines one value for samp_size_per_strata (L388) and uses this number to sample from each stratum. A more general version of the code would compute strata proportions from the universe of alternatives and then sample from each stratum accordingly, so that the strata proportions in the sample match those in the universe, but I didn't code that up.
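A possible shape for that more general version, as a sketch only: the helper `proportional_allocation` is hypothetical, and it uses largest-remainder rounding (one choice among several) so that the per-stratum counts add up exactly to the requested sample size.

```python
import pandas as pd


def proportional_allocation(strata, sample_size):
    """Split sample_size across strata in proportion to their share of alternatives.

    Hypothetical helper for illustration; integer arithmetic keeps the
    rounding exact, and leftover draws go to the largest remainders.
    """
    sizes = strata.value_counts().sort_index()
    total = sizes.sum()
    counts = (sizes * sample_size) // total
    remainders = (sizes * sample_size) % total
    shortfall = int(sample_size - counts.sum())
    # Hand the leftover draws to the strata with the largest remainders.
    top_up = remainders.sort_values(ascending=False).index[:shortfall]
    counts.loc[top_up] += 1
    return counts


# Toy stratum labels for ten alternatives: 5 in "a", 3 in "b", 2 in "c".
strata = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2)
counts = proportional_allocation(strata, sample_size=7)
```

The resulting counts could then replace the single samp_size_per_strata value, with each stratum sampled according to its own count.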



# STRATIFIED SAMPLING OF ALTS: for now we are only supporting stratified sampling
# with replacement and no sampling weights

Why is this restriction needed? How hard is it to support the other configs?

Member Author

Not hard, just wasn't needed for my use case so I didn't spend the time to write the extra code.

@smmaurer
Member

This looks great, thanks Max!

I feel like the most common use case for other folks is going to be using strata to overweight or underweight particular categories of alternatives. So we should make sure we can add support for that later without too much trouble (i'm mostly thinking about the API here, since that's harder to change than the implementation details).

Looks like a nice way to support that would be to add a parameter called strata_weights that takes a dict or Series mapping each strata id to a proportional value. And the sampling_regime and strata parameters would be unchanged. Does this sound right? That seems completely compatible with what you've implemented here, which is perfect.
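As a rough illustration of how such a parameter might be consumed (nothing here exists in the library yet: `strata_weights` is only the proposal above, and `alt_sampling_probs` is a made-up helper), the per-stratum weights can be turned into per-alternative draw probabilities:

```python
import numpy as np
import pandas as pd


def alt_sampling_probs(alts, strata_col, strata_weights):
    """Per-alternative draw probabilities implied by per-stratum weights.

    Hypothetical helper: `strata_weights` mirrors the proposed parameter
    and is not an existing MergedChoiceTable argument.
    """
    weights = pd.Series(strata_weights, dtype=float)
    weights = weights / weights.sum()        # normalize to stratum shares
    sizes = alts[strata_col].value_counts()  # number of alternatives per stratum
    # Spread each stratum's share evenly over the alternatives it contains.
    return alts[strata_col].map(weights / sizes)


alts = pd.DataFrame(
    {"zone": ["a", "a", "a", "a", "b", "b"]},
    index=pd.Index(range(6), name="alt_id"),
)
probs = alt_sampling_probs(alts, "zone", {"a": 1, "b": 3})

rng = np.random.default_rng(0)
sampled = rng.choice(alts.index.to_numpy(), size=10, replace=True, p=probs.to_numpy())
```

This sketch assumes every stratum present in the alternatives table has an entry in the weights mapping; a real implementation would need to decide how to handle missing strata.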

To do before merging

  • bump the version number -- looks like 0.2.2.dev1, and we can go ahead and release it on pypi/conda-forge soon

    • setup.py
    • choicemodels/__init__.py
    • docs/source/index.rst
  • add an entry to CHANGELOG.md

@mxndrwgrdnr
Member Author

@smmaurer that is most likely the most common use case for folks. However, I'm not sure we necessarily want to give them that ability until we also implement functionality to add a correction term to the probabilities, right? Unless we do that, we'd only be giving them the ability to generate bad estimates, no? What we can do is loosen the restriction on strata being evenly distributed by simply calculating the strata population proportions on the fly and then sampling from the strata accordingly, which would obviate the need for an additional strata_weights parameter.

@mxndrwgrdnr mxndrwgrdnr reopened this Aug 26, 2019
@Eh2406

Eh2406 commented Aug 26, 2019

> Unless we do that, we'd only be giving them the ability to generate bad estimates, no?

AFAICT no. If the same bias was used in estimating the coefficients, then it should work correctly.

@smmaurer
Member

I feel like it would be fine to allow stratified sampling of alternatives before we implement the correction for MNL estimation.

Honestly, I'm still not convinced that a correction will have any practical effect in most cases, or even be appropriate.

  1. If you're using strata to generate more realistic choice sets, these should actually be the baseline. It's the random sample from the full universe of conceivable alternatives that would produce biased estimates. (I think this is related to Jacob's point -- we're generally sampling to assert something about the choice sets, not just for our own convenience)

  2. Everything i've seen indicates that, in addition to making conceptual sense, over-sampling the higher utility alternatives also performs well statistically -- e.g. Lemp & Kockelman 2012

  3. And even when a sampling correction is necessary, it generally doesn't matter once the number of alternatives is more than a few dozen -- e.g. Frejinger 2007, Jarvis 2018

@mxndrwgrdnr
Member Author

Responding to your points, @smmaurer:

  1. I thought that McFadden's positive conditioning theory showed explicitly that random sampling of alts from the full universe produces unbiased estimates, no?
  2. Unless I'm mistaken, I believe the authors of this paper do apply a correction/modification to their likelihood function as described in their methodology section:

    In the first iteration of this strategic process, SRS of alternatives is used. In each iteration thereafter, alternative inclusion probabilities are set equal to the MNL choice probabilities derived from the previous iteration’s parameter estimates. The likelihood function in the second and any later iterations is updated to include the probability of choice set formation (using weights on alternatives that are proportional to the prior iteration’s choice probability estimates).

  3. Again, unless I'm mistaken, I believe the Jarvis paper specifically addresses the use of a correction factor in the case of "simple random choice set sampling", which is not what we're talking about with regard to strategic oversampling/undersampling. I can't say I read the whole Frejinger paper either, but this sentence from the abstract stood out to me:

    The results show that models including a sampling correction are remarkably better than the ones that do not

The reason why I'm convinced we need a correction factor is that McFadden's derivation of the MNL basically says you need one UNLESS the positive conditioning property and uniform conditioning property are both met, which is really only the case under simple random sampling of alts.
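For reference, the result being invoked can be written out as follows. The notation is chosen here and does not appear elsewhere in this thread: $V_j$ is the systematic utility of alternative $j$, $D$ is the sampled choice set, and $\pi(D \mid j)$ is the probability of drawing $D$ given that $j$ is the chosen alternative.

```latex
\[
P(i \mid D) =
  \frac{\exp\bigl(V_i + \ln \pi(D \mid i)\bigr)}
       {\sum_{j \in D} \exp\bigl(V_j + \ln \pi(D \mid j)\bigr)}
\]
```

Positive conditioning requires $\pi(D \mid j) > 0$ for every $j \in D$. Uniform conditioning means $\pi(D \mid j)$ takes the same value for every $j \in D$, in which case the $\ln \pi$ terms cancel and the uncorrected MNL form is recovered.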

@smmaurer
Member

Max and i just chatted about this in person, and i agree that it's important to implement the sampling correction, but i also suspect the effect will often be negligible.

> I thought that McFadden's positive conditioning theory showed explicitly that random sampling of alts from the full universe produces unbiased estimates, no?

Ideally we want to know someone's true choice set, i think, which in housing/transport situations will have budget and accessibility constraints. What McFadden shows is that random sampling is unbiased compared to the full universe, which is kind of a separate issue. If we're using weights to indicate which alternatives are more or less likely to be in the choice set, that seems ok to me. We're still sampling randomly from our best guess of the full choice set.

> Unless I'm mistaken, I believe the authors of this paper do apply a correction/modification to their likelihood function as described in their methodology section

Max is right! It's on page 4, and might help us implement the sampling correction.

> The results show that models including a sampling correction are remarkably better than the ones that do not

My reading of the figures is that they show the sampling correction becoming much less important after there are at least a few dozen alternatives. But presumably this is pretty situation-dependent, so including the correction does seem safer.
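If the correction does get implemented, one widely used form subtracts the log of each alternative's sampling probability from its utility before normalizing. That form is an assumption here, since the thread does not settle on a formula, and the function name `corrected_mnl_probs` is hypothetical.

```python
import numpy as np


def corrected_mnl_probs(utilities, sampling_probs):
    """MNL probabilities over a sampled choice set with a sampling correction.

    Hypothetical sketch: assumes the correction term for alternative j
    is -ln(q_j), where q_j is the probability with which j was sampled.
    """
    adjusted = utilities - np.log(sampling_probs)
    adjusted -= adjusted.max()  # shift for numerical stability
    expu = np.exp(adjusted)
    return expu / expu.sum()


utilities = np.array([1.0, 0.5, 0.0])

# Under uniform sampling the correction is a constant and drops out.
uniform = corrected_mnl_probs(utilities, np.full(3, 1 / 3))
plain = np.exp(utilities) / np.exp(utilities).sum()

# Under unequal sampling, over-sampled alternatives are weighted down.
skewed = corrected_mnl_probs(np.zeros(3), np.array([0.5, 0.25, 0.25]))
```

Comparing `uniform` with `plain` shows why simple random sampling needs no correction, while `skewed` shows how the probabilities shift once some alternatives are sampled more heavily than others.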

@mxndrwgrdnr
Member Author

So, as long as this ticket is still open, I can probably get the proportional strata implemented, to accommodate strata that are not evenly distributed but still sampled randomly and proportionally? Then we can open an issue to implement importance sampling?

@smmaurer smmaurer changed the base branch from master to dev February 16, 2021 23:54
4 participants