uniform stratified sampling of alts #66
base: dev
I figured out how to perform stratified sampling while generating the entire universe of alternatives for all observations at once, and included the fix in the latest commit here. Just had to reorder the observation ids to repeat in sequence (e.g. [1,2,3,1,2,3,1,2,3]) instead of in order (e.g. [1,1,1,2,2,2,3,3,3]).
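A minimal sketch of why the id ordering matters, assuming the samples are drawn one stratum at a time and concatenated. This is not the code from the commit: the function name `sample_alts_stratified`, the column names, and the parameter `k` are hypothetical. Each stratum block holds `n_obs * k` draws, so tiling the ids in sequence hands every observation exactly `k` alternatives from every stratum, whereas repeating them in order would give the first observation draws from the first stratum only.

```python
import numpy as np
import pandas as pd


def sample_alts_stratified(obs_ids, alts, strata_col, k, seed=0):
    """Draw k alternatives per stratum, with replacement, for all observations at once."""
    rng = np.random.default_rng(seed)
    obs_ids = np.asarray(obs_ids)
    n_obs = len(obs_ids)

    # One vectorized draw per stratum covers every observation.
    blocks = [
        rng.choice(group.index.to_numpy(), size=n_obs * k, replace=True)
        for _, group in alts.groupby(strata_col)
    ]
    alt_ids = np.concatenate(blocks)

    # Ids repeat in sequence ([1,2,3,1,2,3,...]); because each stratum block
    # has n_obs * k entries, every observation gets exactly k draws from it.
    tiled_obs = np.tile(obs_ids, len(blocks) * k)
    return pd.DataFrame({"obs_id": tiled_obs, "alt_id": alt_ids})


# Toy alternatives table (hypothetical data): three strata, two alternatives each.
alts = pd.DataFrame({"zone": ["a", "a", "b", "b", "c", "c"]},
                    index=[10, 11, 20, 21, 30, 31])
sample = sample_alts_stratified([1, 2, 3], alts, "zone", k=2)

# Count how many sampled alternatives each observation received per stratum.
counts = sample.join(alts["zone"], on="alt_id").groupby(["obs_id", "zone"]).size()
```

Swapping `np.tile` for `np.repeat` in this sketch keeps the total number of rows the same but breaks the per-observation stratification, which is one plausible reading of the reordering described above.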
I really like this idea. I think it is an easy and powerful way to paper over problems where the distribution of alternatives is changing over time or changing between fitting the model and running it.
If stratified sampling is specified as the sampling regime, then a column name
from the alternatives table must be provided upon which stratification will be
based. Because equal numbers of samples will be drawn from within each strata,
the strata should be distributed roughly evenly across the population, e.g. if
As long as I have used the same sampling strategy to fit the model, why do the strata need to be distributed roughly evenly across the population?
Only because I wrote the stratification code such that it defines one value for `samp_size_per_strata` (L388) and uses this number to sample from each stratum. A more generalized version of the code would compute strata proportions from the universe of alternatives and then sample from each stratum accordingly, such that the strata proportions in the sample match those from the universe, but I didn't code that up.
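The more generalized version described above could be sketched roughly as follows. This is only an illustration, not code from the PR: the helper name `proportional_strata_sizes` and the toy column `zone` are hypothetical, and largest-remainder rounding is an assumed choice for making the per-stratum sizes add up to the requested sample size.

```python
import numpy as np
import pandas as pd


def proportional_strata_sizes(alts, strata_col, sample_size):
    """Split sample_size across strata in proportion to their share of the alternatives."""
    counts = alts[strata_col].value_counts().sort_index()
    exact = counts * sample_size / counts.sum()  # ideal, possibly fractional, sizes
    sizes = np.floor(exact).astype(int)

    # Largest-remainder rounding: give the leftover draws to the strata
    # whose fractional parts are largest, so the sizes sum to sample_size.
    shortfall = int(sample_size - sizes.sum())
    remainders = (exact - sizes).sort_values(ascending=False, kind="stable")
    sizes.loc[remainders.index[:shortfall]] += 1
    return sizes


# Toy universe of alternatives (hypothetical data): strata shares of 50%, 30%, 20%.
alts = pd.DataFrame({"zone": ["a"] * 5 + ["b"] * 3 + ["c"] * 2})
sizes = proportional_strata_sizes(alts, "zone", sample_size=7)
```

The resulting per-stratum sizes would replace the single `samp_size_per_strata` value, so unevenly distributed strata are sampled in proportion to their population share.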
# STRATIFIED SAMPLING OF ALTS: for now we are only supporting stratified sampling
# with replacement and no sampling weights
Why is this restriction needed? How hard is it to support the other configs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not hard, just wasn't needed for my use case so I didn't spend the time to write the extra code.
This looks great, thanks Max! I feel like the most common use case for other folks is going to be using strata to overweight or underweight particular categories of alternatives. So we should make sure we can add support for that later without too much trouble (I'm mostly thinking about the API here, since that's harder to change than the implementation details). Looks like a nice way to support that would be to add a parameter called
To do before merging
@smmaurer so that def prob is the most common use case for folks. However, I'm not sure we necessarily want to give them that ability until we also implement functionality to add a correction term to the probabilities, right? Unless we do that, we'd only be giving them the ability to generate bad estimates, no? What we can do is loosen the restriction on strata being evenly distributed by simply calculating the strata population proportions on the fly and then sampling from the strata accordingly, which would obviate the need for an additional
AFAICT, no. If the same bias was used in estimating the coefficients, then it should work correctly.
I feel like it would be fine to allow stratified sampling of alternatives before we implement the correction for MNL estimation. Honestly, I'm still not convinced that a correction will have any practical effect in most cases, or even be appropriate.
Responding to your points, @maurer:
The reason why I'm convinced we need a correction factor is that McFadden's derivation of the MNL basically says you need one UNLESS the positive conditioning property and uniform conditioning property are met, which really only hold true under simple random sampling of alts.
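For reference, the general form of that correction can be written as below. This is a sketch from the standard treatment of estimation with sampled alternatives, not something taken from this PR; the notation is assumed here: $V_j$ is the systematic utility of alternative $j$, $D$ is the sampled choice set, and $\pi(D \mid j)$ is the probability of drawing $D$ given that $j$ is the chosen alternative.

```latex
P(i \mid D) =
  \frac{\exp\bigl(V_i + \ln \pi(D \mid i)\bigr)}
       {\sum_{j \in D} \exp\bigl(V_j + \ln \pi(D \mid j)\bigr)}
```

Positive conditioning requires $\pi(D \mid j) > 0$ for every $j \in D$. Under uniform conditioning, $\pi(D \mid j)$ is the same for every $j \in D$, so the $\ln \pi$ terms cancel and the expression reduces to the ordinary MNL formula over the sampled set. When strata are deliberately over- or under-sampled, the terms generally differ across alternatives and do not cancel.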
Max and I just chatted about this in person, and I agree that it's important to implement the sampling correction, but I also suspect the effect will often be negligible.
Ideally we want to know someone's true choice set, I think, which in housing/transport situations will have budget and accessibility constraints. What McFadden shows is that random sampling is unbiased compared to the full universe, which is kind of a separate issue. If we're using weights to indicate which alternatives are more or less likely to be in the choice set, that seems OK to me. We're still sampling randomly from our best guess of the full choice set.
Max is right! It's on page 4, and might help us implement the sampling correction.
My reading of the figures is that they show the sampling correction becoming much less important after there are at least a few dozen alternatives. But presumably this is pretty situation-dependent, so including the correction does seem safer.
So, as long as this ticket is still open, I can probably get the proportional strata implemented, to accommodate strata that are not evenly distributed but still sampled randomly and proportionally? Then we can open an issue to implement importance sampling?
Implemented a new feature of the `MergedChoiceTable` class for performing uniform stratified sampling of alternatives. For use cases that require heavy sampling of alternatives to make the problem tractable (e.g. location choice models), we want to make sure that individual choice sets are still representative. Two new arguments/attributes to the `MergedChoiceTable` class, `sampling_regime` and `strata`, allow the user to trigger stratified sampling and to specify the column from the table of alternatives that defines the strata groupings. Because we want each observation to have a representative choice set of alternatives, we have to iterate over each observation and generate choice sets one at a time. As a result, this new sampling method is just as slow as the various sampling-without-replacement methods, even though we are sampling with replacement at a macro level. I think there may be a better way of doing this, whereby the choice sets for all observations can be constructed in one go, without iterating over observations, but I'll leave that for the time being.