Fix weighted sampling without replacement #239

Merged: 3 commits, May 4, 2017

Conversation

bkamins
Contributor

@bkamins bkamins commented Feb 15, 2017

A proposal to fix #238. The original article assumes positive weights, so I propose to skip zero weights.
Additionally, the code now strictly checks that `wv` contains at least as many positive weights as the requested sample size.
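
The proposal above can be sketched roughly as follows. This is a minimal illustration, not the StatsBase implementation: the function name `sample_norep_sketch` is hypothetical, it takes a plain vector instead of a `WeightVec`, and it uses an exponential-key variant of the Efraimidis & Spirakis idea (an assumption made for brevity). Negative weights are simply not treated as candidates here.

```julia
using Random

# Hypothetical helper for illustration; not the StatsBase implementation.
function sample_norep_sketch(rng::AbstractRNG, weights::AbstractVector{<:Real}, k::Integer)
    k >= 0 || throw(ArgumentError("k must be non-negative (got $k)"))
    # Only strictly positive weights are candidates; zero weights are skipped.
    candidates = findall(w -> w > 0, weights)
    length(candidates) >= k ||
        throw(DimensionMismatch("weights must have at least $k strictly positive entries (got $(length(candidates)))"))
    k == 0 && return Int[]
    # One key per candidate: an Exp(1) draw divided by the weight.
    # Keeping the k smallest keys is equivalent to keeping the k largest u^(1/w).
    sortkeys = [randexp(rng) / weights[i] for i in candidates]
    return candidates[partialsortperm(sortkeys, 1:k)]
end
```

For example, `sample_norep_sketch(MersenneTwister(1), [0.0, 2.0, 0.0, 1.0, 3.0], 3)` can only return the indices 2, 4 and 5 (in some order), because the zero-weight entries are never candidates.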
@ararslan
Member

It would be great if you could add a test.

src/sampling.jl Outdated
i = 0
s = 0
@inbounds for s in 1:n
if wv.values[s] > 0.0
Member

`> 0` should be enough, no?

src/sampling.jl Outdated
end
i < k && throw(DimensionMismatch("wv must have at least $k positive entries (got $i)"))
Member

"strictly positive"

src/sampling.jl Outdated
w = wv.values[i]
w > 0.0 || continue
Member

Same here. It would be more consistent to use `> 0` since `<= 0` is used below.

@nalimilan
Member

So negative weights are now silently ignored? It might be better to raise an error for negative weights, since they do not have any meaning at the moment. This should also be mentioned in the docs.
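
The suggestion above, raising an error instead of silently skipping negative weights, could look roughly like this. It is a sketch under assumptions: the helper name `count_positive_checked` and the choice of `ArgumentError` are hypothetical and not taken from the PR.

```julia
# Hypothetical helper: count strictly positive weights, rejecting negative ones.
function count_positive_checked(weights::AbstractVector{<:Real})
    npos = 0
    for w in weights
        # A negative weight has no defined meaning here, so fail loudly.
        w < 0 && throw(ArgumentError("negative weight found: $w"))
        # Zero weights are allowed but do not count as candidates.
        w > 0 && (npos += 1)
    end
    return npos
end
```

The returned count can then be compared with the requested sample size before sampling starts.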

@bkamins
Contributor Author

bkamins commented Feb 16, 2017

I have added tests and corrected the code following @nalimilan's suggestions. I did not update the docs because I was unsure of the best approach (other `sample` variants do not check for negative weights).
Maybe simply stating that the behavior of `sample` with negative weights is unspecified would be best?

@nalimilan
Member

If other methods don't do the checking, I would just file an issue about this or make a PR so that all methods are consistent. Documenting an unspecified behavior doesn't sound too useful.

@bkamins
Contributor Author

bkamins commented Feb 16, 2017

That's why I thought that not documenting it is the best option for now.

If I understand the other methods correctly, in particular `sample(wv::WeightVec)`, they do not check for negative entries because that would carry a performance penalty (they use the precomputed sum in `WeightVec` and do not always visit every element of `wv`).

So the question is whether we want to pay this performance penalty. As I understand it, the `WeightVec` constructor deliberately skips the non-negativity check for performance reasons, even though the constructor would be the most natural place for it.
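
To illustrate the point about not visiting every element, here is a rough sketch of a single weighted draw that relies on a precomputed sum and stops as soon as the cumulative weight reaches the random threshold. The name `sample_index_sketch` and its signature are hypothetical; it assumes a non-empty vector and is only meant to mirror the idea described above, not the actual StatsBase code.

```julia
using Random

# Hypothetical sketch: draw one index using a precomputed total weight.
function sample_index_sketch(rng::AbstractRNG, weights::AbstractVector{<:Real}, total::Real)
    t = rand(rng) * total        # random threshold in [0, total)
    n = length(weights)
    i = 1
    cw = weights[1]              # cumulative weight so far
    # Stop early once the cumulative weight reaches the threshold,
    # so elements after index i are never inspected.
    while cw < t && i < n
        i += 1
        cw += weights[i]
    end
    return i
end
```

Because the loop can exit before reaching the end of the vector, a negative weight near the end would go unnoticed unless a separate full pass is made, which is the cost being discussed.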

@nalimilan
Member

Thanks for checking. One issue is that negative weights might make sense for some applications but not others, so checking when constructing the weights vector may not be possible. But you're right that for some functions the cost of the check might be significant; could you point me at the method which doesn't go over all individual weights?

I'm not sure what to do. I'd like to hear others' opinions about this.

@bkamins
Contributor Author

bkamins commented Feb 17, 2017

As I mentioned, it is `sample(wv::WeightVec)`. Here is the line:
https://github.com/JuliaStats/StatsBase.jl/blob/master/src/sampling.jl#L354
This function is probably called in about 90% of uses of weighted sampling with replacement (the alias method is chosen only for large tasks, which are probably rare).

My opinion is that, of the options I have mentioned, `WeightVec` should enforce non-negativity, as this is a one-time effort (no need to check again every time the `WeightVec` is used).

One option is to add a `nonnegative` argument to the `WeightVec` constructor, defaulting to `true`. If it is `true`, only non-negative weights are allowed. The user could explicitly set it to `false` to allow negative weights (and the documentation should warn that most methods in StatsBase assume non-negative weights).

What are the uses of negative weights where the sum of weights is relevant? (That is what `WeightVec` effectively provides now.)
The uses of negative weights I know of are in graph contexts, but I would not expect anyone to use `WeightVec` there.
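
As a rough sketch of the constructor option discussed in this comment: the type below is a hypothetical stand-in named `CheckedWeights`, not the real `WeightVec`, and the keyword form of `nonnegative` is an assumption about how such an argument might be exposed.

```julia
# Hypothetical stand-in for a weights type that validates once, at construction.
struct CheckedWeights
    values::Vector{Float64}
    sum::Float64
    function CheckedWeights(values::AbstractVector{<:Real}; nonnegative::Bool=true)
        # One-time check; callers opting out must pass nonnegative=false explicitly.
        if nonnegative && any(w -> w < 0, values)
            throw(ArgumentError("weights must be non-negative; pass nonnegative=false to allow negative entries"))
        end
        vals = collect(Float64, values)
        return new(vals, sum(vals))   # the sum is precomputed, as described above
    end
end
```

With this design the check costs one pass at construction, and later sampling calls can trust the stored values without re-validating them.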

@bkamins
Contributor Author

bkamins commented Apr 23, 2017

@nalimilan Some time has passed and there is no feedback. The algorithm of Efraimidis & Spirakis does not support negative weights anyway.

Member

@nalimilan nalimilan left a comment

Sorry for the delay. Looks good to me apart from one detail I spotted.

test/sampling.jl Outdated
@@ -149,3 +149,29 @@ check_sample_norep(a, (3, 12), 0; ordered=false)

a = sample(3:12, 5; replace=false, ordered=true)
check_sample_norep(a, (3, 12), 0; ordered=true)

# test of weighted sampling without replacement
import StatsBase: sample
Member

This isn't needed.

@bkamins
Contributor Author

bkamins commented May 3, 2017

@nalimilan fixed

@nalimilan nalimilan merged commit 5f8ef16 into JuliaStats:master May 4, 2017
@nalimilan
Member

Thanks!

@nalimilan nalimilan changed the title from "Fix #238" to "Fix weighted sampling without replacement" May 4, 2017
@nalimilan
Member

BTW, if you're interested in this part of the code, it seems we forgot to mention them in the list of algorithms provided in the docs at docs/source/sampling.rst.

@nalimilan
Member

Oops, we had completely forgotten about the need to find a general fix for other methods. I've found a solution which doesn't incur an unnecessary cost when constructing weights vectors if you never use them in a context that needs to check that all entries are positive. See #834.

Development

Successfully merging this pull request may close these issues.

Weighted sampling without replacement
3 participants