Fix weighted sampling without replacement #239

bkamins · 2017-02-15T21:55:07Z

A proposal to fix #238. The original article assumes positive weights, so I propose skipping zero weights. Additionally, the code now strictly checks that `wv` contains at least as many positive weights as the requested sample size.

ararslan · 2017-02-16T01:53:17Z

It would be great if you could add a test.

nalimilan · 2017-02-16T10:50:30Z

src/sampling.jl

+    i = 0
+    s = 0
+    @inbounds for s in 1:n
+        if wv.values[s] > 0.0


`> 0` should be enough, no?

nalimilan · 2017-02-16T10:50:57Z

src/sampling.jl

    end
+    i < k && throw(DimensionMismatch("wv must have at least $k positive entries (got $i)"))


"strictly positive"

nalimilan · 2017-02-16T10:51:51Z

src/sampling.jl

        w = wv.values[i]
+        w > 0.0 || continue


Same here. It would be more consistent to use `> 0`, since `<= 0` is used below.

nalimilan · 2017-02-16T10:53:32Z

So now negative weights are silently ignored? Maybe it would be better to raise an error for negative weights, which do not have any meaning at the moment? This should also be mentioned in the docs.

bkamins · 2017-02-16T19:48:13Z

I have added tests and corrected the code following @nalimilan's suggestions. I did not update the docs, as I was unsure what the best approach would be (other `sample` variants do not check for negative weights).
Maybe simply stating that the behavior of `sample` with negative weights is unspecified in general would be best?

nalimilan · 2017-02-16T20:20:36Z

If other methods don't do the checking, I would just file an issue about this or make a PR so that all methods are consistent. Documenting an unspecified behavior doesn't sound too useful.

bkamins · 2017-02-16T22:28:53Z

That's why I thought that not documenting it is the best option for now.

If I understand the other methods correctly, in particular `sample(wv::WeightVec)`, they do not check for negative entries because doing so would incur a performance penalty (they use the precomputed sum stored in `WeightVec` and do not always visit every element of `wv`).

So the question is whether we want to pay this performance penalty. As I understand it, the `WeightVec` constructor deliberately does not check for non-negativity for performance reasons, even though the constructor would be the most natural place for this check.
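To illustrate the point about the precomputed sum, here is a minimal sketch of such a sampler. It is not the StatsBase implementation: `CachedWeights` and `sample_index` are hypothetical names used only for this example, and the type is assumed to store the sum of its values at construction.

```julia
using Random

# Hypothetical stand-in for a weights vector that caches its sum (an assumption
# for illustration, not the real WeightVec type).
struct CachedWeights
    values::Vector{Float64}
    total::Float64
end
CachedWeights(values::Vector{Float64}) = CachedWeights(values, sum(values))

# Draw one index with probability proportional to its weight.
function sample_index(rng::AbstractRNG, wv::CachedWeights)
    t = rand(rng) * wv.total      # threshold scaled by the cached sum
    n = length(wv.values)
    i = 1
    cw = wv.values[1]             # running cumulative weight
    # Stop as soon as the running sum reaches the threshold;
    # entries after index i are never read.
    while cw < t && i < n
        i += 1
        cw += wv.values[i]
    end
    return i
end
```

Because the loop exits early, a negative entry located after the stopping index is never seen. For example, `sample_index(rng, CachedWeights([1.0, 1.0, -5.0]))` always returns index 1 without raising an error. Detecting such entries would require a full pass over the values on every call.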

nalimilan · 2017-02-17T09:11:25Z

Thanks for checking. One issue is that negative weights might make sense for some applications but not others, so checking when constructing the weights vector may not be possible. But you're right that for some functions the cost of the check might be significant; could you point me at the method which doesn't go over all individual weights?

I'm not sure what to do. I'd like to hear others' opinions about this.

bkamins · 2017-02-17T13:40:33Z

As I have mentioned, it is in `sample(wv::WeightVec)`. Here is the line:
https://github.com/JuliaStats/StatsBase.jl/blob/master/src/sampling.jl#L354
This function is probably called in 90% of the uses of weighted sampling with replacement (the alias method is chosen only for large tasks, which are probably rare).

My opinion is that, of the options I have mentioned, `WeightVec` should enforce non-negativity, as this is a one-time effort (there is no need to check it again every time the `WeightVec` is used).

One option is to add an argument `nonnegative` to the `WeightVec` constructor, defaulting to `true`. If it is `true`, only non-negative weights are allowed. The user could explicitly set this argument to `false` to allow negative weights (and the documentation should warn the user that most methods in StatsBase assume non-negative weights).
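A rough sketch of what this could look like follows. The type `CheckedWeights` is hypothetical and only stands in for `WeightVec`; the keyword name `nonnegative` is the one proposed above, and nothing here reflects an existing StatsBase API.

```julia
# Hypothetical weights type with a one-time validity check at construction.
struct CheckedWeights
    values::Vector{Float64}
    total::Float64

    function CheckedWeights(values::Vector{Float64}; nonnegative::Bool=true)
        # The check runs once here, so samplers do not need to repeat it.
        if nonnegative && any(w -> w < 0, values)
            throw(ArgumentError(
                "weights must be non-negative; pass nonnegative=false to allow negative entries"))
        end
        return new(values, sum(values))
    end
end
```

With this design, `CheckedWeights([1.0, 2.0])` succeeds, `CheckedWeights([1.0, -2.0])` throws an `ArgumentError`, and `CheckedWeights([1.0, -2.0]; nonnegative=false)` opts out of the check.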

What are the uses of negative weights in which the sum of the weights is relevant? (This is what `WeightVec` effectively provides now.)
The uses of negative weights I know of are in graph contexts, but I would not assume that someone would use `WeightVec` there.

bkamins · 2017-04-23T12:12:05Z

@nalimilan Some time has passed and there is no feedback. The algorithm of Efraimidis & Spirakis does not support negative weights anyway.
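For reference, here is a minimal sketch of key-based weighted sampling without replacement in the style of Efraimidis & Spirakis, with the two behaviors discussed in this PR: zero weights are skipped, and an error is raised when there are fewer than `k` strictly positive weights. The function name `weighted_sample_norep` is hypothetical, this is not the code in `src/sampling.jl`, and rejecting negative weights with an `ArgumentError` is an assumption made for the example.

```julia
using Random

# Each item with weight w > 0 gets the key log(u)/w with u ~ Uniform(0, 1),
# which orders items the same way as u^(1/w); the k largest keys are kept.
function weighted_sample_norep(rng::AbstractRNG, weights::Vector{Float64}, k::Int)
    any(w -> w < 0, weights) &&
        throw(ArgumentError("negative weights are not supported"))

    keyed = Tuple{Float64,Int}[]
    for (i, w) in enumerate(weights)
        w > 0 || continue                      # a zero weight is never selected
        push!(keyed, (-randexp(rng) / w, i))   # -randexp() is distributed as log(u)
    end

    length(keyed) < k && throw(DimensionMismatch(
        "weights must have at least $k strictly positive entries (got $(length(keyed)))"))

    sort!(keyed; rev=true)                     # largest keys first
    return [i for (_, i) in keyed[1:k]]
end
```

The key is undefined for a zero weight (division by zero), which is why such entries have to be filtered out before keys are generated rather than handled afterwards.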

nalimilan

Sorry for the delay. Looks good to me apart from one detail I spotted.

nalimilan · 2017-05-02T14:20:47Z

test/sampling.jl

@@ -149,3 +149,29 @@ check_sample_norep(a, (3, 12), 0; ordered=false)

 a = sample(3:12, 5; replace=false, ordered=true)
 check_sample_norep(a, (3, 12), 0; ordered=true)
+
+# test of weighted sampling without replacement
+import StatsBase: sample


This isn't needed.

bkamins · 2017-05-03T21:44:46Z

@nalimilan fixed

nalimilan · 2017-05-04T08:13:43Z

Thanks!

nalimilan · 2017-05-06T12:59:26Z

BTW, if you're interested in this part of the code, it seems we forgot to mention them in the list of algorithms provided in the docs at docs/source/sampling.rst.

nalimilan · 2022-09-03T21:50:55Z

Oops, we had completely forgotten about the need to find a general fix for other methods. I've found a solution which doesn't incur an unnecessary cost when constructing weights vectors if you never use them in a context that needs to check that all entries are positive. See #834.

Fix #238

60cbcc9

weighted sampling without replacement: added tests and small fixes

5ed66f3

removed redundant import in tests

77ec83a

ararslan approved these changes May 4, 2017

nalimilan merged commit 5f8ef16 into JuliaStats:master May 4, 2017

nalimilan changed the title ~~Fix #238~~ Fix weighted sampling without replacement May 4, 2017
