Fix some issues with sampling #879

ararslan · 2023-07-05T20:59:49Z

This PR is separated into three distinct changes:

- Fixing "sample fails to throw error for inconsistent lengths" (#871) and "SegFault with efraimidis_a_wsample_norep!" (#877)
- NFC to use testsets in weighted sampling tests
- Fix weighted sampling with UnitWeights

See commit messages for additional info as applicable.

Fixes #871
Fixes #877

Currently sampling functions need to perform the same set of checks on the inputs and those checks are copied and pasted for each method. We can instead define a simple input validation function that can be used by all sampling functions so that any additional corner cases that need to be caught can be fixed in one place and propagated elsewhere. Relatedly, this adds checks for agreement between the length of the source array to be sampled and the array of weights (issue 871) as well as that the destination array is not larger than the source when sampling without replacement (issue 877).
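
A minimal sketch of what such a shared validation helper could look like. The name `check_sample_inputs`, its signature, and its error messages are assumptions made for illustration; this is not the `_validate_sample_inputs` implementation from the PR.

```julia
# Hypothetical helper, not the code from this PR.
# `a` is the source array, `wv` the weights, `x` the destination array,
# and `replace` says whether sampling is done with replacement.
function check_sample_inputs(a::AbstractArray, wv::AbstractVector,
                             x::AbstractArray, replace::Bool)
    n = length(a)
    nw = length(wv)
    # Source and weights must agree in length (the situation in #871).
    nw == n || throw(DimensionMismatch(
        "source and weights must have the same length (got $n and $nw)"))
    k = length(x)
    # Without replacement, at most n items can be drawn (the situation in #877).
    if !replace && k > n
        throw(DimensionMismatch(
            "cannot draw $k samples from $n items without replacement"))
    end
    return nothing
end
```

Each sampling method would call the helper once at its top, so a newly discovered corner case only needs to be handled in one place.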

Not all `AbstractWeights` subtypes have a `values` field, e.g. `UnitWeights`, but all have indexing defined, so that can be used instead of trying to index into the underlying array.
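
To illustrate the point without depending on StatsBase, here is a small self-contained sketch. `OnesWeights`, `StoredWeights`, and `cumulative_weight` are hypothetical stand-ins invented for this example; they are not StatsBase types or functions.

```julia
# Hypothetical weight type with no `values` field; it stores only a length.
struct OnesWeights <: AbstractVector{Float64}
    len::Int
end
Base.size(w::OnesWeights) = (w.len,)
function Base.getindex(w::OnesWeights, i::Int)
    @boundscheck checkbounds(w, i)
    return 1.0
end

# Hypothetical weight type that does keep its weights in a `values` field.
struct StoredWeights <: AbstractVector{Float64}
    values::Vector{Float64}
end
Base.size(w::StoredWeights) = size(w.values)
Base.getindex(w::StoredWeights, i::Int) = w.values[i]

# Going through `w[i]` works for both types, whereas `w.values[i]`
# would throw for OnesWeights because that field does not exist.
function cumulative_weight(w::AbstractVector, k::Integer)
    total = 0.0
    for i in 1:k
        total += w[i]
    end
    return total
end
```

Code written against the indexing interface stays correct for any weight type that defines `getindex`, regardless of how the weights are stored.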

devmotion · 2023-07-05T21:05:16Z

src/sampling.jl

-    n = length(a)
-    length(wv) == n || throw(DimensionMismatch("a and wv must be of same length (got $n and $(length(wv)))."))
-    k = length(x)
+    n, k = _validate_sample_inputs(a, wv, x, false)


Is this function really part of the official API and needs checks of the arguments? IIRC I had never intended it to be called by any user directly.

And if users use the (IMO) intended sample API then the arguments are already checked I assume.

I wouldn't have thought it was intended to be user-facing at all except, as pointed out in #876, it's included in the manual. 😕

But it's not exported, is it?

It's not, no. That said, there are implementations of three different Efraimidis-Spirakis algorithms (A, A-Res, and AExpJ), only one of which (AExpJ) is actually used internally by a function like `sample`. That suggests to me that these were intended to be used outside the context of `sample`, but I could very well be mistaken, as I don't know the history.

Docs were added in #254.

It was actually summer 2016 and you reviewed it

LOL amazing. My brain runs the GC often so 7 years ago is long gone.

I definitely buy the argument that the separate, non-exported functions that each implement specific algorithms should not be considered user-facing and thus shouldn't need to perform the same kind of safety checks as those intended to be called directly. What gets me nervous is that there's nothing saying they aren't user-facing, hence issues like #876 and #877. Perhaps we could add admonitions to the docstrings, e.g.

!!! note
    This function is not intended to be called directly and is not considered part of the package's API.

?

A bit tangential to this discussion but in the future we could do something for sampling algorithms as is done for sorting algorithms in Base: each algorithm gets a type that subtypes some abstract sampling algorithm type then the user may select a particular algorithm via a keyword argument to `sample`, e.g. `sample(x, wv; alg=EfraimidisAExpJ())`, and internally that dispatches to use e.g. `efraimidis_aexpj_wsample_norep!` (after doing any appropriate argument checking 😄).
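
A rough sketch of how such an `alg` keyword could be wired up. Every name here (`SamplingAlgorithm`, `EfraimidisA`, `EfraimidisARes`, `sample_norep`, `_sample_norep`) is hypothetical and none of this exists in StatsBase; the two methods are simplified versions of Efraimidis-Spirakis algorithms A and A-Res written for this example, and AExpJ is left out for brevity.

```julia
using Random

# Hypothetical algorithm types, mirroring how Base selects sorting algorithms.
abstract type SamplingAlgorithm end
struct EfraimidisA <: SamplingAlgorithm end
struct EfraimidisARes <: SamplingAlgorithm end

# Algorithm A: give item i the key log(u_i) / w_i and keep the k largest keys.
function _sample_norep(rng::AbstractRNG, ::EfraimidisA, a, w, k)
    ks = [log(rand(rng)) / w[i] for i in eachindex(w)]
    return a[partialsortperm(ks, 1:k; rev=true)]
end

# Algorithm A-Res: keep a reservoir of the k largest keys seen so far.
# A linear scan finds the smallest key; a heap would be used in practice.
function _sample_norep(rng::AbstractRNG, ::EfraimidisARes, a, w, k)
    idx = collect(1:k)
    ks = [log(rand(rng)) / w[i] for i in 1:k]
    for i in (k + 1):length(a)
        key = log(rand(rng)) / w[i]
        m = argmin(ks)
        if key > ks[m]
            ks[m] = key
            idx[m] = i
        end
    end
    return a[idx]
end

# User-facing entry point: validate once, then dispatch on `alg`.
# Assumes ordinary 1-based vectors.
function sample_norep(rng::AbstractRNG, a::AbstractVector, w::AbstractVector,
                      k::Integer; alg::SamplingAlgorithm=EfraimidisA())
    length(a) == length(w) ||
        throw(DimensionMismatch("a and w must have the same length"))
    0 <= k <= length(a) ||
        throw(ArgumentError("k must be between 0 and length(a)"))
    k == 0 && return similar(a, 0)
    return _sample_norep(rng, alg, a, w, k)
end
```

With this layout a call such as `sample_norep(MersenneTwister(1), 1:10 |> collect, fill(1.0, 10), 4; alg=EfraimidisARes())` checks its arguments exactly once, and the per-algorithm methods can stay unchecked and internal.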

Yeah, passing types via an alg keyword argument would be the best API.

Better perform checks anyway, except if this means we run checks twice when called from sample. Is that the case?

> except if this means we run checks twice when called from sample. Is that the case?

Currently yes. I can add a flag to the internal checking function that makes it a no-op if called from sample but perhaps that's more complex than necessary.

How expensive are the checks? Is there a noticeable performance difference between calling sample and the internal function?

The alg keyword argument seems a reasonable suggestion for future refactorings.

For the time being I would prefer adding a warning or note to the docstrings of these internal functions. I think it was a mistake to add them to the docs at all (also based on the initial + follow-up PRs), so I would be fine even with just removing them from the docs. They're not exported and IMO have never been part of the official API (or at least they were not supposed to be).

nalimilan · 2023-07-09T13:46:19Z

src/sampling.jl

-sample(rng::AbstractRNG, a::AbstractArray, wv::AbstractWeights) = a[sample(rng, wv)]
+function sample(rng::AbstractRNG, a::AbstractArray, wv::AbstractWeights)
+    _validate_sample_inputs(a, wv)
+    return a[sample(rng, wv)]


It's weird that this line isn't tested.

nalimilan · 2023-07-09T13:50:14Z

src/sampling.jl

-    n = length(a)
-    length(wv) == n || throw(DimensionMismatch("a and wv must be of same length (got $n and $(length(wv)))."))
-    k = length(x)
+    n, k = _validate_sample_inputs(a, wv, x, false)


Yeah, passing types via an alg keyword argument would be the best API.

Better perform checks anyway, except if this means we run checks twice when called from sample. Is that the case?

Co-authored-by: Milan Bouchet-Valat <[email protected]>

ParadaCarleton · 2023-08-22T19:09:44Z

What's holding up this PR?

ararslan · 2023-08-24T00:30:34Z

> What's holding up this PR?

The following, which I've not had time for:

1. Remember this PR exists
2. Determine whether executing the checks twice introduces noticeable overhead
3. If it does, don't run them twice
4. Remove some internal functions from the documentation

ararslan added 3 commits July 5, 2023 13:30

Use testsets and loops to avoid copy-pastas in test/wsampling.jl

d8dfe60

Don't access weight vector .values unnecessarily

d56a40a

Not all `AbstractWeights` subtypes have that field, e.g. `UnitWeights`, but all have indexing defined, so that can be used instead of trying to index into the underlying array.

ararslan requested review from devmotion and nalimilan July 5, 2023 20:59

devmotion reviewed Jul 5, 2023

nalimilan reviewed Jul 9, 2023

ararslan and others added 2 commits July 10, 2023 08:38

Improve error message

4a1e261

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Don't have _validate_sample_inputs return lengths

faa481f
