Fix some issues with sampling #879

Open
wants to merge 5 commits into
base: master

Member

@ararslan commented Jul 5, 2023

This PR is separated into three distinct changes:

  1. Fixing #871 (sample fails to throw error for inconsistent lengths) and #877 (SegFault with efraimidis_a_wsample_norep!)
  2. NFC to use testsets in weighted sampling tests
  3. Fix weighted sampling with UnitWeights

See commit messages for additional info as applicable.

Fixes #871
Fixes #877

Currently sampling functions need to perform the same set of checks on
the inputs and those checks are copied and pasted for each method. We
can instead define a simple input validation function that can be used
by all sampling functions so that any additional corner cases that need
to be caught can be fixed in one place and propagated elsewhere.

Relatedly, this adds checks for agreement between the length of the
source array to be sampled and the array of weights (issue 871) as well
as that the destination array is not larger than the source when
sampling without replacement (issue 877).
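
A minimal sketch of what such a shared validation helper might look like. The name `toy_validate_sample_inputs`, its signature, and its error messages are assumptions made for illustration; this is not the `_validate_sample_inputs` added by the PR.

```julia
# Hypothetical helper: performs the two checks described above and returns
# the source length `n` and the destination length `k`.
function toy_validate_sample_inputs(a::AbstractArray, wv::AbstractVector,
                                    x::AbstractArray, replace::Bool)
    n = length(a)
    k = length(x)
    # The source and the weights must agree in length (the issue 871 case).
    length(wv) == n ||
        throw(DimensionMismatch("a and wv must be of same length (got $n and $(length(wv)))."))
    # Without replacement, at most `n` items can be drawn (the issue 877 case).
    replace || k <= n ||
        throw(DimensionMismatch("cannot draw $k samples from $n items without replacement."))
    return n, k
end

# Consistent inputs pass through and yield the two lengths.
n, k = toy_validate_sample_inputs([10, 20, 30], [0.2, 0.3, 0.5], zeros(Int, 2), false)
```

With a helper of this shape, each sampling method would start with a single call instead of its own copy of the checks.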

Not all `AbstractWeights` subtypes have a `values` field, e.g. `UnitWeights`,
but all have indexing defined, so indexing the weights object directly can be
used instead of trying to index into the underlying array.
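
The difference can be sketched with toy types. `ToyWeights`, `DenseToyWeights`, and `UnitToyWeights` below are hypothetical stand-ins defined only for this example; they are not the StatsBase types, and their field names are assumptions.

```julia
# Hypothetical stand-ins for weight vectors; NOT the StatsBase types.
abstract type ToyWeights <: AbstractVector{Float64} end

struct DenseToyWeights <: ToyWeights
    values::Vector{Float64}   # weights are stored explicitly
end

struct UnitToyWeights <: ToyWeights
    len::Int                  # every weight is 1, so no array is stored
end

Base.size(w::DenseToyWeights) = size(w.values)
Base.size(w::UnitToyWeights) = (w.len,)
Base.getindex(w::DenseToyWeights, i::Int) = w.values[i]
function Base.getindex(w::UnitToyWeights, i::Int)
    @boundscheck checkbounds(w, i)
    return 1.0
end

# Fragile: assumes every weights type has a `values` field.
total_via_field(w) = sum(w.values[i] for i in 1:length(w))

# Robust: relies only on the indexing interface.
total_via_index(w) = sum(w[i] for i in 1:length(w))

total_via_index(DenseToyWeights([0.5, 1.5]))   # works
total_via_index(UnitToyWeights(3))             # works as well
```

Calling `total_via_field(UnitToyWeights(3))` throws, because that type has no `values` field.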

src/sampling.jl Outdated
-    n = length(a)
-    length(wv) == n || throw(DimensionMismatch("a and wv must be of same length (got $n and $(length(wv)))."))
-    k = length(x)
+    n, k = _validate_sample_inputs(a, wv, x, false)
Member

Is this function really part of the official API and needs checks of the arguments? IIRC I had never intended it to be called by any user directly.

Member

And if users use the (IMO) intended sample API then the arguments are already checked I assume.

Member Author

I wouldn't have thought it was intended to be user-facing at all except, as pointed out in #876, it's included in the manual. 😕

Member

But it's not exported, is it?

Member Author

It's not, no. That said, there are implementations of three different Efraimidis-Spirakis algorithms (A, A-Res, and AExpJ), only one of which (AExpJ) is actually used internally by a function like sample. That suggests to me that these were intended to be usable outside the context of sample, but I could very well be mistaken as I don't know the history.

Member

Docs were added in #254.

Member Author

> It was actually summer 2016 and you reviewed it

LOL amazing. My brain runs the GC often so 7 years ago is long gone.

I definitely buy the argument that the separate, non-exported functions that each implement specific algorithms should not be considered user-facing and thus shouldn't need to perform the same kind of safety checks as those intended to be called directly. What gets me nervous is that there's nothing saying they aren't user-facing, hence issues like #876 and #877. Perhaps we could add admonitions to the docstrings, e.g.

!!! note
    This function is not intended to be called directly and is not considered
    part of the package's API.

?

A bit tangential to this discussion, but in the future we could treat sampling algorithms the way Base treats sorting algorithms: each algorithm gets a type that subtypes some abstract sampling algorithm type, and the user selects a particular algorithm via a keyword argument to sample, e.g. sample(x, wv; alg=EfraimidisAExpJ()). Internally that dispatches to e.g. efraimidis_aexpj_wsample_norep! (after doing any appropriate argument checking 😄).
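
A rough sketch of how that could look, under stated assumptions: every name below (`ToySamplingAlgorithm`, `ToyEfraimidisA`, `ToySequential`, `toy_sample`, `_toy_sample`) is hypothetical and not part of StatsBase, the inputs are assumed to be 1-based vectors with non-negative weights, and the algorithm bodies are simplified illustrations rather than the package's implementations.

```julia
using Random

# Hypothetical algorithm types; StatsBase does not define these.
abstract type ToySamplingAlgorithm end
struct ToyEfraimidisA <: ToySamplingAlgorithm end
struct ToySequential <: ToySamplingAlgorithm end

# User-facing entry point: validate once, then dispatch on `alg`.
function toy_sample(rng::AbstractRNG, a::AbstractVector, w::AbstractVector{<:Real},
                    k::Integer; alg::ToySamplingAlgorithm = ToyEfraimidisA())
    length(a) == length(w) ||
        throw(DimensionMismatch("a and w must be of same length."))
    0 <= k <= length(a) ||
        throw(DimensionMismatch("cannot draw $k of $(length(a)) items without replacement."))
    return _toy_sample(alg, rng, a, w, k)
end

# Efraimidis-Spirakis Algorithm A: give item i the key u_i^(1/w_i) with
# u_i uniform on [0, 1), then keep the k items with the largest keys.
function _toy_sample(::ToyEfraimidisA, rng, a, w, k)
    ks = [rand(rng)^(1 / wi) for wi in w]
    return a[sortperm(ks; rev = true)[1:k]]
end

# Sequential draws: pick one item with probability proportional to the
# remaining weights, zero out its weight, and repeat k times.
function _toy_sample(::ToySequential, rng, a, w, k)
    remaining = collect(Float64, w)
    out = similar(a, k)
    for j in 1:k
        c = cumsum(remaining)
        c[end] > 0 || throw(ArgumentError("too few items with positive weight."))
        i = findfirst(>(rand(rng) * c[end]), c)
        out[j] = a[i]
        remaining[i] = 0.0
    end
    return out
end

rng = MersenneTwister(42)
items = ["a", "b", "c", "d"]
weights = [0.1, 0.2, 0.3, 0.4]
toy_sample(rng, items, weights, 2)                         # default algorithm
toy_sample(rng, items, weights, 2; alg = ToySequential())  # explicit choice
```

Only `toy_sample` performs argument checks here; the `_toy_sample` methods assume validated inputs, which is one way to avoid checking twice.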

Member

Yeah, passing types via an alg keyword argument would be the best API.

Better perform checks anyway, except if this means we run checks twice when called from sample. Is that the case?

Member Author

> except if this means we run checks twice when called from sample. Is that the case?

Currently yes. I can add a flag to the internal checking function that makes it a no-op if called from sample but perhaps that's more complex than necessary.
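
One possible shape for such a flag, as a sketch only. All names (`toy_check`, `toy_internal!`, `toy_public!`, `already_checked`, `CHECKS_RUN`) are hypothetical, and the counter exists purely to make the effect observable; the placeholder body of `toy_internal!` is deterministic and is not a sampling algorithm.

```julia
# All names here are hypothetical; this only illustrates the "flag" idea.
const CHECKS_RUN = Ref(0)   # counts how often the checks actually execute

function toy_check(a, wv, x; already_checked::Bool = false)
    n, k = length(a), length(x)
    already_checked && return (n, k)   # no-op path for pre-validated inputs
    CHECKS_RUN[] += 1
    length(wv) == n || throw(DimensionMismatch("a and wv must be of same length."))
    k <= n || throw(DimensionMismatch("x must not be longer than a."))
    return (n, k)
end

# Stand-in for an internal, algorithm-specific function. The body is a
# deterministic placeholder (the k heaviest items), not a sampling algorithm.
function toy_internal!(a, wv, x; already_checked::Bool = false)
    n, k = toy_check(a, wv, x; already_checked = already_checked)
    x .= a[sortperm(wv; rev = true)[1:k]]
    return x
end

# Stand-in for the user-facing function: validate once, then tell the
# internal function that the inputs are already known to be consistent.
function toy_public!(a, wv, x)
    toy_check(a, wv, x)
    return toy_internal!(a, wv, x; already_checked = true)
end

x = zeros(Int, 2)
toy_public!([10, 20, 30], [0.2, 0.5, 0.3], x)   # checks run once, not twice
```

Calling `toy_internal!` directly still runs the checks, so the extra keyword only removes the duplicate work on the path through the public function.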

Member

How expensive are the checks? Is there a noticeable performance difference between calling sample and the internal function?

The alg keyword argument seems a reasonable suggestion for future refactorings.

For the time being I would prefer adding a warning or note to the docstrings of these internal functions. I think it was a mistake to add them to the docs at all (also based on the initial + follow-up PRs), so I would be fine even with just removing them from the docs. They're not exported and IMO have never been part of the official API (or at least they were not supposed to be).

- sample(rng::AbstractRNG, a::AbstractArray, wv::AbstractWeights) = a[sample(rng, wv)]
+ function sample(rng::AbstractRNG, a::AbstractArray, wv::AbstractWeights)
+     _validate_sample_inputs(a, wv)
+     return a[sample(rng, wv)]
Member

It's weird that this line isn't tested.

@ParadaCarleton
Contributor

What's holding up this PR?

@ararslan
Member Author

> What's holding up this PR?

The following, which I've not had time for:

  • Remember this PR exists
  • Determine whether executing the checks twice introduces noticeable overhead
  • If it does, don't run them twice
  • Remove some internal functions from the documentation
