Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document bias and entropy-exhausted behavior #184

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Commits on Jul 8, 2024

  1. Document bias and behavior when running out of entropy

    `choose` and `choose_iter` incorrectly claimed to return `Error::NotEnoughData`
    when they in fact default to the first choice. This also documents
    that default in various other APIs.
    
    Additionally, `int_in_range` (and APIs that rely on it) has bias
    for non-power-of-two ranges.
    `u.int_in_range(0..=170)` for example will consume one byte of entropy,
    and take its value modulo 171 (the size of the range)
    to generate the returned integer.
    As a result, values in `0..=84` (the first ~half of the range)
    are twice as likely to get chosen as the rest
    (assuming the underlying bytes are uniform).
    In general, the result distribution is only uniform if the range size
    is a power of two (where the modulo just masks some bits).
    
    It would be accurate to document that return values
    are biased towards lower values when the range size is not a power of two,
    but do we want this much detail in the documented “contract” of this method?
    
    Similarly, I just called `ratio` “approximate”. `u.ratio(5, 7)` returns true
    for 184 out of 256 possible underlying byte values, ~0.6% too often.
    In the worst case, `u.ratio(84, 170)` return true ~33% too often.
    
    Notably, `#[derive(Arbitrary)]` chooses enum variants not with `choose_index`
    (although that seems most appropriate from reading `Unstructured` docs)
    but by always consuming 4 bytes of entropy:
    
    ```rust
    // Use a multiply + shift to generate a ranged random number
    // with slight bias. For details, see:
    // https://lemire.me/blog/2016/06/30/fast-random-shuffling
    Ok(match (u64::from(<u32 as arbitrary::Arbitrary>::arbitrary(u)?) * #count) >> 32 {
        #(#variants,)*
        _ => unreachable!()
    })
    ```
    
    `int_in_range` tries to minimize consumption based on the range size
    but that contributes to having more bias than multiply + shift.
    Is this a real trade-off worth having two methods?
    SimonSapin committed Jul 8, 2024
    Configuration menu
    Copy the full SHA
    b8e9c06 View commit details
    Browse the repository at this point in the history