WIP: possible solution to allow bootstrapping from an arbitrary number of distributions #58

Closed
wants to merge 1 commit into from

harryscholes

This is one possible way to implement #51. Essentially, we redefine the bootstrap method that contains the main logic so that it accepts a tuple, and then define a separate method that wraps any non-tuple input in a tuple. Currently this is a prototype and is only implemented for BasicSampling. This allows the following:

julia> using Bootstrap, Statistics

julia> xs = rand(1000);

julia> ys = rand(1000) .+ .5;

julia> bootstrap((x,y)->mean(x)-mean(y), (xs, ys), BasicSampling(1000))
Bootstrap Sampling
  Estimates:
    │ Var │ Estimate  │ Bias        │ StdError  │
    │     │ Float64   │ Float64     │ Float64   │
    ├─────┼───────────┼─────────────┼───────────┤
    │ 1   │ -0.507899 │ -7.02681e-5 │ 0.0124963 │
  Sampling: BasicSampling
  Samples:  1000
  Data:  

Would be interested in your thoughts!

@juliangehring
Owner

juliangehring commented Jul 30, 2019

Thanks for taking a shot at implementing this.
Redefining the bootstrap method to take a data tuple seems like a big change to make just to support the multi-distribution case. I guess that most users still use a single dataset and would have to pass a tuple with one element.
Wouldn't it be easier to have a separate bootstrap method that accepts a Vector of Distributions as data? That would make for a more concrete and simpler dispatch, without breaking the existing methods.

@harryscholes
Author

harryscholes commented Jul 30, 2019

Thanks! Yes, I suppose it might be easier if we define separate methods that operate on a vector of multiple distributions. I was trying to come up with an implementation that didn't produce too much (approximately) duplicated code. Are you suggesting we do something more like this:

# current implementation, remains as is
function bootstrap(statistic::Function, data, sampling::BasicSampling)
    # ...
end

# new implementation that bootstraps from multiple distributions
function bootstrap(statistic::Function,
                   data::AbstractVector{<:AbstractVector{T}},
                   sampling::BasicSampling) where T
    t0 = tx(statistic(data...))
    m = nrun(sampling)
    t1 = zeros_tuple(t0, m)
    data1 = copy.(data)  # copy each inner vector; copy(data) would alias them with the originals
    for i in 1:m
        draw!.(data, data1)
        for (j, t) in enumerate(tx(statistic(data1...)))
            t1[j][i] = t
        end
    end
    return NonParametricBootstrapSample(t0, t1, statistic, data, sampling)
end

@juliangehring
Owner

Please ignore what I said about the Vectors earlier - I misunderstood what you were trying to achieve and my comment wasn't a suitable solution.
I really like the idea of sampling from multiple data sets, and your proposed solution with the Tuple looks very clean. It would be nice to do the following:

function bootstrap(statistic::Function, data, sampling::BasicSampling)
    # same as before
end

function bootstrap(statistic::Function, data::Tuple, sampling::BasicSampling)
    # new implementation that bootstraps from multiple distributions
    # where `data::Tuple` represents different data sets
    # e.g. (rand(10), rand(100))
end

What do you think?

@harryscholes
Author

Ahh I see, no problem about the misunderstanding. So the new method that would implement bootstrapping from multiple distributions would be something like:

function bootstrap(statistic::Function, data::Tuple, sampling::BasicSampling)
    t0 = tx(statistic(data...))
    m = nrun(sampling)
    t1 = zeros_tuple(t0, m)
    data1 = copy.(data)
    for i in 1:m
        draw!.(data, data1)
        for (j, t) in enumerate(tx(statistic(data1...)))
            t1[j][i] = t
        end
    end
    return NonParametricBootstrapSample(t0, t1, statistic, data, sampling)
end

If so, I will have a go at implementing this for the other sampling strategies.
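
For anyone who wants to try the idea outside the package, here is a minimal, self-contained sketch of the same resampling loop. It assumes nothing from Bootstrap.jl: `bootstrap_multi` and its `nruns` and `rng` arguments are hypothetical names used only for illustration, and it returns a plain named tuple rather than a NonParametricBootstrapSample.

```julia
using Random, Statistics

# Hypothetical stand-in for the proposed method; not part of Bootstrap.jl.
function bootstrap_multi(statistic::Function, data::Tuple, nruns::Integer;
                         rng::AbstractRNG = Random.default_rng())
    t0 = statistic(data...)                 # estimate on the original datasets
    t1 = Vector{typeof(t0)}(undef, nruns)   # one estimate per bootstrap run
    resampled = copy.(data)                 # separate buffers, one per dataset
    for i in 1:nruns
        # Resample each dataset with replacement, independently of the others.
        for (x, buf) in zip(data, resampled)
            rand!(rng, buf, x)
        end
        t1[i] = statistic(resampled...)
    end
    return (t0 = t0, t1 = t1, bias = mean(t1) - t0, se = std(t1))
end

# Usage: difference in means between two datasets of different sizes.
rng = MersenneTwister(1)
xs = rand(rng, 1000)
ys = rand(rng, 500) .+ 0.5
res = bootstrap_multi((x, y) -> mean(x) - mean(y), (xs, ys), 1000; rng = rng)
```

Because each dataset gets its own buffer, the datasets may have different lengths, and the originals are never modified.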

@juliangehring
Owner

What if we used your bootstrap(statistic::Function, data::Tuple, sampling::BasicSampling) function and then define the current "standard" bootstrap function as

bootstrap(statistic::Function, data, sampling::BasicSampling) = bootstrap(statistic, tuple(data), sampling)

Wouldn't that avoid most of the code duplication? It would affect how data is stored in the BootstrapSample struct, but we could then cover both scenarios together.
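
The dispatch pattern discussed here can be sketched in isolation. `apply_stat` is a hypothetical name that stands in for bootstrap; it only shows how the untyped method forwards to the Tuple method without ambiguity.

```julia
using Statistics

# Hypothetical function; it only illustrates the dispatch pattern.
# Main logic: always receives a tuple of datasets and splats it.
apply_stat(statistic::Function, data::Tuple) = statistic(data...)

# Fallback for a single dataset: wrap it in a 1-tuple and forward.
apply_stat(statistic::Function, data) = apply_stat(statistic, tuple(data))

# A single dataset goes through the fallback.
single = apply_stat(mean, [1.0, 2.0, 3.0])

# A tuple of datasets hits the main method directly.
paired = apply_stat((x, y) -> mean(x) - mean(y), ([1.0, 2.0], [3.0, 4.0]))
```

Julia picks the `data::Tuple` method whenever a tuple is passed because it is more specific than the untyped one. One consequence of this design is that a single dataset that happens to be stored as a tuple would be interpreted as several datasets.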

@harryscholes
Author

I think this is exactly how I implemented it in the commit on this PR. You're right that it does avoid code duplication. But it will require a bit of tinkering with existing code, e.g. data_summary(::Tuple).

@moberer

moberer commented Jan 25, 2021

Is there still any interest in getting this feature working?


3 participants