Roadmap #1

Open
15 of 17 tasks
LukeMathWalker opened this issue Sep 16, 2018 · 39 comments
Labels
Enhancement (New feature or request), Good first issue (Good for newcomers), Help wanted (Extra attention is needed)

Comments

@LukeMathWalker
Member

LukeMathWalker commented Sep 16, 2018

In terms of functionality, the mid-term goal is to achieve feature parity with the statistics routines in NumPy and Julia's StatsBase.

For the next version:

  • Order statistics:
    • partialord version for quantiles methods;
  • Histograms:

For version 0.2.0:

For version 0.1.0:

@jturner314
Member

> With respect to mean, average, std: var is implemented in the main ndarray crate - would it make sense to port it here?

I think it makes sense for ndarray-stats to provide *_skipnan variants (or whatever you want to call them) of those methods. However, it would make sense to add std_axis to ndarray since ndarray already has var_axis.

For methods that are already in ndarray, we could duplicate these methods as a trait in ndarray-stats for people who want to write generic code (where the implementations just call the instance methods). I'm ambivalent on this.

> I'll slowly start working on this next week and then I'll get serious the week afterwards. Could you please give me commit/PR permissions to the repository @jturner314?

Okay, that sounds good. I've given you push access. Alternatively, if you'd like to have your repo be the main one instead of this one, that would be fine with me.

@LukeMathWalker
Member Author

Once #9 gets merged I think we are in a good position to officially release version 0.1.0 on crates.io - what do you think? @jturner314

@jturner314
Member

I agree.

By the way, I recently came across Julia's StatsBase.jl library. It's a good source of ideas in addition to NumPy/SciPy.

@LukeMathWalker
Member Author

Added a bunch of tests to #9 and merged 🎉 It feels like ages since I started to work on it 😅 Your contribution was extremely helpful to get it in the shape it is right now, thanks a lot @jturner314!

What do we need to do in order to release on crates.io?
I am going to open a small PR to add crate-level documentation - a couple of lines, nothing major.

@jturner314
Member

Yay! 🎉 That was a big job; great work.

> What do we need to do in order to release on crates.io?

Ideally, we'd eliminate the [patch.crates-io] section from Cargo.toml before we release on crates.io. (This might even be required, I'm not sure.) #11 removed the patch for noisy_float, but a new version of ndarray will need to be released for us to remove its patch. It would be nice to merge a couple more ndarray PRs before release; I'll take a look.

It would also be good to merge #12 and #13 before releasing.

@LukeMathWalker
Member Author

Merged #12 and #13 - looking around, it seems we can publish with a [patch.crates-io] section in Cargo.toml, but I agree it is much nicer to point to ndarray 0.12.1 as a dependency instead of a revision on master.

Let's wait for that release and then we are good to go.

@jturner314
Member

ndarray-stats 0.1.0 is now on crates.io. 🎉 Thanks for all your hard work @LukeMathWalker!

@LukeMathWalker
Member Author

LukeMathWalker commented Nov 21, 2018

💯 💯 I think it's safe to say it would have never got there without your help 😛
I'll drop a post on r/rust as well 👍

@LukeMathWalker
Member Author

I have drafted a tentative roadmap with the features I'd like to add in the next release - please edit it with your comments and suggestions @jturner314

@jturner314
Member

The roadmap looks good to me. I'm not familiar with the applications of higher order central moments (I'd usually use a histogram instead), but I don't mind adding them if people find them useful.

By the way, I invited you as an owner for the ndarray-stats crate, but I just realized that crates.io may not have sent the invitation if you haven't logged in before. Please let me know if you need me to re-send it.

@LukeMathWalker
Member Author

Somehow I didn't receive an email notification, but the invite was on my dashboard - accepted it!

The main objective in that area is getting kurtosis and skewness, and given the kind of computation required to achieve that it makes sense to also roll out higher order central moments I'd say :)

@phungleson
Contributor

Hey mate, argmin / argmax look simple enough to look into. Do you have any suggestions on where to start?

@jturner314
Member

Thanks for your interest! You'll want to add argmin and argmax methods to the QuantileExt trait and implement them. Please include documentation for the methods and some tests (in tests/quantile.rs).

I'd suggest starting with the existing implementation for min as a basis, but using .indexed_iter().fold() or .indexed_iter().try_fold() instead of .fold().
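As a rough illustration of the indexed fold suggested here, the sketch below uses a plain slice with `.iter().enumerate()` as a dependency-free stand-in for ndarray's `.indexed_iter()`. The free function `argmin` and its `Option<usize>` return type are assumptions made for the example, not the signature that ended up in `QuantileExt`.

```rust
use std::cmp::Ordering;

/// Index of the smallest element, or `None` if the slice is empty
/// or any comparison is undefined (e.g. a NaN among floats).
/// On ties, the first occurrence wins.
fn argmin<A: PartialOrd>(data: &[A]) -> Option<usize> {
    let mut iter = data.iter().enumerate();
    // An empty slice has no minimum.
    let first = iter.next()?;
    iter.try_fold(first, |(best_i, best), (i, x)| {
        // `partial_cmp` yields `None` for incomparable pairs; `?` aborts the fold.
        match x.partial_cmp(best)? {
            Ordering::Less => Some((i, x)),
            _ => Some((best_i, best)),
        }
    })
    .map(|(i, _)| i)
}
```

With an n-dimensional array the index would be a dimension-dependent pattern rather than a `usize`, but the shape of the fold stays the same.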

It would also be good to add argmin_skipnan and argmax_skipnan methods (analogous to min_skipnan and max_skipnan), but that's not necessary for the first PR.

Please feel free to ask if you have any questions.

@phungleson
Contributor

Hey mates, I have added argmin_skipnan and argmax_skipnan. I wonder why you use PartialOrd for min, but Ord for min_skipnan?

And what is meant by "partialord version for quantiles methods"?

@LukeMathWalker
Member Author

LukeMathWalker commented Mar 12, 2019

It's because we require the data type to be MaybeNan: it basically means that, apart from a subset of elements (e.g. NaN for floats), we are dealing with a data type that is totally ordered (all pairs of elements can be compared, Ord).

This reduces the failure scope:

  • min can return None if a comparison fails (as can happen with PartialOrd) or if there is no element in the array.
  • min_skipnan returns None if and only if the array has no non-NaN element (because no comparison will be undefined).

This can be useful when you are dealing with floats or arrays with potentially missing values (e.g. Option<A>, where A: Ord).

Re: quantiles - the current implementation requires A to implement Ord. We'd like to relax it to allow A to be PartialOrd instead of Ord.
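To make the difference in failure scope concrete, here is a minimal sketch over plain slices. The names `min_partial` and `min_skipnan` are hypothetical stand-ins, and `f64` is hard-coded in place of the `MaybeNan` trait, so this only mirrors the behaviour described above rather than the crate's actual code.

```rust
use std::cmp::Ordering;

/// `PartialOrd` version: `None` if the slice is empty OR a comparison is undefined.
fn min_partial<A: PartialOrd>(data: &[A]) -> Option<&A> {
    let mut iter = data.iter();
    let first = iter.next()?;
    iter.try_fold(first, |acc, x| match x.partial_cmp(acc)? {
        Ordering::Less => Some(x),
        _ => Some(acc),
    })
}

/// NaN-skipping version: once NaNs are filtered out, the remaining floats
/// are totally ordered, so `None` only means "no non-NaN element".
fn min_skipnan(data: &[f64]) -> Option<f64> {
    data.iter()
        .copied()
        .filter(|x| !x.is_nan())
        .fold(None, |acc: Option<f64>, x| Some(acc.map_or(x, |m| m.min(x))))
}
```

A single NaN makes `min_partial` return `None`, while `min_skipnan` ignores it and still reports the smallest remaining value.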

@phungleson
Contributor

Thanks @LukeMathWalker. On the last point: if we change A: Ord to A: PartialOrd and refactor the code and tests to allow that change, that would complete the task, right?

@LukeMathWalker
Member Author

LukeMathWalker commented Mar 16, 2019

Exactly! @phungleson
I'd suggest you wait until #26 is merged before tackling this task, otherwise you are in for some nasty merge conflicts 😛 I am almost there, I am just investigating some stack overflow errors in the revised version I have been writing.

@phungleson
Contributor

Cool, thanks @LukeMathWalker. So it seems like everything is more or less complete? Let me know if there are any doable features, cheers.

BTW, the merge method seems to be straightforward, but do you have any thoughts yet about the implementation?

@phungleson
Contributor

For merge, I had a quick read: so it's basically just adding the weights?

for h in others
  target.weights .+= h.weights
end

@LukeMathWalker
Member Author

Yes @phungleson, it basically boils down to summing together the weight matrices (plus or minus checking that their dimension/bins are compatible, I haven't looked into it). If you want to give it a try, please go ahead!
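A sketch of what "sum the weights after checking compatibility" could look like, using a hypothetical one-dimensional `Hist1d` type invented for this example. It is not `ndarray_stats::histogram::Histogram`, whose grid and counts are n-dimensional.

```rust
/// Hypothetical 1-D histogram: `edges.len() == counts.len() + 1`.
#[derive(Debug, Clone, PartialEq)]
struct Hist1d {
    edges: Vec<f64>,
    counts: Vec<u64>,
}

/// Returned when the two histograms do not share the same bins.
#[derive(Debug, PartialEq)]
struct IncompatibleBins;

impl Hist1d {
    /// Adds `other`'s counts into `self`, bin by bin.
    /// `self` is left untouched if the bins differ.
    fn merge(&mut self, other: &Hist1d) -> Result<(), IncompatibleBins> {
        if self.edges != other.edges || self.counts.len() != other.counts.len() {
            return Err(IncompatibleBins);
        }
        for (mine, theirs) in self.counts.iter_mut().zip(&other.counts) {
            *mine += *theirs;
        }
        Ok(())
    }
}
```

Returning a `Result` instead of panicking is one possible design choice; it lets the caller decide how to handle mismatched grids.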

@LukeMathWalker
Member Author

I'd like to close existing work streams and cut a release - what does your bandwidth look like @jturner314 to review open PRs?

@jturner314
Member

I've been meaning to look over the open PRs but haven't had a chance. I'll reserve time on Sunday to review them.

@LukeMathWalker
Member Author

It seems I managed to publish 0.2.0 without making a mess 💪
Thanks @jturner314 @phungleson and @munckymagik for all the work done on this release ❤️

I'd say we have made a major leap forward in terms of features - there are things that can be polished, the API design can be further improved and we can optimize the existing code, but ndarray-stats is definitely a viable solution right now 🚀

I'll clean up the parent post to move items that we didn't manage to include in this release to the roadmap for the next one. I am not sure what we should be covering next in terms of major new functionality 🤔

@munckymagik
Contributor

Well done all 👏

@jturner314
Member

Great job on 0.2.0 everyone!

> I am not sure what we should be covering next in terms of major new functionality

A couple of ideas from StatsBase.jl:

We could also add statistical models (e.g. linear regression), but that might be best put in a separate crate.

@phungleson
Contributor

Well done! cheers!

@munckymagik
Contributor

> A couple of ideas from StatsBase.jl:
>
>   • Deviation functions
>   • Weighted calculations (mean/std/etc.)

Unless any of you have made a start on these, I'd be interested in having a go at either, or contributing. I'll try to spend some time in the next couple of days looking at what is involved with the Deviation functions.

❓ Does anyone have any implementation suggestions other than just trying to port from StatsBase.jl?

If anyone wants to collaborate on the code then let me know.

@munckymagik
Contributor

Ok I made a start: #41

Any advice for choosing trait bounds for the A element types? Is it ok to use Copy or do we need to support any types that would be Clone?

@LukeMathWalker
Member Author

LukeMathWalker commented Apr 18, 2019

I'd say use Clone @munckymagik

@munckymagik
Contributor

@LukeMathWalker thanks. What led you to that decision? Is there a particular data type you've seen used in ndarrays that would need this? If so I'm thinking I might use it in the test fixtures to make sure all methods have the same bounds.

@LukeMathWalker
Member Author

I see it as a tradeoff between convenience and generality - I am not personally aware of any "popular" numerical type that is not Copy, but the cost of weakening it to Clone is so low that I see it as safe future-proofing @munckymagik
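For illustration, this is roughly what a deviation function looks like when the element type is only required to be `Clone`. The function name `sq_l2_dist`, the explicit `zero` argument and the use of slices are assumptions made to keep the sketch free of external crates; they are not taken from #41.

```rust
use std::ops::{Add, Mul, Sub};

/// Squared Euclidean distance between two equally long slices.
/// `A: Clone` (rather than `Copy`) means values must be cloned explicitly,
/// but it also admits element types that own heap data.
fn sq_l2_dist<A>(a: &[A], b: &[A], zero: A) -> A
where
    A: Clone + Add<Output = A> + Sub<Output = A> + Mul<Output = A>,
{
    a.iter().zip(b).fold(zero, |acc, (x, y)| {
        let diff = x.clone() - y.clone();
        acc + diff.clone() * diff
    })
}
```

For `Copy` types such as `f64` the clones compile down to plain copies, so the weaker bound costs little at runtime in the common case.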

@nilgoyette
Contributor

nilgoyette commented Sep 13, 2019

I wanted to code a simple weighted_mean for myself and then contribute it:

use ndarray::{ArrayBase, Data, Ix1};
use num_traits::Float;

// Note: this returns the weighted sum; it equals the weighted mean
// only when the weights already sum to 1.
pub fn weighted_mean<A, S>(data: &ArrayBase<S, Ix1>, weights: &[A]) -> A
where
    S: Data<Elem = A>,
    A: Float,
{
    data.iter().zip(weights).fold(A::zero(), |acc, (&d, &w)| acc + d * w)
}

but I realize that it's too simple. This code is only useful for 1D arrays, or flattened matrices/images, etc. I can change the Ix1 for a D: Dimension, so that we don't need to flatten anything. It's still a one-liner though and it doesn't offer any "axis" feature, like NumPy. I think we need 2 functions here, because they won't return the same type.

  1. n-d data with n-d weight, returns a number
  2. axis mode: n-d data with n-d weight, returns (n-1)-d array.

What did you guys have in mind?

@LukeMathWalker
Member Author

LukeMathWalker commented Sep 14, 2019

> I think we need 2 functions here, because they won't return the same type.
>
>   1. n-d data with n-d weight, returns a number
>   2. axis mode: n-d data with n-d weight, returns (n-1)-d array.
>
> What did you guys have in mind?

I think it makes perfect sense to have two functions. @nilgoyette
It is also consistent with the rest of the API: we have mean and mean_axis, var and var_axis, etc. 👍
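The two-function split can be sketched without ndarray as follows. Everything here is an assumption made for the example: the names, the use of `f64` slices and `Vec<Vec<f64>>` rows, one weight per row in the axis variant, and normalising by the sum of the weights (as NumPy's `average` does).

```rust
/// Whole-array mode: returns a single number.
/// `None` on length mismatch or when the weights sum to zero (including empty input).
fn weighted_mean(data: &[f64], weights: &[f64]) -> Option<f64> {
    if data.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total == 0.0 {
        return None;
    }
    let weighted_sum: f64 = data.iter().zip(weights).map(|(d, w)| d * w).sum();
    Some(weighted_sum / total)
}

/// Axis mode (axis 0 of a row-major 2-D table, one weight per row):
/// returns one value per column, i.e. a result with one dimension fewer.
fn weighted_mean_axis0(rows: &[Vec<f64>], weights: &[f64]) -> Option<Vec<f64>> {
    if rows.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total == 0.0 {
        return None;
    }
    let ncols = rows.first()?.len();
    if rows.iter().any(|r| r.len() != ncols) {
        return None; // ragged input
    }
    let mut out = vec![0.0; ncols];
    for (row, w) in rows.iter().zip(weights) {
        for (acc, x) in out.iter_mut().zip(row) {
            *acc += w * x;
        }
    }
    for acc in &mut out {
        *acc /= total;
    }
    Some(out)
}
```

The different return types (`f64` versus `Vec<f64>`) are the reason a single function cannot cover both modes.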

@aeroaks

aeroaks commented Nov 18, 2019

I was thinking of picking up the histogram merge method.
I am relatively new to Rust and ndarray. With this exercise, I want to pick up ndarray and Rust, and also start contributing to ndarray-* libraries.
What do you guys think? Or are there more high-level good first issues in other ndarray-* libraries?

@munckymagik
Contributor

@aeroaks I'd say go for it 💯

You could raise a draft PR if you get something working and want some early feedback.

@RolfStierle

I would like to implement something like scipy.stats.binned_statistic_dd based on ndarray_stats::histogram::Histogram, allowing the calculation of running means, variances, sums, and max/min values in each bin.
Would that be of interest?
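As a rough idea of the per-bin running statistic, here is a one-dimensional sketch that keeps an incremental mean for each bin. The function `binned_mean`, the half-open bins `[edge_i, edge_{i+1})` and the decision to drop out-of-range samples are assumptions for this example; it uses neither SciPy's exact edge rules nor the `Histogram` type.

```rust
/// Mean of `values` grouped by the bin that the matching entry of `xs` falls into.
/// Bins are half-open intervals `[edges[i], edges[i + 1])`; samples outside all
/// bins are ignored. Empty bins yield `None`.
fn binned_mean(xs: &[f64], values: &[f64], edges: &[f64]) -> Vec<Option<f64>> {
    let nbins = edges.len().saturating_sub(1);
    let mut counts = vec![0u64; nbins];
    let mut means = vec![0.0f64; nbins];
    for (&x, &v) in xs.iter().zip(values) {
        // Linear scan over bins; a binary search would suit large grids better.
        if let Some(i) = edges.windows(2).position(|w| w[0] <= x && x < w[1]) {
            counts[i] += 1;
            // Incremental (running) mean update.
            means[i] += (v - means[i]) / counts[i] as f64;
        }
    }
    counts
        .iter()
        .zip(means)
        .map(|(&c, m)| if c > 0 { Some(m) } else { None })
        .collect()
}
```

The same accumulate-per-bin loop extends to sums, minima and maxima by swapping the update step.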

@LukeMathWalker
Member Author

It would! @RolfStierle

@lebensterben
Contributor

According to https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges

Some of the bin-building strategies are not yet implemented in rust-ndarray:

  • doane
  • scott
  • stone
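As an example of one of the missing strategies, Scott's rule sets the bin width to h = σ · (24·√π / n)^(1/3), where σ is the standard deviation and n the sample count, per the NumPy documentation linked above. The sketch below is a hypothetical free function over an `f64` slice, not an implementation of the crate's bin-strategy types, and it assumes the population standard deviation (dividing by n).

```rust
use std::f64::consts::PI;

/// Bin width from Scott's rule, or `None` for empty input.
/// A width of 0.0 (all samples equal) must be handled by the caller.
fn scott_bin_width(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    // Population variance (divide by n, not n - 1).
    let variance = data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some((24.0 * PI.sqrt() / n).cbrt() * variance.sqrt())
}
```

Turning the width into bin edges still requires the data range, for instance by taking ceil((max - min) / h) bins.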

@humphreylee

Thanks very much for sharing the good work. Would it be possible to add univariate, bivariate and multivariate kernel density estimation functions? Thanks.


9 participants