alternative methods of collecting an axis_iter to ndarray matrix #249

Closed
kernelmachine opened this issue Dec 14, 2016 · 16 comments

@kernelmachine

kernelmachine commented Dec 14, 2016

I'm working on a dataframe implementation that provides two-dimensional iterator adaptors over ndarray matrices.

pub struct DataFrame {
    pub columns: Vec<OuterType>,
    pub data: Matrix<InnerType>,
    pub index: Vec<OuterType>,
}

The dataframe's elements are of an enum type called InnerType, which lets the dataframe hold a variety of types, like dataframes in other languages:

pub enum InnerType {
    Float(f64),
    Int64(i64),
    Int32(i32),
    Str(String),
    Empty,
}

The iterator adaptors implement Iterator<Item = (OuterType, ArrayView<'a, InnerType, usize>)>.

Notice the InnerType::Str(String). Because of this value, InnerType is not Copy, and I'm unable to collect the adaptors' items into a DataFrame via stack. Can you help me think of another way to collect the iterator adaptor into an ndarray matrix, without needing Copy, so I can support Strings in the dataframe? This problem may also affect implementing something like FromCSV, which would go from a CSV reader iterator to a DataFrame.

If you want to check out the project further, you can do so here: https://github.com/pegasos1/rust-dataframe

@bluss
Member

bluss commented Dec 14, 2016

stack could probably support it; it's just a lot more work to handle non-Copy elements (Copy means no destructor and no ownership semantics to worry about). The unknown factor is whether the new implementation would cause a perf loss.

@kernelmachine
Author

IIRC stack preallocates and then uses assign. What methods can we use for a move?

@kernelmachine
Author

Also yes, I think it warrants testing, but I'm not sure if the original function should be reimplemented. Any perf loss by using non-Copy is probably only worth it for specific use cases like mine. In that case, maybe we should have a totally new non-Copy stack function.

@bluss
Member

bluss commented Dec 14, 2016

Extending an array is not cheap in general (due to the flexible memory layout), so for a new stack I would consider writing the operation using Vec::extend and only turning the Vec into an array at the end of the operation.
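
A minimal sketch of the Vec::extend idea above, using only the standard library. It assumes the rows are available as owned Vec<InnerType> values (the adaptors in the issue yield borrowed ArrayViews, which would need cloning instead); collect_rows and the derives on InnerType are hypothetical and not part of ndarray or the dataframe crate.

```rust
/// Stand-in for the issue's `InnerType`; it is `Clone` but not `Copy`.
/// The derives are an assumption added for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum InnerType {
    Float(f64),
    Int64(i64),
    Int32(i32),
    Str(String),
    Empty,
}

/// Hypothetical helper: collects owned rows into one flat, row-major buffer
/// and returns it together with its `(rows, cols)` shape.
/// Returns `None` if the rows do not all have the same length.
pub fn collect_rows<I>(rows: I) -> Option<(Vec<InnerType>, (usize, usize))>
where
    I: IntoIterator<Item = Vec<InnerType>>,
{
    let mut flat = Vec::new();
    let mut n_rows = 0;
    let mut n_cols: Option<usize> = None;

    for row in rows {
        // The first row fixes the width; every later row must match it.
        let width = *n_cols.get_or_insert(row.len());
        if width != row.len() {
            return None;
        }
        // `extend` moves the elements out of `row`; nothing is copied or cloned.
        flat.extend(row);
        n_rows += 1;
    }

    Some((flat, (n_rows, n_cols.unwrap_or(0))))
}
```

The flat buffer and shape could then be handed to ndarray's Array2::from_shape_vec((rows, cols), flat), which takes ownership of the Vec, so the elements are moved exactly once and the array is only built at the end.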

@bluss
Member

bluss commented Dec 14, 2016

I'd really recommend using the native types as the array element type, i.e. use an array of f64, not an array of the item enum. That lifts the enum up so that it wraps the array instead of each element. It's probably not as neat to write, but it is a whole lot more efficient for numerical operations.
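
One way to picture this suggestion is the sketch below, where the enum wraps a whole homogeneous column rather than each element. The Column type and its methods are hypothetical, and Vec<T> stands in for a one-dimensional ndarray array so that the example needs only the standard library.

```rust
/// Hypothetical column type: the enum sits around the storage,
/// so each variant holds a contiguous buffer of one native type.
#[derive(Clone, Debug, PartialEq)]
pub enum Column {
    Float(Vec<f64>),
    Int64(Vec<i64>),
    Str(Vec<String>),
}

impl Column {
    /// Number of elements in the column, whatever its type.
    pub fn len(&self) -> usize {
        match self {
            Column::Float(v) => v.len(),
            Column::Int64(v) => v.len(),
            Column::Str(v) => v.len(),
        }
    }

    /// Sum of a numeric column as `f64`; `None` for string columns.
    /// The type is matched once per column, not once per element.
    pub fn sum(&self) -> Option<f64> {
        match self {
            Column::Float(v) => Some(v.iter().sum()),
            Column::Int64(v) => Some(v.iter().map(|&x| x as f64).sum()),
            Column::Str(_) => None,
        }
    }
}
```

With this layout the numeric variants are plain f64 or i64 buffers, which are Copy, so only string columns need the non-Copy handling discussed in this issue. The trade-off is that every column is fixed to a single type.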

@kernelmachine
Author

kernelmachine commented Dec 15, 2016 via email

@bluss
Member

bluss commented Dec 15, 2016

Oh, I didn't think about that, sorry. I've been thinking that a data frame would fix each "column" to a particular type.

@kernelmachine
Author

kernelmachine commented Dec 15, 2016 via email

@bluss
Member

bluss commented Jan 24, 2017

Parallelizing .map() -> Array (as "par_map", not yet merged) touched on this kind of thing.

@bluss
Member

bluss commented Jan 29, 2017

Have you seen this? It might be a good resource for ideas:

http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/

@SuperFluffy
Contributor

In the discussion on reddit about the Utah dataframe library for Rust, the Pandas 2.0 rewrite was mentioned. It looks like Pandas is suffering from performance issues because of the way they chose to implement their original data structures.

Maybe good to keep in mind. :-)

@bluss
Member

bluss commented Jan 29, 2017

That is indeed the same library that @pegasos1 started this issue with. I hadn't seen the reddit post though, so thanks for the link.

@kernelmachine
Author

Yeah, both utah and pandas suffer from copies. However, in utah the copies are only necessary because the iterator needs to own the elements it returns. I think building the dataframe around streaming iterators would solve the issue, but I haven't gotten around to looking into it yet. @bluss I've seen traces of your thoughts on streaming iterators on various forums.

@bluss
Member

bluss commented Jan 30, 2017

@bluss
Member

bluss commented Apr 2, 2017

Is this still an issue? We need to formulate the solution here.

@bluss
Member

bluss commented May 13, 2021

stack and append support Clone elements as of ndarray 0.15.2; see #932.

bluss closed this as completed May 13, 2021