Building a 2D ndarray from rows or columns #539
Do you have any specific examples where this functionality would have been useful? I'm not necessarily opposed to it; I'm trying to get a better understanding of the use case and whether there's a better alternative. |
Sure! One use case was in reading a CSV file into a 2D ndarray. The other use case, which I don't have public code for, was in implementing the forward-backward algorithm for a hidden Markov model. However, now that I think about it, I probably should use … there instead. So, this may be a bit of a niche use case. |
Yeah, in most cases, I would recommend using …. For the proposed method, I'm thinking of something like this:

```rust
// Place in src/impl_2d.rs
impl<A, S> ArrayBase<S, Ix2>
where
    S: Data<Elem = A>,
{
    pub fn from_rows<I>(rows: I) -> Result<Self, ShapeError>
    where
        I: IntoIterator,
        I::Item: IntoIterator<Item = A>,
        S: DataOwned,
    {
        unimplemented!()
    }
}
```

We'd need to implement …. I think this method is sufficiently useful to add. It would handle determining the shape of the array for you and checking that all of the rows are the same length.

If you don't mind me making a suggestion for the CSV method, I'd recommend this instead:

```rust
impl From<csv::Error> for ReadError {
    fn from(err: csv::Error) -> ReadError {
        ReadError::Csv(err)
    }
}

impl<'a, R: Read> Array2Reader for &'a mut Reader<R> {
    fn deserialize_array2<A: DeserializeOwned>(
        self,
        shape: (usize, usize),
    ) -> Result<Array2<A>, ReadError> {
        let (n_rows, n_columns) = shape;
        let mut data = Vec::with_capacity(n_rows * n_columns);
        for (row_index, row) in self.deserialize().enumerate() {
            let mut row: Vec<A> = row?;
            // Reject rows whose length doesn't match the expected width.
            if row.len() != n_columns {
                return Err(ReadError::NColumns {
                    at_row_index: row_index,
                    expected: n_columns,
                    actual: row.len(),
                });
            }
            data.append(&mut row);
        }
        let data_len = data.len();
        // A shape mismatch at this point can only mean a wrong row count.
        Array2::from_shape_vec(shape, data).map_err(|err| match err.kind() {
            ndarray::ErrorKind::IncompatibleShape => ReadError::NRows {
                expected: n_rows,
                actual: data_len / n_columns,
            },
            _ => unreachable!(),
        })
    }
}
```

just because it's easier for me to understand. (It took me a while to figure out what the ….) |
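For concreteness, here is a minimal sketch of how such a `from_rows` could be implemented as a free function, assuming the behavior described above (shape inferred from the input, uniform row lengths enforced); this is a hypothetical sketch, not ndarray's actual API:

```rust
use ndarray::{Array2, ErrorKind, ShapeError};

// Hypothetical sketch: build a 2-D array from an iterator of row iterators,
// inferring the shape and checking that every row has the same length.
fn from_rows<A, I>(rows: I) -> Result<Array2<A>, ShapeError>
where
    I: IntoIterator,
    I::Item: IntoIterator<Item = A>,
{
    let mut data = Vec::new();
    let mut n_rows = 0;
    let mut n_cols = None;
    for row in rows {
        let start = data.len();
        data.extend(row);
        let len = data.len() - start;
        match n_cols {
            None => n_cols = Some(len),
            Some(n) if n == len => {}
            Some(_) => return Err(ShapeError::from_kind(ErrorKind::IncompatibleShape)),
        }
        n_rows += 1;
    }
    Array2::from_shape_vec((n_rows, n_cols.unwrap_or(0)), data)
}
```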
Maybe someone wants to publish a crate that wraps the whole read-a-CSV-file-into-ndarray workflow? I know I've implemented the same thing, but some of the corner cases are tricky. (And most cases will ask for mixed data types, which ndarray doesn't really handle… let's find a good solution for data frames 😄) |
I think @paulkernfeld is working on this (…). Rows having unequal lengths and deserialization errors are the only corner cases that immediately come to mind for me. Are there any other corner cases to worry about?

IIRC, you were working on a data frame project a while ago. How did that go? |
@jturner314 I see, that makes sense. I guess my tricky one was guessing the type of each column. :) The data frame project was last Christmas break, and that was the only time I had time for such a project. I shouldn't do it :) unless I get a very long break soon. Some troubles in that project: "wrapping" ndarray arrays and offering the same owned/view interface, and supporting the NetCDF (more or less HDF5) type system, which leads to lots of enums: an enum to wrap each possible scalar type, a corresponding enum to again wrap an ndarray of each possible scalar type, and so on. There is also a Dataset type, and it instead uses type erasure of the underlying data arrays (each array has a uniform element type).
The project looked a bit like something that should stay in Python, especially since the user needs to supply a concrete type when they want to read the values of a specific variable in the Dataset. All of that is just even more unpleasant in generic code and utility methods on these types, IMO. It's all just a big WIP and doesn't do anything yet, but I can show some debugging output to illustrate how the data structures are constructed:

```rust
#[test]
fn test_simple() {
    let mut a = Array::zeros((3, 4));
    a[[1, 2]] = 1.;
    let da = DataArray::from(a).dim_names(vec!["x", "y"]).with_name("data");
    let mut ds = Dataset::from(da);
    let s = String::from;
    ds.attributes_mut().insert(s("title"), AttributeValue::from("test dataset"));
    println!("{:#?}", ds);
    println!("{:#?}", ds.variable::<f64>("data"));
}
```

has the following output: … |
Hmm... That is quite inconvenient. Based on your experience with the challenges of a dynamic approach (enums everywhere and type erasure), I wonder if a much more static approach would make sense (using a proc macro to define specific dataframe types). So the user could define a dataframe and implement the relevant methods/traits like this:

```rust
dataframe! {
    MyDataframe {
        attributes: {
            title: String,
        },
        coordinates: {x: i64, y: i64, time: f64},
        variables: {
            temperature (x, y, time): f64,
            elevation (x, y): f64,
            solar_activity (time): f64,
            culiv_policy (time): String,
        },
    }
}
```

There would be methods on …
That's understandable. :) |
That's a good idea. This pattern seems familiar in Rust… maybe it's the way we have to do it. |
Give me dataframes in Rust and I can slowly start to use it for work ❤️ |
I've been working on a Rust data reader equivalent to NumPy's `loadtxt`. You can find the current framework here: https://github.com/rcarson3/rust_data_reader. It supports the various Rust primitive types. The data is output in a struct that contains all of the data in a vector, and the number of lines and columns read is also provided. Therefore, it should be pretty easy to write a simple wrapper to import the data into an ndarray type using either …. Edit: I should mention that, contrary to the Readme, it works pretty well currently. I do need to fix a few small edge cases related to how it counts the number of lines and the number of lines that are comments. However, I'd say it's pretty usable as long as you trust how the data was output and the data is one of the primitive types. |
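For example, such a wrapper could look roughly like this (a sketch; the `data`, `n_lines`, and `n_cols` parameters stand in for whatever the reader's output struct actually provides and are assumptions, not its real API):

```rust
use ndarray::Array2;

// Hypothetical glue code: rebuild a 2-D array from a flat data vector plus
// the line/column counts reported by the reader.
fn to_array2(data: Vec<f64>, n_lines: usize, n_cols: usize) -> Option<Array2<f64>> {
    Array2::from_shape_vec((n_lines, n_cols), data).ok()
}
```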
@bluss @jturner314 @LukeMathWalker I just remembered the DataFusion project that Andy Grove was working on, which had some dataframe stuff in it. It led me to a couple of posts: Rust Dataframes, Andy Grove's post on Apache Arrow, and a reddit post on dataframes. It seems that as of 6 months ago, enums and some usage of traits were the main attempts at creating a dataframe library. It also seems that the Apache Arrow project now has a native Rust implementation. It appears that Arrow might be a good library/data format to use as the underlying data structure for a dataframe library. I'm not sure whether any of these are the best approach for a pure Rust implementation of a dataframe library, but it's always nice to see what's already been done in the field. |
How can I do this in ndarray?

```rust
// Pseudocode for the desired API (`check_row_has_no_NA` is a placeholder):
fn clean_NA(df: Array2<f64>) -> Array2<f64> {
    let df = df.outer_iter().filter(|row| check_row_has_no_NA(row)).collect();
    // or, in parallel:
    let df = df.into_outer_iter_par().filter(|row| check_row_has_no_NA(row)).collect();
    df
}
```

Using Zip requires knowing the filtered number of rows in advance. |
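For `f64` specifically, one workaround that exists today is to collect the indices of the rows to keep and use `select` (a sketch; it assumes NA is encoded as NaN, which is not stated above):

```rust
use ndarray::{Array2, Axis};

// Sketch: drop rows containing NaN by selecting the passing indices.
fn clean_na(df: &Array2<f64>) -> Array2<f64> {
    let keep: Vec<usize> = df
        .outer_iter()
        .enumerate()
        .filter(|(_, row)| row.iter().all(|v| !v.is_nan()))
        .map(|(i, _)| i)
        .collect();
    df.select(Axis(0), &keep)
}
```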
```rust
// The input type is
// ndarray::ArrayBase<ndarray::data_repr::OwnedRepr<alloc::string::String>, ndarray::dimension::dim::Dim<[usize; 2]>>
// clean up before parsing items to f64
fn clean(df: &Array2<String>) -> Array2<String> {
    let mut out = Array2::<String>::default((0, df.shape()[1]));
    for row in df.outer_iter() {
        if row[5] != "NA" {
            ndarray::stack![Axis(0), out, row.insert_axis(Axis(0)).clone()];
        }
    }
    out
}
```

fails with:

```text
ndarray::stack![Axis(0), out, row.insert_axis(Axis(0)).clone()];
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `std::marker::Copy` is not implemented for `std::string::String`
the trait bound `std::string::String: std::marker::Copy` is not satisfied
```

and

```rust
fn clean(df: &Array2<String>) -> Array2<String> {
    let indices: Vec<usize> = df
        .outer_iter()
        .enumerate()
        .filter(|(_idx, row)| row[5] != "NA")
        .map(|(idx, _)| idx)
        .collect();
    let out = df.select(Axis(0), &indices[..]);
    out
}
```

fails with:

```text
let out = df.select(Axis(0), &indices[..]);
             ^^^^^^ the trait `std::marker::Copy` is not implemented for `std::string::String`
the trait bound `std::string::String: std::marker::Copy` is not satisfied
```
|
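One way around the `Copy` bound, as a sketch, is to clone the kept rows into a flat `Vec` and rebuild the array with `from_shape_vec` (newer ndarray versions also relax `select` to `A: Clone`, which should make the second attempt above compile as-is):

```rust
use ndarray::Array2;

// Sketch: sidestep the Copy bound by cloning kept rows into a flat Vec,
// then reshaping; assumes column 5 holds the "NA" marker as in the code above.
fn clean(df: &Array2<String>) -> Array2<String> {
    let n_cols = df.shape()[1];
    let data: Vec<String> = df
        .outer_iter()
        .filter(|row| row[5] != "NA")
        .flat_map(|row| row.to_vec())
        .collect();
    let n_rows = data.len() / n_cols;
    Array2::from_shape_vec((n_rows, n_cols), data).expect("rows have uniform length")
}
```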
I came across a use case where I wanted to build a 2D ndarray after filtering some rows. Intuitively, I wanted to do something like:

```rust
let filtered: Array<usize, Ix2> = table
    .genrows()
    .into_iter()
    .filter(|r| r.iter().all(|&v| v > 0))
    .collect();
```
|
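That `collect` doesn't exist, but the same effect can be had by collecting the passing row views and stacking them (a sketch assuming ndarray 0.15+, where `stack` joins views along a new axis and `genrows` has been renamed `rows`):

```rust
use ndarray::{stack, Array2, Axis};

// Sketch: keep the rows whose entries are all positive, then stack the
// surviving views into a new 2-D array (errors if no row passes).
fn filter_rows(table: &Array2<usize>) -> Array2<usize> {
    let kept: Vec<_> = table
        .rows()
        .into_iter()
        .filter(|r| r.iter().all(|&v| v > 0))
        .collect();
    stack(Axis(0), &kept).expect("at least one row passed the filter")
}
```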
I have a similar issue. I want to convert a JSON file that is a set of nested vectors into an ArrayD, similar to how you can pass a JSON to …. I ended up building a custom function that iterates over the JSON to detect the shape and then builds the array. It's not super efficient, though. |
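A rough sketch of that shape-detection approach, assuming rectangular nested arrays of numbers parsed with serde_json (this is a reconstruction, not the commenter's actual code):

```rust
use ndarray::{ArrayD, IxDyn};
use serde_json::Value;

// Sketch: walk the first element at each nesting level to infer the shape,
// then flatten depth-first and reshape. Assumes rectangular numeric input.
fn json_to_arrayd(v: &Value) -> Option<ArrayD<f64>> {
    let mut shape = Vec::new();
    let mut cur = v;
    while let Value::Array(a) = cur {
        shape.push(a.len());
        cur = a.first()?;
    }
    fn flatten(v: &Value, out: &mut Vec<f64>) -> Option<()> {
        match v {
            Value::Array(a) => a.iter().try_for_each(|x| flatten(x, out)),
            _ => {
                out.push(v.as_f64()?);
                Some(())
            }
        }
    }
    let mut data = Vec::new();
    flatten(v, &mut data)?;
    ArrayD::from_shape_vec(IxDyn(&shape), data).ok()
}
```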
@nemosupremo That makes sense to me - there's a lot we can do. |
There has been a lot of topic drift on this thread since this comment, but I want to refocus on the original question.

The best way I could think of is to collect into …. |
https://docs.rs/ndarray/latest/ndarray/type.Array2.html#method.from

```rust
let arr: Array2<f64> = iter
    .map(|row| [row.a, row.b, row.c])
    .collect::<Vec<_>>()
    .into();
```

Collect into a `Vec` of fixed-size arrays, then convert with `.into()` via the `From` impl linked above. |
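A self-contained example of that conversion (requires a recent ndarray with the `From<Vec<[A; N]>>` impl from the link above):

```rust
use ndarray::Array2;

// Rows collected as fixed-size arrays convert directly into an Array2.
fn main() {
    let rows = vec![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
    let arr: Array2<f64> = rows.into();
    assert_eq!(arr.shape(), &[2, 3]);
}
```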
A couple different times I have wanted to construct a 2D array from an iterator of 1D arrays. Is this something that should be added to ndarray?

Roughly, I'm thinking of a function

```rust
from_rows<I: Iterator<Item = Array1>>(rows: I) -> Array2
```

and an analogous `from_columns`. One design question that I'm not currently sure about is whether the user should have to specify any information about the dimensions. I am leaning towards "no," especially because it may be difficult to know the size of an iterator beforehand.

Possibly, it would make more sense to collect my iterator and use `stack` rather than adding new functionality. However, if this sounds useful, I would be happy to take a whack at implementing it.
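For reference, the "collect and use `stack`" workaround looks roughly like this with current ndarray (0.15+ `stack` semantics, where stacking 1-D views along `Axis(0)` produces a 2-D array):

```rust
use ndarray::{stack, Array1, Axis};

// Collect the 1-D rows, take views, and stack them along a new axis.
fn main() {
    let rows: Vec<Array1<f64>> = vec![Array1::zeros(3), Array1::ones(3)];
    let views: Vec<_> = rows.iter().map(|r| r.view()).collect();
    let arr = stack(Axis(0), &views).unwrap();
    assert_eq!(arr.shape(), &[2, 3]);
}
```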