Building a 2D ndarray from rows or columns #539

Open · paulkernfeld opened this issue Nov 13, 2018 · 18 comments

@paulkernfeld

A couple of times now, I have wanted to construct a 2D array from an iterator of 1D arrays. Is this something that should be added to ndarray?

Roughly, I'm thinking of a function from_rows<I: Iterator<Item=Array1>>(rows: I) -> Array2 and an analogous from_columns.

One design question that I'm not currently sure about is whether the user should have to specify any information about the dimensions. I am leaning towards "no," especially because it may be difficult to know the size of an iterator beforehand.

Possibly, it would make more sense to collect my iterator and use stack rather than adding new functionality. However, if this sounds useful, I would be happy to take a whack at implementing it.
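
(For reference, the "collect and stack" workaround might look roughly like this minimal sketch; the function name is illustrative, and it assumes an ndarray version where stack joins views along a new axis:)

use ndarray::{stack, Array1, Array2, Axis};

// Sketch of the "collect and stack" workaround: gather views of the
// rows, then stack them along a new first axis.
fn rows_to_array2(rows: &[Array1<f64>]) -> Array2<f64> {
    let views: Vec<_> = rows.iter().map(|r| r.view()).collect();
    stack(Axis(0), &views).expect("non-empty input with equal-length rows")
}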

@jturner314
Member

Do you have any specific examples where this functionality would have been useful? I'm not necessarily opposed to it; I'm trying to get a better understanding of the use-case and if there's a better alternative.

@paulkernfeld
Author

Sure!

One use case was in reading a CSV file into a 2D ndarray.

The other use case, which I don't have public code for, was implementing the forward-backward algorithm for a hidden Markov model. However, now that I think about it, I should probably use Zip and genrows to accomplish this.

So, this may be a bit of a niche use case.

@jturner314
Member

Yeah, in most cases I would recommend using Zip/azip if that makes sense. For the CSV case, you're not actually building the 2D array from an iterator of Array1<A> items; you're building it from an iterator of Result<Array1<A>, _> items, right? So I don't think from_rows would work for that case. If we do add from_rows, I'd actually suggest a method signature like this instead:

// Place in src/impl_2d.rs
impl<A, S> ArrayBase<S, Ix2>
where
    S: Data<Elem = A>,
{
    pub fn from_rows<I>(rows: I) -> Result<Self, ShapeError>
    where
        I: IntoIterator,
        I::Item: IntoIterator<Item = A>,
        S: DataOwned,
    {
        unimplemented!()
    }
}

We'd need to implement IntoIterator for Array1<A>; then this modified from_rows would work for iterators with Vec<A> items as well as Array1<A> items.

I think this method is sufficiently useful to add. It would handle determining the shape of the array for you and checking that all of the rows are the same length.
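
(For illustration, the shape inference and row-length check could look roughly like this free-function sketch; it is not the proposed method itself, and the choice of ErrorKind::IncompatibleShape for ragged input is an assumption:)

use ndarray::{Array2, ErrorKind, ShapeError};

// Sketch only: infer (n_rows, n_cols) while flattening the rows into
// one Vec, rejecting any row whose length differs from the first row's.
fn from_rows<A, I>(rows: I) -> Result<Array2<A>, ShapeError>
where
    I: IntoIterator,
    I::Item: IntoIterator<Item = A>,
{
    let mut data = Vec::new();
    let mut n_rows = 0;
    let mut n_cols = None;
    for row in rows {
        let start = data.len();
        data.extend(row);
        let len = data.len() - start;
        match n_cols {
            None => n_cols = Some(len),
            Some(n) if n != len => {
                return Err(ShapeError::from_kind(ErrorKind::IncompatibleShape));
            }
            _ => {}
        }
        n_rows += 1;
    }
    Array2::from_shape_vec((n_rows, n_cols.unwrap_or(0)), data)
}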

If you don't mind me making a suggestion for the CSV method, I'd recommend this instead:

impl From<csv::Error> for ReadError {
    fn from(err: csv::Error) -> ReadError {
        ReadError::Csv(err)
    }
}

impl<'a, R: Read> Array2Reader for &'a mut Reader<R> {
    fn deserialize_array2<A: DeserializeOwned>(
        self,
        shape: (usize, usize),
    ) -> Result<Array2<A>, ReadError> {
        let (n_rows, n_columns) = shape;
        let mut data = Vec::with_capacity(n_rows * n_columns);
        for (row_index, row) in self.deserialize().enumerate() {
            let mut row: Vec<A> = row?;
            if row.len() != n_columns {
                return Err(ReadError::NColumns {
                    at_row_index: row_index,
                    expected: n_columns,
                    actual: row.len(),
                });
            }
            data.append(&mut row);
        }
        let data_len = data.len();
        Array2::from_shape_vec(shape, data).map_err(|err| match err.kind() {
            ndarray::ErrorKind::IncompatibleShape => ReadError::NRows {
                expected: n_rows,
                actual: data_len / n_columns,
            },
            _ => unreachable!(),
        })
    }
}

just because it's easier for me to understand. (It took me a while to figure out what the Either and once were doing and why the row was being mapped with Ok in the original implementation.)

@bluss
Member

bluss commented Nov 13, 2018

Maybe someone wants to publish a crate that handles the whole read-a-CSV-file-into-ndarray workflow? I know I've implemented the same thing, but some of the corner cases are tricky. (And most cases will ask for mixed data types, which ndarray doesn't really handle... let's find a good solution for data frames 😄)

@jturner314
Member

Maybe someone wants to publish a crate that handles the whole read-a-CSV-file-into-ndarray workflow?

I think @paulkernfeld is working on this. (ndarray-csv crate)

I know I've implemented the same thing, but some of the corner cases are tricky.

Rows having unequal lengths and deserialization errors are the only corner cases that immediately come to mind for me. Are there any other corner cases to worry about?

let's find a good solution for data frames

IIRC, you were working on a data frame project a while ago. How did that go?

@bluss
Member

bluss commented Nov 13, 2018

@jturner314 I see, that makes sense. I guess my tricky one was guessing the type of each column. :)

The data frame project was last Christmas break, and that was the only time I had time for such a project. I shouldn't do it :) unless I get a very long break soon.

Some of the troubles in that project: "wrapping" ndarray arrays while offering the same owned/view interface.

Supporting the NetCDF (more or less HDF5) type system leads to lots of enums: an enum to wrap each possible scalar type, a corresponding enum to again wrap an ndarray of each possible scalar type, and so on. There is also a Dataset type, which instead uses type erasure of the underlying data arrays (each array has a uniform element type).

/// The Dataset is like a labelled set of DataArrays
///
/// We could think of each data array as a "column", except we allow multiple
/// axes in each array.
///
/// So for example given dimensions x, y, time  we can have columns:
///
/// ```text
///
/// variable        axes        data type
/// =====================================
/// temperature     x, y, time  f64
/// elevation       x, y        f64
/// solar activity  time        f64
/// cultiv. policy  time        string
///
/// coordinate      axes        data type
/// =====================================
/// x               -           i64
/// y               -           i64
/// time            -           f64
/// ```

The project looked a bit like something that should stay in Python, especially since the user needs to supply a concrete type when they want to read the values of a specific variable in the Dataset. All of that becomes even more unpleasant in generic code and utility methods on these types, IMO.

It's all just a big WIP and doesn't do anything ☹️. It tries to solve the Rust-specific problems around types and ownership, and doesn't get as far as implementing anything domain-specific. It may well need a restart with a new plan.

I can show some debug output that illustrates how the data structures are constructed:

#[test]
fn test_simple() {
    let mut a = Array::zeros((3, 4));
    a[[1, 2]] = 1.;
    let da = DataArray::from(a).dim_names(vec!["x", "y"]).with_name("data");

    let mut ds = Dataset::from(da);
    let s = String::from;
    ds.attributes_mut().insert(s("title"), AttributeValue::from("test dataset"));
    println!("{:#?}", ds);
    println!("{:#?}", ds.variable::<f64>("data"));
}

has the following output:

Dataset {
    dims: {"x": 3, "y": 4},
    coordinates: [
        ("x", Range(RangeIndex { from: 0, to: 3 })),
        ("y", Range(RangeIndex { from: 0, to: 4 }))
    ],
    attributes: {
        "title": Text("test dataset")
    },
    variables: {
        "data": Variable {
            type_id: TypeId {
                t: 14742493193654942124
            },
            data_array: DataArray {
                dims: {"x": 3, "y": 4},
                coordinates: [
                    Range(RangeIndex { from: 0, to: 3 }),
                    Range(RangeIndex { from: 0, to: 4 })
                ],
                name: "data",
                attributes: {},
                values: 
                [[TypeErase, TypeErase, TypeErase, TypeErase],
                 [TypeErase, TypeErase, TypeErase, TypeErase],
                 [TypeErase, TypeErase, TypeErase, TypeErase]] shape=[3, 4], strides=[4, 1], layout=C (0x1),
                typecode: None,
                info: Some(
                    0x0000559dc05a5430
                )
            }
        }
    }
}
Ok(
    DataArray {
        dims: {"x": 3, "y": 4},
        coordinates: [
            Range(RangeIndex { from: 0, to: 3 }),
            Range(RangeIndex { from: 0, to: 4 })
        ],
        name: "data",
        attributes: {},
        values: 
        [[   0.0,    0.0,    0.0,    0.0],
         [   0.0,    0.0,    1.0,    0.0],
         [   0.0,    0.0,    0.0,    0.0]] shape=[3, 4], strides=[4, 1], layout=C (0x1),
        typecode: None,
        info: Some(
            0x0000559dc05a5430
        )
    }
)

@jturner314
Member

since the user needs to supply a concrete type when they want to read the values of a specific variable in the Dataset

Hmm... That is quite inconvenient. Based on your experience with the challenges of a dynamic approach (enums everywhere and type erasure), I wonder if a much more static approach would make sense (using a proc macro to define specific dataframe types). So the user could define a dataframe and implement the relevant methods/traits like this:

dataframe!{
    MyDataframe {
        attributes: {
            title: String,
        },
        coordinates: {x: i64, y: i64, time: f64},
        variables: {
            temperature (x, y, time): f64,
            elevation (x, y): f64,
            solar_activity (time): f64,
            cultiv_policy (time): String,
        },
    }
}

There would be methods on MyDataframe for accessing individual coordinates/variables by their name, but as much functionality as possible would be put into traits common to different dataframe types. It might be necessary to define a type for each coordinate and variable too; I'm not sure.

I shouldn't do it :) Unless I get a very long break soon.

That's understandable. :)

@bluss
Member

bluss commented Nov 15, 2018

That's a good idea. This pattern seems familiar in Rust... maybe it's the way we have to do it.

@LukeMathWalker
Member

Give me dataframes in Rust and I can slowly start to use it for work ❤️

@rcarson3
Contributor

rcarson3 commented Nov 22, 2018

I've been working on a Rust data reader equivalent to NumPy's loadtxt. You can find the current framework here: https://github.com/rcarson3/rust_data_reader. It supports the various Rust primitive types. The data is returned in a struct that contains all of the data in a single vector, along with the number of lines and columns read. Therefore, it should be pretty easy to write a simple wrapper that imports the data into an ndarray type using either from_shape_vec or from_vec. I haven't had a lot of time to implement a faster backend yet, but it's not too bad currently: I'm able to get roughly 50+ MB/s. I'm hoping that once a deadline at work passes I'll have time to work on it some more. Eventually, I'd like to support reading multiple data types from a single file as well.

Edit: I should mention that, contrary to the README, it works pretty well currently. I do need to fix a few small edge cases related to how it counts the number of lines and the number of lines that are comments. However, I'd say it's pretty usable for data whose formatting you trust, as long as the values are one of the primitive types.
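
(A minimal sketch of such a wrapper; the struct and field names here are hypothetical stand-ins for whatever rust_data_reader actually returns:)

use ndarray::Array2;

// Hypothetical output struct; the real one in rust_data_reader may
// use different names.
struct ReaderOutput<T> {
    data: Vec<T>,        // all values, row-major
    num_lines: usize,    // rows read
    num_columns: usize,  // columns per row
}

// Wrap the flat vector into a 2D array without copying the data.
fn to_array2<T>(out: ReaderOutput<T>) -> Result<Array2<T>, ndarray::ShapeError> {
    Array2::from_shape_vec((out.num_lines, out.num_columns), out.data)
}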

@rcarson3
Contributor

rcarson3 commented Nov 22, 2018

@bluss @jturner314 @LukeMathWalker I just remembered the DataFusion project that Andy Grove was working on, which had some dataframe functionality in it. It led me to a couple of posts: Rust Dataframes, Andy Grove's post on Apache Arrow, and a Reddit post on dataframes. It seems that, as of six months ago, enums and some use of traits were the main approaches attempted for building a dataframe library. It also seems that the Apache Arrow project now has a native Rust implementation. It appears that Arrow might be a good library/data format to use as the underlying data structure for a dataframe library.

I'm not sure whether any of these are the best approach for a pure Rust dataframe implementation, but it's always nice to see what's already been done in the field.

@elbaro

elbaro commented Aug 29, 2020

How can I do this in ndarray?

fn clean_na(df: Array2<f64>) -> Array2<f64> {
    // Desired: collect the filtered rows back into a 2D array.
    df.outer_iter().filter(|row| row_has_no_na(row)).collect()
    // or, hypothetically, in parallel:
    // df.outer_iter_par().filter(|row| row_has_no_na(row)).collect()
}

Using Zip requires knowing the number of filtered rows in advance.

@elbaro

elbaro commented Aug 29, 2020

Neither stack nor select works in this case (#269):

// The input type is
// ndarray::ArrayBase<ndarray::data_repr::OwnedRepr<alloc::string::String>, ndarray::dimension::dim::Dim<[usize; 2]>>

// clean up before parsing items to f64
fn clean(df: &Array2<String>) -> Array2<String> {
    let mut out = Array2::<String>::default((0, df.shape()[1]));
    for row in df.outer_iter() {
        if row[5] != "NA" {
            out = ndarray::stack![Axis(0), out, row.insert_axis(Axis(0)).clone()];
        }
    }
    out
}


ndarray::stack![Axis(0), out, row.insert_axis(Axis(0)).clone()];
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the trait bound `std::string::String: std::marker::Copy` is not satisfied
fn clean(df: &Array2<String>) -> Array2<String> {
    let indices: Vec<usize> = df
        .outer_iter()
        .enumerate()
        .filter(|(_, row)| row[5] != "NA")
        .map(|(idx, _)| idx)
        .collect();
    let out = df.select(Axis(0), &indices[..]);
    out
}


    let out = df.select(Axis(0), &indices[..]);
                 ^^^^^^ the trait bound `std::string::String: std::marker::Copy` is not satisfied
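
(For reference, a clone-based workaround sketch that sidesteps the Copy bound by gathering the kept rows into a flat Vec; note that later ndarray releases relax select to only require A: Clone, which would also make the second attempt work:)

use ndarray::Array2;

fn clean(df: &Array2<String>) -> Array2<String> {
    let n_cols = df.ncols();
    let mut data = Vec::new();
    let mut n_rows = 0;
    for row in df.outer_iter() {
        // Keep the row only if column 5 is not "NA".
        if row[5] != "NA" {
            data.extend(row.iter().cloned());
            n_rows += 1;
        }
    }
    Array2::from_shape_vec((n_rows, n_cols), data)
        .expect("row lengths are uniform by construction")
}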

@nemosupremo

I came across a use case where I wanted to build a 2D ndarray after filtering out some rows. Intuitively, I wanted to do something like:

let filtered: Array<usize, Ix2> = table
    .genrows()
    .into_iter()
    .filter(|r| r.iter().all(|&v| v > 0))
    .collect();

@kaimast

kaimast commented Feb 18, 2021

I have a similar issue. I want to convert a JSON file that is a set of nested vectors into an ArrayD, similar to how you can pass nested lists to numpy.array. I tried using the serde_json features built into this crate, but I think it expects a different format.

I ended up building a custom function that iterates over the JSON to detect the shape and then builds the array. It's not super efficient, though.
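
(A sketch of that shape-detection approach, assuming a rectangular structure with f64 leaves; ragged or empty input is rejected:)

use ndarray::{ArrayD, IxDyn};
use serde_json::Value;

fn json_to_arrayd(v: &Value) -> Option<ArrayD<f64>> {
    // Infer the shape by descending into the first element at each level.
    let mut shape = Vec::new();
    let mut cur = v;
    while let Value::Array(a) = cur {
        shape.push(a.len());
        cur = a.first()?; // bail out on empty arrays for simplicity
    }
    // Flatten all leaves depth-first into one Vec.
    fn flatten(v: &Value, out: &mut Vec<f64>) -> Option<()> {
        match v {
            Value::Array(a) => a.iter().try_for_each(|x| flatten(x, out)),
            leaf => {
                out.push(leaf.as_f64()?);
                Some(())
            }
        }
    }
    let mut data = Vec::new();
    flatten(v, &mut data)?;
    // Fails if the input was ragged (element count != product of shape).
    ArrayD::from_shape_vec(IxDyn(&shape), data).ok()
}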

@bluss
Member

bluss commented Feb 18, 2021

@nemosupremo That makes sense to me - there's a lot we can do.

@jeremysalwen

jeremysalwen commented Dec 25, 2021

I'd actually suggest a method signature like this instead:

// Place in src/impl_2d.rs
impl<A, S> ArrayBase<S, Ix2>
where
    S: Data<Elem = A>,
{
    pub fn from_rows<I>(rows: I) -> Result<Self, ShapeError>
    where
        I: IntoIterator,
        I::Item: IntoIterator<Item = A>,
        S: DataOwned,
    {
        unimplemented!()
    }
}

There has been a lot of topic drift on this thread since this comment, but I want to refocus on the original question:

  1. Can we get a function like the suggested from_rows above into ndarray?
  2. In the meantime, what is the best way to convert a list of rows into an Array2? (i.e. Vec<Vec<>> or Iterator<Iterator<>> type data structures)?

The best way I could think of is to collect into a Vec<Vec<_>>, measure the dimensions, and then use from_shape_fn to construct it. (A sketch of an alternative using from_shape_vec is below.)
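
(For what it's worth, a sketch of that conversion using from_shape_vec instead of from_shape_fn, which avoids per-element indexing; it assumes the inner Vecs all have the same length and lets from_shape_vec catch ragged input:)

use ndarray::Array2;

fn vecs_to_array2<A>(rows: Vec<Vec<A>>) -> Result<Array2<A>, ndarray::ShapeError> {
    let n_rows = rows.len();
    let n_cols = rows.first().map_or(0, |r| r.len());
    // Flatten row-major; if any row has a different length, the total
    // element count won't match and from_shape_vec returns an error.
    let data: Vec<A> = rows.into_iter().flatten().collect();
    Array2::from_shape_vec((n_rows, n_cols), data)
}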

@sjackman

https://stackoverflow.com/questions/65351813/whats-the-right-way-to-populate-a-rust-ndarray-from-an-iterator-over-structs

The vector that is collected into is directly used as the backing storage for the array

https://docs.rs/ndarray/latest/ndarray/type.Array2.html#method.from

let arr: Array2<f64> = iter
    .map(|row| [row.a, row.b, row.c])
    .collect::<Vec<_>>()
    .into();

Collect into a Vec<[A; N]> and then use Array2::from(vec) or vec.into(). (This relies on the column count N being known at compile time.)
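
(A self-contained version of that pattern, with a hypothetical Row struct standing in for the structs being iterated over:)

use ndarray::Array2;

// Hypothetical record type for illustration.
struct Row {
    a: f64,
    b: f64,
    c: f64,
}

fn rows_to_array(rows: Vec<Row>) -> Array2<f64> {
    // From<Vec<[A; N]>> reuses the Vec as the array's backing storage,
    // so no extra per-element copy happens beyond building the Vec.
    rows.into_iter()
        .map(|row| [row.a, row.b, row.c])
        .collect::<Vec<[f64; 3]>>()
        .into()
}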
