Examples under amadeus-parquet have some problems #22

Open
vertexclique opened this issue Dec 13, 2019 · 4 comments

Comments

@vertexclique

Hi!

I was looking at the parquet parser example. The examples in the README.md (https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/README.md) have some problems: I couldn't get them to run.

I have a couple of suggestions for the parquet crate:

  • There is no standalone parquet reader available. It would be really nice to have this parquet reader as a separate, independent package.
  • Some serdes produce different names for List. For example, with my structures, the Data derive proc-macro gives me this:
 REQUIRED group events (LIST) {
 REPEATED group list {
 REQUIRED group element {

whereas I expected this:

 REQUIRED group events (LIST) {
 REPEATED group array {

I simply want a struct inside an array. Java's Arrow serializer wrote the structs under the name array and didn't add yet another nested level. A guide for this kind of nested structure would be nice.
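For readers hitting the same mismatch: the Parquet format's backward-compatibility rules for LIST allow both layouts above. The following is a minimal sketch of those rules as I read them, not amadeus code. `Node`, `ListLayout`, and `resolve_element` are hypothetical names invented for illustration.

```rust
/// Hypothetical, simplified schema node (not an amadeus or parquet type).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Primitive { name: String },
    Group { name: String, fields: Vec<Node> },
}

/// Where the list element lives relative to the repeated field.
#[derive(Debug, PartialEq)]
enum ListLayout<'a> {
    /// Legacy layouts: the repeated field itself is the element.
    RepeatedIsElement(&'a Node),
    /// Standard three-level layout: the element is the repeated group's only child.
    Wrapped(&'a Node),
}

/// Decide which node is the list element, given the name of the
/// LIST-annotated group and its single repeated child.
fn resolve_element<'a>(list_name: &str, repeated: &'a Node) -> ListLayout<'a> {
    match repeated {
        // A repeated primitive is the element itself.
        Node::Primitive { .. } => ListLayout::RepeatedIsElement(repeated),
        Node::Group { name, fields } => {
            if fields.len() != 1 {
                // A repeated group with several fields is the element itself.
                ListLayout::RepeatedIsElement(repeated)
            } else if name.as_str() == "array" || *name == format!("{}_tuple", list_name) {
                // Legacy writers name the repeated group `array` or `<list>_tuple`.
                ListLayout::RepeatedIsElement(repeated)
            } else {
                // Otherwise assume the standard `list { element }` wrapping.
                ListLayout::Wrapped(&fields[0])
            }
        }
    }
}
```

Under this reading, a `REPEATED group array` holding a single `event_id` field is itself the element, while `REPEATED group list` wraps an inner `element` group.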

@alecmocatta
Member

alecmocatta commented Dec 17, 2019 via email

@vertexclique
Author

vertexclique commented Dec 17, 2019

Hi Alec;

Thanks for your response. That is very helpful. Let me add my observations here:

> Printing the expected schema (which I guess is happening for you as part of
> an error message?) will, IIRC, print the “normal” parquet list schema
> rather than the also valid potential “compat” schemas - which is
> potentially confusing matters here?

That is solved, even though the field names differ at the nested level, as I showed in my first comment. If anyone else runs into this nesting issue, here is the solution, since the current macro generates flat levels (which is good, don't worry):

#[derive(Data, Debug, Clone)]
pub struct Event {
    pub event_id: Option<String>
}

#[derive(Data, Debug, Clone)]
pub struct Events {
    pub array: List<Event>
}

#[derive(Data, Debug, Clone)]
pub struct EventStream {
    pub events: Events,
}

This solved my problem, so you can disregard my earlier comment about it.

BUT I have one more thing to raise, because it matters a lot in some analytical workloads. Readers for columnar formats are often expected to read only a subset of the schema rather than all of it. I saw that you have already added an argument for selecting that subset (it is called projection in the code):

	/// Creates row iterator for all row groups in a file.
	pub fn from_file(_proj: Option<Predicate>, reader: R) -> Result<Self> {
		let file_schema = reader.metadata().file_metadata().schema_descr_ptr();
		let file_schema = file_schema.root_schema();
		let schema = <Root<T> as ParquetData>::parse(file_schema, None)?.1;

		Ok(Self::new(Some(reader), None, schema))
	}

Is it possible to have this predicate pushdown? If it gets implemented (and I can help with the implementation), I will drop the original parquet implementation.
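To make the request concrete, here is a rough sketch of the two ideas involved: projection (reading only the requested columns) and predicate pushdown (skipping row groups whose statistics rule out a match). It is an illustration under stated assumptions, not the amadeus API. `ColumnStats`, `project`, and `row_groups_to_read` are hypothetical names, and the statistics are assumed to be min/max values for a single i64 column.

```rust
use std::collections::HashSet;

/// Hypothetical min/max statistics for one i64 column in one row group.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
}

/// Projection: keep only the requested columns, preserving file order.
fn project<'a>(file_columns: &[&'a str], wanted: &HashSet<&str>) -> Vec<&'a str> {
    file_columns
        .iter()
        .copied()
        .filter(|c| wanted.contains(*c))
        .collect()
}

/// Predicate pushdown for `lo <= column <= hi`: return the indices of row
/// groups whose [min, max] range overlaps the requested range. Row groups
/// outside the range cannot contain a match and need not be decoded.
fn row_groups_to_read(groups: &[ColumnStats], lo: i64, hi: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max >= lo && s.min <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

The two are independent: projection narrows which columns are decoded, while pushdown narrows which row groups are touched. A reader could support either one on its own.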

@alecmocatta
Member

alecmocatta commented Dec 18, 2019 via email

@vertexclique
Author

That would be cool. I will jump into the chat soon.
