-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Examples under amadeus-parquet have some problems #22
Comments
Hi Mahmut and thanks for the issue!
Apologies I’m on my phone on holiday this week but I’ll try my best to respond.
* There is no standalone parquet reader at the outside, would be really nice to have this parquet reader as a separate non-dependent package.
Agreed; I think standalone parquet reader interfaces might just about exist in the amadeus-parquet crate - I’ll fix its docs.rs build this weekend, and adding examples has been on my TODO list for a while! https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/README.md is in dire need of updating, I’ll get to that when I can.
The [`List`](https://docs.rs/amadeus/0.1.5/amadeus/struct.List.html) data type *should* be usable to read any of the various [back-compat variations of lists](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules), which I believe includes what you’re describing. It should work for `List<MyStruct>` where `MyStruct` is a struct that derives `Data`. If that doesn’t work then it’s most likely a bug in the `parse_list` method [here](https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/internal/record/impls.rs#L424). Printing the expected schema (which I guess is happening for you as part of an error message?) will, IIRC, print the “normal” parquet list schema rather than the also valid potential “compat” schemas - which is potentially confusing matters here?
Are you able to include the full error you’re seeing, and if possible a code sample, and I’ll do my best to resolve!
…On Fri, 13 Dec 2019 at 17:53, Mahmut Bulut ***@***.***> wrote:
Hi!
I was gazing to the parquet parser example. Examples under README.md
https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/README.md
has some problems. I couldn't make them run.
I have a couple of suggestions for the parquet crate:
- There is no standalone parquet reader at the outside, would be
really nice to have this parquet reader as a separate non-dependent package.
- some serdes are creating different names for List. For example with
my structures Data derivation proc-macro does this:
I got this:
REQUIRED group events (LIST) {
REPEATED group list {
REQUIRED group element {
where I expect this:
REQUIRED group events (LIST) {
REPEATED group array {
Where I simply want a structure in an array and java's arrow serializer
serialized the structs with the name of array and didn't add yet another
nested level. If you add a guide for this nested structure that would be
nice.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#22?email_source=notifications&email_token=AAIVM5XJFQTMWEIQWKETZX3QYO4YDA5CNFSM4J2QIJX2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IAL4BHQ>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIVM5W2ALSF7OXF6AT3J6DQYO4YDANCNFSM4J2QIJXQ>
.
|
Hi Alec; Thanks for your response. That is very helpful. Let me add my observations here:
That is solved. Even the field names are different at the nested level like I've showed in my first comment. If a person comes with having this nested issue. Here is the solution because the current macro is generating flat levels(which is good don't worry):
This solved my problems and you can discard my comment over it. BUT I have one more important thing to comment on because that is very important in some analytical workloads. The columnar format readers are sometimes meant to not read the whole schema. I saw that you have implemented the argument taking for the subset (it is called projection in code): /// Creates row iterator for all row groups in a file.
pub fn from_file(_proj: Option<Predicate>, reader: R) -> Result<Self> {
let file_schema = reader.metadata().file_metadata().schema_descr_ptr();
let file_schema = file_schema.root_schema();
let schema = <Root<T> as ParquetData>::parse(file_schema, None)?.1;
Ok(Self::new(Some(reader), None, schema))
} Is it possible to have this predicate pushdown? If this is implemented (or I can help in implementation) I will abandon the original implementation of parquet. |
I’m glad you’ve got it working! FWIW for the schema you wrote before I
would also expect the following to work:
#[derive(Data, Debug, Clone)]
pub struct Event {
pub event_id: Option<String>
}
#[derive(Data, Debug, Clone)]
pub struct EventStream {
pub events: List<Event>,
}
I’ll have a look into it on my return.
Predicate pushdown would be really cool. IIRC the statistics used for it
are being read, though not used. I think my intention was possibly adding
an associated type to the ParquetData trait that can store predicates. This
would be replace Predicate in the API, and be tested against the statistics
before reading each column chunk and page.
I’ll have a look into it this weekend and perhaps we can hash out the
design on chat.
…On Tue, 17 Dec 2019 at 12:01, Mahmut Bulut ***@***.***> wrote:
Hi Alec;
Thanks for your response. That is very helpful. Let me add my observations
here:
Printing the expected schema (which I guess is happening for you as part of
an error message?) will, IIRC, print the “normal” parquet list schema
rather than the also valid potential “compat” schemas - which is
potentially confusing matters here?
That is solved. Even the field names are different at the nested level
like I've showed in my first comment. If a person comes with this nested
issue here is the solution because current macro is generating flat levels:
#[derive(Data, Debug, Clone)]
pub struct Event {
pub event_id: Option<String>
}
#[derive(Data, Debug, Clone)]
pub struct Events {
pub array: List<Event>
}
#[derive(Data, Debug, Clone)]
pub struct EventStream {
pub events: Events,
}
This solved my problems and you can discard my comment over it.
BUT I have one more important thing to comment on because that is very
important in some analytical workloads. The columnar format readers are
sometimes meant to not read the whole schema. I saw that you have
implemented the argument taking for the subset (it is called projection in
code):
/// Creates row iterator for all row groups in a file.
pub fn from_file(_proj: Option<Predicate>, reader: R) -> Result<Self> {
let file_schema = reader.metadata().file_metadata().schema_descr_ptr();
let file_schema = file_schema.root_schema();
let schema = <Root<T> as ParquetData>::parse(file_schema, None)?.1;
Ok(Self::new(Some(reader), None, schema))
}
Is it possible to have this predicate pushdown? If this is implemented (or
I can help in implementation) I will abandon the original implementation.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#22?email_source=notifications&email_token=AAIVM5VFUMU3UYLHKLGEGQTQZCWSBA5CNFSM4J2QIJX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHCAA7A#issuecomment-566493308>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIVM5VR6UPSTYXNB5CUD6LQZCWSBANCNFSM4J2QIJXQ>
.
|
Would be cool, I will jump to chat soon. |
Hi!
I was gazing to the parquet parser example. Examples under README.md https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/README.md has some problems. I couldn't make them run.
I have a couple of suggestions for the parquet crate:
List
. For example with my structuresData
derivation proc-macro does this:I got this:
where I expect this:
Where I simply want a structure in an array and java's arrow serializer serialized the structs with the name of
array
and didn't add yet another nested level. If you add a guide for this nested structure that would be nice.The text was updated successfully, but these errors were encountered: