
Parquet on HDFS #9

Open
benjdv opened this issue Jul 10, 2019 · 5 comments


benjdv commented Jul 10, 2019

As parquet files are often stored in HDFS, it would be great to have the possibility to read a foreign table directly from HDFS.

There are some HDFS wrappers (https://wiki.postgresql.org/wiki/Foreign_data_wrappers), but they use HBase or Hive. (Note that the parquet_fdw wrapper is not listed on that page.)


zilder commented Jul 10, 2019

Hi @benjdv,

AFAICS, libarrow may be built with HDFS support, which means it's likely doable (a rough sketch of what that could look like is at the end of this comment). I haven't worked with HDFS yet, so I'll need some time to figure it out. Right now I'm a bit swamped with current work tasks and will get back to this and the other issue as I get more free time.

(Note that the parquet_fdw wrapper is not listed on that page)

That's a good idea. Will try to find someone with write permissions to the PostgreSQL wiki.
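
For reference, a minimal sketch of what the HDFS read path could look like through libarrow, assuming Arrow is built with HDFS support (`-DARROW_HDFS=ON`) and a version recent enough to ship the `arrow::fs` filesystem layer. The namenode host, port and file path below are placeholders, not anything parquet_fdw exposes today:

```cpp
#include <arrow/api.h>
#include <arrow/filesystem/hdfs.h>
#include <parquet/arrow/reader.h>

#include <iostream>
#include <memory>

// Connects to an HDFS namenode and reads one Parquet file into an Arrow table.
arrow::Status ReadParquetFromHdfs() {
  // Placeholder namenode endpoint; user/kerberos settings can also go on HdfsOptions.
  arrow::fs::HdfsOptions options;
  options.ConfigureEndPoint("namenode.example.com", 8020);

  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::HadoopFileSystem::Make(options));

  // OpenInputFile returns a RandomAccessFile, the interface parquet::arrow reads from.
  ARROW_ASSIGN_OR_RAISE(auto input, fs->OpenInputFile("/data/example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  std::cout << "rows: " << table->num_rows() << std::endl;
  return arrow::Status::OK();
}

int main() {
  auto st = ReadParquetFromHdfs();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```

If parquet_fdw already reads local files through Arrow's file abstractions, the main integration question would presumably be how to pass the HDFS connection settings (namenode host/port, user) through the foreign server or table options.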


zilder commented Jul 10, 2019

Added parquet_fdw to the postgres wiki.

@einhverfr

There is one major disadvantage to doing this as far as I can see (something worth noting up front), and that is that HDFS does not provide seek support, so you can't read the parquet metadata until the entire file has been sent. A more typical way to process these on HDFS would be to use something else on that stack to do the map-reduce on the storage nodes, so you get the benefits there.

That is not to say that this shouldn't be supported, but a lot of the performance optimizations will not work on HDFS, and this would need to be documented.


sumerman commented Dec 5, 2019

@einhverfr I'm not sure what you are referring to. HDFS most definitely supports this operation at the API level (https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FSDataInputStream.html#seek(long)), and it also supports reading block-wise, which enables data parallelism. Seeking into the middle of a 125M block would be inefficient, and you can't append there, but I don't think either would be a concern here.
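
To make the seek point concrete for Parquet: a reader locates the metadata by random access at the tail of the file rather than by scanning it, since the last 8 bytes are a little-endian footer length followed by the `PAR1` magic. Assuming a reasonably recent libarrow (where `GetSize`/`ReadAt` return `Result`), a rough sketch of that footer probe against the generic `RandomAccessFile` handle (the same kind of handle an HDFS-backed file would provide) could look like this; it is an illustration, not code from parquet_fdw:

```cpp
#include <arrow/api.h>
#include <arrow/io/interfaces.h>

#include <cstdint>
#include <cstring>
#include <memory>

// Reads only the 8-byte Parquet tail (4-byte little-endian footer length
// followed by the "PAR1" magic) via ReadAt, i.e. a seek-style access pattern.
arrow::Result<int32_t> ReadFooterLength(
    const std::shared_ptr<arrow::io::RandomAccessFile>& file) {
  ARROW_ASSIGN_OR_RAISE(int64_t size, file->GetSize());
  if (size < 12) {
    return arrow::Status::Invalid("File too small to be a Parquet file");
  }
  ARROW_ASSIGN_OR_RAISE(auto tail, file->ReadAt(size - 8, 8));

  if (std::memcmp(tail->data() + 4, "PAR1", 4) != 0) {
    return arrow::Status::Invalid("Missing Parquet magic");
  }

  // Assumes a little-endian host for the raw copy below.
  int32_t footer_len = 0;
  std::memcpy(&footer_len, tail->data(), sizeof(footer_len));
  return footer_len;
}
```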


PepperJo commented Jul 7, 2020

Would also love to see S3 support, which should also be possible with libarrow.
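
If it helps, the shape of it with libarrow's S3 filesystem (available when Arrow is built with S3 support; API details vary by version) would presumably be much the same as for HDFS. Bucket, key and options below are placeholders:

```cpp
#include <arrow/api.h>
#include <arrow/filesystem/s3fs.h>
#include <parquet/arrow/reader.h>

#include <memory>

// Reads one Parquet object from S3 into an Arrow table.
arrow::Status ReadParquetFromS3() {
  // S3 support has to be initialized once per process.
  arrow::fs::S3GlobalOptions global_options;
  global_options.log_level = arrow::fs::S3LogLevel::Fatal;
  ARROW_RETURN_NOT_OK(arrow::fs::InitializeS3(global_options));

  // Credentials and region come from the default AWS provider chain here;
  // they can also be set explicitly on S3Options.
  auto options = arrow::fs::S3Options::Defaults();
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options));

  // Placeholder "bucket/key" path.
  ARROW_ASSIGN_OR_RAISE(auto input,
                        fs->OpenInputFile("my-bucket/path/to/example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  return arrow::fs::FinalizeS3();
}
```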

darkebe mentioned this issue Sep 13, 2022