
Parquet on HDFS #9

Open
benjdv opened this issue Jul 10, 2019 · 5 comments


benjdv commented Jul 10, 2019

As parquet files are often stored in HDFS, it would be great to have the possibility to read a foreign table directly from HDFS.

There are some HDFS wrappers (https://wiki.postgresql.org/wiki/Foreign_data_wrappers), but they use HBase or Hive. (Note that the parquet_fdw wrapper is not listed on that page.)


zilder commented Jul 10, 2019

Hi @benjdv,

AFAICS, libarrow may be built with HDFS support, which means it's likely doable (a rough sketch of what that could look like is at the end of this comment). I haven't worked with HDFS yet, so I'll need some time to figure it out. Right now I'm a bit swamped with current work tasks and will get back to this and the other issue as I get more free time.

(Note that the parquet_fdw wrapper is not listed on that page)

That's a good idea. Will try to find someone with write permissions to the PostgreSQL wiki.
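
For reference, a minimal sketch of what the HDFS read path could look like through libarrow, assuming Arrow is built with HDFS support (`-DARROW_HDFS=ON`) and a version recent enough to ship the `arrow::fs` filesystem layer. The namenode host, port and file path below are placeholders, not anything parquet_fdw exposes today:

```cpp
#include <arrow/api.h>
#include <arrow/filesystem/hdfs.h>
#include <parquet/arrow/reader.h>

#include <iostream>
#include <memory>

// Connects to an HDFS namenode and reads one Parquet file into an Arrow table.
arrow::Status ReadParquetFromHdfs() {
  // Placeholder namenode endpoint; user/kerberos settings can also go on HdfsOptions.
  arrow::fs::HdfsOptions options;
  options.ConfigureEndPoint("namenode.example.com", 8020);

  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::HadoopFileSystem::Make(options));

  // OpenInputFile returns a RandomAccessFile, the interface parquet::arrow reads from.
  ARROW_ASSIGN_OR_RAISE(auto input, fs->OpenInputFile("/data/example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  std::cout << "rows: " << table->num_rows() << std::endl;
  return arrow::Status::OK();
}

int main() {
  auto st = ReadParquetFromHdfs();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```

If parquet_fdw already reads local files through Arrow's file abstractions, the main integration question would presumably be how to pass the HDFS connection settings (namenode host/port, user) through the foreign server or table options.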


zilder commented Jul 10, 2019

Added parquet_fdw to the postgres wiki.

@einhverfr

There is one major disadvantage to doing this as far as I can see (something worth noting up front), and that is that HDFS does not provide seek support, so you can't read the parquet metadata until the entire file has been sent. A more typical way to process these on HDFS would be to use something else on that stack to do the map-reduce on the storage nodes, so you get the benefits there.

That is not to say that this shouldn't be supported, but a lot of the performance optimizations will not work on HDFS, and this would need to be documented.


sumerman commented Dec 5, 2019

@einhverfr I'm not sure what you are referring to. HDFS most definitely supports this operation at the API level (https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FSDataInputStream.html#seek(long)), and it also supports reading block-wise, which enables data parallelism. Seeking into the middle of a 125M block would be inefficient, and you can't append there, but I don't think either would be a concern here.
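
To make the seek point concrete for Parquet: a reader locates the metadata by random access at the tail of the file rather than by scanning it, since the last 8 bytes are a little-endian footer length followed by the `PAR1` magic. Assuming a reasonably recent libarrow (where `GetSize`/`ReadAt` return `Result`), a rough sketch of that footer probe against the generic `RandomAccessFile` handle (the same kind of handle an HDFS-backed file would provide) could look like this; it is an illustration, not code from parquet_fdw:

```cpp
#include <arrow/api.h>
#include <arrow/io/interfaces.h>

#include <cstdint>
#include <cstring>
#include <memory>

// Reads only the 8-byte Parquet tail (4-byte little-endian footer length
// followed by the "PAR1" magic) via ReadAt, i.e. a seek-style access pattern.
arrow::Result<int32_t> ReadFooterLength(
    const std::shared_ptr<arrow::io::RandomAccessFile>& file) {
  ARROW_ASSIGN_OR_RAISE(int64_t size, file->GetSize());
  if (size < 12) {
    return arrow::Status::Invalid("File too small to be a Parquet file");
  }
  ARROW_ASSIGN_OR_RAISE(auto tail, file->ReadAt(size - 8, 8));

  if (std::memcmp(tail->data() + 4, "PAR1", 4) != 0) {
    return arrow::Status::Invalid("Missing Parquet magic");
  }

  // Assumes a little-endian host for the raw copy below.
  int32_t footer_len = 0;
  std::memcpy(&footer_len, tail->data(), sizeof(footer_len));
  return footer_len;
}
```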


PepperJo commented Jul 7, 2020

Would also love to see S3 support, which should also be possible with libarrow.
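
If it helps, the shape of it with libarrow's S3 filesystem (available when Arrow is built with S3 support; API details vary by version) would presumably be much the same as for HDFS. Bucket, key and options below are placeholders:

```cpp
#include <arrow/api.h>
#include <arrow/filesystem/s3fs.h>
#include <parquet/arrow/reader.h>

#include <memory>

// Reads one Parquet object from S3 into an Arrow table.
arrow::Status ReadParquetFromS3() {
  // S3 support has to be initialized once per process.
  arrow::fs::S3GlobalOptions global_options;
  global_options.log_level = arrow::fs::S3LogLevel::Fatal;
  ARROW_RETURN_NOT_OK(arrow::fs::InitializeS3(global_options));

  // Credentials and region come from the default AWS provider chain here;
  // they can also be set explicitly on S3Options.
  auto options = arrow::fs::S3Options::Defaults();
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options));

  // Placeholder "bucket/key" path.
  ARROW_ASSIGN_OR_RAISE(auto input,
                        fs->OpenInputFile("my-bucket/path/to/example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  return arrow::fs::FinalizeS3();
}
```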

darkebe mentioned this issue Sep 13, 2022