Parquet on HDFS #9
As Parquet files are often stored in HDFS, it would be great to have the possibility to read a foreign table directly from HDFS.
There are existing HDFS wrappers (https://wiki.postgresql.org/wiki/Foreign_data_wrappers), but they use HBase or Hive. (Note that the parquet_fdw wrapper is not present on this page.)
Comments
Hi @benjdv, AFAICS that's a good idea. Will try to find someone with write permissions to the PostgreSQL wiki.
Added.
There is one major disadvantage to doing this as far as I can see (something worth noting up front): HDFS does not provide seek support, so you can't read the Parquet metadata until the entire file is sent. A more typical way to process these files on HDFS would be to use something else in that stack to run the map-reduce on the storage nodes, so you get the benefits there. That is not to say this shouldn't be supported, but a lot of the performance optimizations will not work on HDFS, and that would need to be documented.
@einhverfr not sure what you are referring to. HDFS most definitely supports this operation at the API level (https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FSDataInputStream.html#seek(long)), and it does support reading block-wise, which enables data parallelism. Seeking into the middle of a 125M block would be inefficient, and you can't append there, although I don't think either would be a concern here.
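For illustration, here is a minimal sketch (not from the thread) of reading only the Parquet footer of a file on HDFS through libarrow's C++ filesystem layer, which is the kind of seek-based read being discussed. The namenode endpoint and file path are placeholders, and it assumes a recent Arrow C++ build with HDFS support enabled (ARROW_HDFS=ON) plus libhdfs and a JVM available at runtime:

```cpp
#include <arrow/filesystem/hdfs.h>
#include <parquet/file_reader.h>
#include <parquet/metadata.h>
#include <iostream>
#include <memory>

int main() {
  // Placeholder namenode endpoint; adjust for the actual cluster.
  arrow::fs::HdfsOptions options;
  options.ConfigureEndPoint("namenode.example.com", 8020);

  auto fs = arrow::fs::HadoopFileSystem::Make(options).ValueOrDie();

  // OpenInputFile returns a RandomAccessFile, i.e. it supports ReadAt/seek,
  // so the Parquet reader can jump straight to the footer at the end of the file.
  auto input = fs->OpenInputFile("/data/example.parquet").ValueOrDie();

  // Only the footer (schema, row-group offsets, column statistics) is read here;
  // no row-group data is fetched.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::Open(input);
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();

  std::cout << "row groups: " << md->num_row_groups()
            << ", rows: " << md->num_rows() << std::endl;
  return 0;
}
```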
Would also love to see S3 support, which should also be supported by libarrow.
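Along the same lines, a hedged sketch of the equivalent metadata read over S3 via arrow::fs::S3FileSystem; the bucket, object key, and region are placeholders, and it assumes Arrow built with ARROW_S3=ON and credentials picked up from the environment:

```cpp
#include <arrow/filesystem/s3fs.h>
#include <parquet/file_reader.h>
#include <parquet/metadata.h>
#include <iostream>

int main() {
  // The S3 subsystem must be initialised before the first S3FileSystem is created.
  if (!arrow::fs::EnsureS3Initialized().ok()) return 1;

  arrow::fs::S3Options options = arrow::fs::S3Options::Defaults();
  options.region = "us-east-1";  // placeholder region

  auto fs = arrow::fs::S3FileSystem::Make(options).ValueOrDie();

  // S3 paths are given as "bucket/key"; this bucket and object are placeholders.
  auto input = fs->OpenInputFile("my-bucket/warehouse/example.parquet").ValueOrDie();

  // As with HDFS, only the footer is fetched, via ranged (seek-style) reads.
  auto reader = parquet::ParquetFileReader::Open(input);
  std::cout << "columns: " << reader->metadata()->num_columns() << std::endl;

  if (!arrow::fs::FinalizeS3().ok()) return 1;
  return 0;
}
```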