Configure Compression Settings #18

Open
robertoarnetoli opened this issue Jun 11, 2023 · 6 comments

Comments

@robertoarnetoli

I tried target-s3 and was able to change prefix and stream_name_path_override, but the file written to the S3 folder is a compressed ".json.gz". I need an uncompressed file called "data.json" instead.

@crowemi
Owner

crowemi commented Jun 12, 2023

Unfortunately, this is hardcoded at the moment; we need to make this section dynamic. We could do this by adding a compression configuration node (maybe under the format node, or perhaps as its own node 🤔).

Thinking big picture, we want to satisfy the following requirements:

  1. Enable/disable compression
  2. Support multiple compression types (e.g. {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'})
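
The two requirements above could be sketched roughly as follows. This is only an illustration using Python's standard library: the `compression` config key, the `CODECS` table, and the `compress_payload` helper are hypothetical names, not part of target-s3. Only NONE and GZIP are wired up, since the other codecs need third-party packages, and an absent setting is assumed to keep the current gzip behaviour.

```python
import gzip
from typing import Tuple

# Hypothetical mapping from a config value to (codec function, file-name suffix).
# Only codecs available in the standard library are included in this sketch.
CODECS = {
    "NONE": (lambda data: data, ""),
    "GZIP": (gzip.compress, ".gz"),
}


def compress_payload(data: bytes, config: dict) -> Tuple[bytes, str]:
    """Return the (possibly compressed) payload and the suffix to append to the file name."""
    # Assumption: when the setting is absent, keep the existing gzip behaviour.
    name = str(config.get("compression", "GZIP")).upper()
    if name not in CODECS:
        raise ValueError(f"unsupported compression: {name!r}")
    codec, suffix = CODECS[name]
    return codec(data), suffix
```

With a table like this, disabling compression is just another entry (`"NONE"`), and adding SNAPPY or ZSTD later would mean registering one more codec function and suffix.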

crowemi changed the title from "Ability to name the actual file (not just the stream) and to turn off .gz compression" to "Configure Compression Settings" on Jun 12, 2023

@robertoarnetoli
Author

Thank you @crowemi for the fast response.
As for the ".json.gz" file, is there a way to give it a name, like "data.json.gz"? At the moment it is literally ".json.gz", with no name.

@crowemi
Owner

crowemi commented Jun 12, 2023

The problem is here: it doesn't look like we're handling the case where those two config elements aren't set. If it meets your requirements, you should be able to add a file name by setting append_date_to_filename in your config here.
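
One way the unset case might be handled is sketched below, under stated assumptions. `build_file_name` and the `file_name` option are hypothetical and do not exist in target-s3; `append_date_to_filename` is the option mentioned above, but the date format and the "data" fallback are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def build_file_name(config: dict, extension: str = "json",
                    now: Optional[datetime] = None) -> str:
    """Build a file name that is never empty before the extension (hypothetical helper)."""
    parts = []
    # `file_name` is a hypothetical option for a static base name.
    if config.get("file_name"):
        parts.append(config["file_name"])
    # `append_date_to_filename` comes from the thread; the format used here is assumed.
    if config.get("append_date_to_filename"):
        now = now or datetime.now(timezone.utc)
        parts.append(now.strftime("%Y%m%d"))
    # Fall back to a default base name so the result is never just ".json".
    base = "_".join(parts) or "data"
    return f"{base}.{extension}"
```

With neither option set, this sketch yields "data.json" rather than a bare extension, and a static name can be combined with the date stamp when both are configured.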

@robertoarnetoli
Author

ok. I was looking for a static filename rather than a timestamp, but thank you anyway.

@rstml
Contributor

rstml commented Nov 23, 2023

+1 for this.

The current default option (gzip) isn't the best option for Parquet files:

Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core.

https://issues.apache.org/jira/browse/SPARK-14482

@kronnk

kronnk commented Jul 17, 2024

Hi, I would like to work on this issue. Since there are no contribution guidelines, is there anything I should pay attention to?
Thanks!
