Does Carpet buffer the whole parquet data or only a row group? #36
Replies: 10 comments 4 replies
-
Hi @javafanboy
Yes, Carpet supports writing to any OutputStream. I implemented this constructor explicitly to support that use case <https://github.com/jerolba/parquet-carpet/blob/master/carpet-record/src/main/java/com/jerolba/carpet/CarpetWriter.java#L81>. The base Parquet library doesn't support it, so I also implemented a custom OutputFile, OutputStreamOutputFile <https://github.com/jerolba/parquet-carpet/blob/master/carpet-record/src/main/java/com/jerolba/carpet/io/OutputStreamOutputFile.java>.
I haven't tried it, but I think there is no issue writing to an AWS SDK OutputStream.
I'm not an expert on the internal implementation, but because you write records one by one, it cannot flush information column by column and must wait until the last column of the last record of a row group to flush the in-memory buffer. I think this applies in any case: wherever you write, it buffers a complete row group before flushing to disk/stream. You can tune the row group size in the Parquet configuration (exposed by the CarpetWriter builder).
I did the work of removing dependencies by trial and error. You can see it in the gradle file: https://github.com/jerolba/parquet-carpet/blob/master/carpet-record/build.gradle#L20
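For reference, a minimal sketch of that setup. The LogEvent record is hypothetical, and the builder is assumed to surface the underlying Parquet writer's row group size option; treat it as an illustration, not a documented example:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import com.jerolba.carpet.CarpetWriter;

class RowGroupTuningSketch {
    // Hypothetical record type, for illustration only.
    record LogEvent(String node, long timeMillis, String message) {}

    static void write(List<LogEvent> events) throws IOException {
        // Any OutputStream works here: a FileOutputStream, an S3 upload stream, etc.
        try (OutputStream out = Files.newOutputStream(Path.of("events.parquet"));
             CarpetWriter<LogEvent> writer = new CarpetWriter.Builder<>(out, LogEvent.class)
                     // Assumed option mirroring ParquetWriter's row group size: a smaller
                     // row group flushes the in-memory buffer to the stream more often,
                     // at the cost of more per-group overhead in the file.
                     .withRowGroupSize(64L * 1024 * 1024)
                     .build()) {
            writer.write(events);
        }
    }
}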
-
I anyhow want a bunch of records to be written before flushing, to keep performance as good as possible, so flushing at row group level (whose size I can also control) seems quite optimal.
Will give Carpet a try.
Another small thing: does Carpet "hold on" to the object I pass in, or consume it directly so I can reuse it if I want to? (I am trying to make my code as "allocation free" as I can, and I will process MANY small records.) I.e., is the object assumed to be "immutable"?
-
Yes, Carpet doesn't hold the instance and does nothing with the object after writing it. You can reuse the instance in consecutive calls to the write method.
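A minimal sketch of that reuse pattern (the Sample record is hypothetical; the point is only that the same instance can be passed to write repeatedly because Carpet keeps no reference to it):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import com.jerolba.carpet.CarpetWriter;

class ReuseSketch {
    // Hypothetical record type, for illustration only.
    record Sample(long id, String value) {}

    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("samples.parquet");
             CarpetWriter<Sample> writer = new CarpetWriter.Builder<>(out, Sample.class).build()) {
            Sample reusable = new Sample(1L, "x");
            for (int i = 0; i < 1_000; i++) {
                // Carpet consumes the instance inside write() and keeps no reference,
                // so passing the same object on every iteration is safe.
                writer.write(reusable);
            }
        }
    }
}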
-
Thanks for the info Jeronimo!
Really nice and super easy to use library!
Two small things (I run on macOS on an M3):
- Do you know what I need to include to use Brotli compression? When I tried it I got a "missing codec" message.
- Any idea how to get rid of the warning about missing Hadoop native libraries? I know this warning is harmless, as the code still runs, but I would like to run without warnings and also get every ounce of performance I can, so if there are native libraries that can be leveraged I would like to try them...
-
Thanks for the suggestions - I played around a bit more with the codecs but I did not have any luck with Brotli.
After some research I did, however, manage to get Zstd to work (at first it gave a similar error, which was really why I looked at Brotli as possibly the second best thing), so it does not really matter to me any longer whether Brotli works or not!
For my data, Zstd seems to give as short a compression time as LZ4 (i.e. slightly longer than Snappy) but with a compression ratio similar to GZIP (which takes about 3x longer to compress), so this is what I will try.
I query my data mostly with AWS Athena, which at least is SUPPOSED to work with Zstd nowadays. I tried the Parquet files created with Carpet (with Snappy compression) and they worked right away with Athena, even for searching with SQL queries over data in "structured types", so compatibility does not seem to be a problem!
Must say I really like Carpet so far - creating Parquet from Java has never been this easy before. I have for instance used Spark, which is nice for large scale distributed analytics etc., but for just producing some Parquet files it is really massive overkill and takes a long time to get working. Same goes for Arrow, which I also tried a bit...
On Sun, Aug 18, 2024 at 12:35 PM Jeronimo López wrote:
- I've never tried to use Brotli. I see that the implementation codec is not included as a dependency. I found this implementation <https://github.com/rdblue/brotli-codec/tree/master>, but it's 7 years old... In theory you just need to add the dependency to the classpath. If you try it, please share your experience :)
- No, I'm not getting that type of warning (I don't have a Mac). Is it printed with the default configuration or with a concrete compression codec?
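If a Brotli codec implementation jar is on the classpath, selecting it from Carpet should only need the codec name. An untested sketch (the Sample record is hypothetical, and whether the rdblue artifact still provides the codec class Parquet looks up is an open question):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import com.jerolba.carpet.CarpetWriter;

class BrotliSketch {
    // Hypothetical record type, for illustration only.
    record Sample(long id, String value) {}

    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("samples.parquet");
             CarpetWriter<Sample> writer = new CarpetWriter.Builder<>(out, Sample.class)
                     // Parquet resolves a Brotli codec implementation at runtime; this is
                     // where the "missing codec" error appears if none is on the classpath.
                     .withCompressionCodec(CompressionCodecName.BROTLI)
                     .build()) {
            writer.write(List.of(new Sample(1L, "x")));
        }
    }
}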
-
The warning I mentioned seems to come when I use the GZIP codec - I have googled it extensively and followed several suggestions but nothing seems to help. Hopefully I can use Zstd instead of GZIP and then that warning is not an issue either :-)
-
Is it, by the way, possible to set the "compression level" of the codecs using the Carpet API? I noticed when examining the file metadata that level 1 is used for Zstd (which may be fine for fast compression), but I would have liked to try out some higher levels as well...
I looked at the options for the "builder" and did not immediately see anything that "jumped out" at me as related to this...
-
I just tried this but still get Zstd level 1 (at least according to the "pqrs" tool that I use to look at the generated Parquet file). The relevant part of my test program now looks like this - do you see anything that is off? Perhaps this is also some macOS problem...
void testEncoding() throws IOException {
    System.out.println("Generating test data...");
    for (int i = 0; i < 1_000; i++) {
        data.add(generateRandomLogEventRecord());
    }
    System.out.println("Sorting test data...");
    data.sort(Comparator.comparing(LogEventRecord::node).reversed()
            .thenComparing(LogEventRecord::threadId)
            .thenComparing(LogEventRecord::timeMillis)
            .thenComparing(LogEventRecord::timeNanos));
    System.out.println("Generating Parquet...");
    long start = System.currentTimeMillis();
    try (OutputStream outputStream = new FileOutputStream("logevents.parquet")) {
        final PlainParquetConfiguration conf = new PlainParquetConfiguration(
                Map.of(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6"));
        try (CarpetWriter<LogEventRecord> writer = new CarpetWriter.Builder<>(outputStream, LogEventRecord.class)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withCompressionCodec(CompressionCodecName.ZSTD)
                .withConf(conf)
                .build()) {
            writer.write(data);
        }
    }
    long stop = System.currentTimeMillis();
    System.out.println("ZSTD Time taken : " + (stop - start));
}
and a subset of the output from *pqrs* looks like this:
column 1:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "node"
encodings: BIT_PACKED PLAIN_DICTIONARY
file path: N/A
file offset: 430
num of values: 1000
compression: ZSTD(ZstdLevel(1))
-
Good idea - I tried it, and the test showed that setting the level actually DOES work (time increased with level and file size decreased, in this case very marginally, so I did not notice it at first), so pqrs apparently has a bug - I will report it to them!
Thanks for the help!
On Sun, Aug 18, 2024 at 10:18 PM Jeronimo López wrote:
It all looks fine to me... I don't see anything wrong...
Try writing N versions of the file with different levels configured and see whether the result is the same or each file is different. What does pqrs say?
for (int i = 1; i < 22; i++) {
    System.out.println("Generating Parquet...");
    long start = System.currentTimeMillis();
    try (OutputStream outputStream = new FileOutputStream("logevents" + i + ".parquet")) {
        final PlainParquetConfiguration conf = new PlainParquetConfiguration(
                Map.of(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "" + i));
        try (CarpetWriter<LogEventRecord> writer = new CarpetWriter.Builder<>(outputStream, LogEventRecord.class)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withCompressionCodec(CompressionCodecName.ZSTD)
                .withConf(conf)
                .build()) {
            writer.write(data);
        }
    }
    long stop = System.currentTimeMillis();
    System.out.println("ZSTD Time taken : " + (stop - start));
}
-
Hi!
I filed an error report with pqrs and they replied that Zstd does NOT store
the compression level in Parquet files and in this situation they "just
print 1" instead of not printing anything. It will be fixed in an upcoming
release...
-
I am looking for a way, using a normal Java program with a minimum of libraries etc. (not Spark or Arrow, for instance), to create large Parquet objects in S3 without keeping all the data in memory or storing it to a file first, i.e. I would like to write the data to an output stream that uploads to S3. I have a stable implementation of such a stream, and before starting to play around with Carpet I would like to know if it supports the use of "any output stream" and, if so, whether it flushes data after each row group or only after writing the whole dataset including the file metadata (footer).
I started looking at the standard Apache Parquet library, but it only seems to work with "files" and, as is well known, also has the problem with all the dependencies (I have not found any info on exactly what can be excluded in my Maven file, and trial and error seems time consuming)... An option to get around the file limitation is to use memory mapped files, but this also seems quite time consuming given the rather sparse documentation of Hadoop files etc...
Any thoughts or advice is warmly appreciated!
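Per the first reply above, this use case reduces to handing Carpet the custom OutputStream. A minimal sketch, where the Measurement record and the S3-upload stream parameter are stand-ins for the questioner's own types:

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import com.jerolba.carpet.CarpetWriter;

class StreamToS3Sketch {
    // Hypothetical record type, for illustration only.
    record Measurement(String sensor, long timestamp, double value) {}

    // The custom S3-upload OutputStream mentioned in the question stands in here;
    // Carpet treats any OutputStream implementation the same way.
    static void upload(OutputStream s3UploadStream, List<Measurement> measurements) throws IOException {
        try (CarpetWriter<Measurement> writer =
                new CarpetWriter.Builder<>(s3UploadStream, Measurement.class).build()) {
            // Records are buffered in memory per row group and flushed to the stream
            // as row groups complete; the file-level metadata is written on close().
            writer.write(measurements);
        }
    }
}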