Does Carpet buffer the whole parquet data or only a row group? #36
Replies: 10 comments 4 replies
-
Hi @javafanboy
Yes, Carpet supports writing to any OutputStream. I implemented this constructor explicitly to support that use case <https://github.com/jerolba/parquet-carpet/blob/master/carpet-record/src/main/java/com/jerolba/carpet/CarpetWriter.java#L81>. The base Parquet library doesn't support it, so I also implemented a custom OutputFile, OutputStreamOutputFile <https://github.com/jerolba/parquet-carpet/blob/master/carpet-record/src/main/java/com/jerolba/carpet/io/OutputStreamOutputFile.java>.
I haven't tried it, but I think there is no issue writing to an AWS SDK OutputStream.
I'm not an expert on the internal implementation, but because you write records one by one, it cannot flush information column by column and must wait until the last column of the last record of a row group to flush the in-memory buffer. I think this applies in any case: wherever you write, it buffers a complete row group before flushing to disk/stream. You can tune the row group size in the Parquet configuration (exposed by the CarpetWriter builder).
I did the work of removing dependencies by trial and error. You can see it in the gradle file: https://github.com/jerolba/parquet-carpet/blob/master/carpet-record/build.gradle#L20
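For reference, a minimal sketch of that setup. The LogEvent record is hypothetical, and the builder is assumed to surface the underlying Parquet writer's row group size option; treat it as an illustration, not a documented example:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import com.jerolba.carpet.CarpetWriter;

class RowGroupTuningSketch {
    // Hypothetical record type, for illustration only.
    record LogEvent(String node, long timeMillis, String message) {}

    static void write(List<LogEvent> events) throws IOException {
        // Any OutputStream works here: a FileOutputStream, an S3 upload stream, etc.
        try (OutputStream out = Files.newOutputStream(Path.of("events.parquet"));
             CarpetWriter<LogEvent> writer = new CarpetWriter.Builder<>(out, LogEvent.class)
                     // Assumed option mirroring ParquetWriter's row group size: a smaller
                     // row group flushes the in-memory buffer to the stream more often,
                     // at the cost of more per-group overhead in the file.
                     .withRowGroupSize(64L * 1024 * 1024)
                     .build()) {
            writer.write(events);
        }
    }
}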
-
I anyhow want a bunch of records to be written before flushing, to keep performance as good as possible, so flushing at row group level (whose size I can also control) seems quite optimal.
Will give Carpet a try.
Another small thing: does Carpet "hold on" to the object I pass in, or consume it directly so I can reuse it if I want to? (I am trying to make my code as "allocation free" as I can, and I will process MANY small records.) I.e., is the object assumed to be "immutable"?
-
Yes, Carpet doesn't hold the instance and does nothing with the object after writing it. You can reuse the instance in consecutive calls to the write method.
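A minimal sketch of that reuse pattern (the Sample record is hypothetical; the point is only that the same instance can be passed to write repeatedly because Carpet keeps no reference to it):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import com.jerolba.carpet.CarpetWriter;

class ReuseSketch {
    // Hypothetical record type, for illustration only.
    record Sample(long id, String value) {}

    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("samples.parquet");
             CarpetWriter<Sample> writer = new CarpetWriter.Builder<>(out, Sample.class).build()) {
            Sample reusable = new Sample(1L, "x");
            for (int i = 0; i < 1_000; i++) {
                // Carpet consumes the instance inside write() and keeps no reference,
                // so passing the same object on every iteration is safe.
                writer.write(reusable);
            }
        }
    }
}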
-
Thanks for the info Jeronimo!
Really nice and super easy to use library!
Two small things (I run on macOS on an M3):
- Do you know what I need to include to use Brotli compression? When I tried it I got a "missing codec" message.
- Any idea how to get rid of the warning about missing Hadoop native libraries? I know this warning is harmless, as the code still runs, but I would like to run without warnings and also get every ounce of performance I can, so if there are native libraries that can be leveraged I would like to try them...
-
Thanks for the suggestions - I played around a bit more with the codecs but I did not have any luck with Brotli.
After some research I did, however, manage to get Zstd to work (at first it gave a similar error, which was really why I looked at Brotli as possibly the second best thing), so it does not really matter to me any longer whether Brotli works or not!
For my data, Zstd seems to give as short a compression time as LZ4 (i.e. slightly longer than Snappy) but with a compression ratio similar to GZIP (which takes about 3x longer to compress), so this is what I will try.
I query my data mostly with AWS Athena, which at least is SUPPOSED to work with Zstd nowadays. I tried the Parquet files created with Carpet (with Snappy compression) and they worked right away with Athena, even for searching with SQL queries over data in "structured types", so compatibility does not seem to be a problem!
Must say I really like Carpet so far - creating Parquet from Java has never been this easy before. I have for instance used Spark, which is nice for large scale distributed analytics etc., but for just producing some Parquet files it is really massive overkill and takes a long time to get working. Same goes for Arrow, which I also tried a bit...
On Sun, Aug 18, 2024 at 12:35 PM Jeronimo López wrote:
- I've never tried to use Brotli. I see that the implementation codec is not included as a dependency. I found this implementation <https://github.com/rdblue/brotli-codec/tree/master>, but it's 7 years old... In theory you just need to add the dependency to the classpath. If you try it, please share your experience :)
- No, I'm not getting that type of warning (I don't have a Mac). Is it printed with the default configuration or with a concrete compression codec?
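If a Brotli codec implementation jar is on the classpath, selecting it from Carpet should only need the codec name. An untested sketch (the Sample record is hypothetical, and whether the rdblue artifact still provides the codec class Parquet looks up is an open question):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import com.jerolba.carpet.CarpetWriter;

class BrotliSketch {
    // Hypothetical record type, for illustration only.
    record Sample(long id, String value) {}

    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("samples.parquet");
             CarpetWriter<Sample> writer = new CarpetWriter.Builder<>(out, Sample.class)
                     // Parquet resolves a Brotli codec implementation at runtime; this is
                     // where the "missing codec" error appears if none is on the classpath.
                     .withCompressionCodec(CompressionCodecName.BROTLI)
                     .build()) {
            writer.write(List.of(new Sample(1L, "x")));
        }
    }
}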
-
The warning I mentioned seems to come when I use the GZIP codec - I have googled it extensively and followed several suggestions but nothing seems to help. Hopefully I can use Zstd instead of GZIP and then that warning is not an issue either :-)
-
Is it, by the way, possible to set the "compression level" of the codecs using the Carpet API? I noticed when examining the file metadata that level 1 is used for Zstd (which may be fine for fast compression), but I would have liked to try out some higher levels as well...
I looked at the options for the "builder" and did not immediately see anything that "jumped out" at me as related to this...
-
I just tried this but still get Zstd level 1 (at least according to the "pqrs" tool that I use to look at the generated Parquet file). The relevant part of my test program now looks like this - do you see anything that is off? Perhaps this is also some macOS problem...
void testEncoding() throws IOException {
    System.out.println("Generating test data...");
    for (int i = 0; i < 1_000; i++) {
        data.add(generateRandomLogEventRecord());
    }
    System.out.println("Sorting test data...");
    data.sort(Comparator.comparing(LogEventRecord::node).reversed()
            .thenComparing(LogEventRecord::threadId)
            .thenComparing(LogEventRecord::timeMillis)
            .thenComparing(LogEventRecord::timeNanos));
    System.out.println("Generating Parquet...");
    long start = System.currentTimeMillis();
    try (OutputStream outputStream = new FileOutputStream("logevents.parquet")) {
        final PlainParquetConfiguration conf = new PlainParquetConfiguration(
                Map.of(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6"));
        try (CarpetWriter<LogEventRecord> writer = new CarpetWriter.Builder<>(outputStream, LogEventRecord.class)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withCompressionCodec(CompressionCodecName.ZSTD)
                .withConf(conf)
                .build()) {
            writer.write(data);
        }
    }
    long stop = System.currentTimeMillis();
    System.out.println("ZSTD Time taken : " + (stop - start));
}
and a subset of the output from *pqrs* looks like this:
column 1:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "node"
encodings: BIT_PACKED PLAIN_DICTIONARY
file path: N/A
file offset: 430
num of values: 1000
compression: ZSTD(ZstdLevel(1))
-
Good idea - I tried it, and the test showed that setting the level actually DOES work (time increased with level and file size decreased, in this case very marginally, so I did not notice it at first), so pqrs apparently has a bug - I will report it to them!
Thanks for the help!
On Sun, Aug 18, 2024 at 10:18 PM Jeronimo López wrote:
It all looks fine to me... I don't see anything wrong...
Try writing N versions of the file with different levels configured and see whether the result is the same or each file is different. What does pqrs say?
for (int i = 1; i < 22; i++) {
    System.out.println("Generating Parquet...");
    long start = System.currentTimeMillis();
    try (OutputStream outputStream = new FileOutputStream("logevents" + i + ".parquet")) {
        final PlainParquetConfiguration conf = new PlainParquetConfiguration(
                Map.of(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "" + i));
        try (CarpetWriter<LogEventRecord> writer = new CarpetWriter.Builder<>(outputStream, LogEventRecord.class)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withCompressionCodec(CompressionCodecName.ZSTD)
                .withConf(conf)
                .build()) {
            writer.write(data);
        }
    }
    long stop = System.currentTimeMillis();
    System.out.println("ZSTD Time taken : " + (stop - start));
}
-
Hi!
I filed an error report with pqrs and they replied that Zstd does NOT store
the compression level in Parquet files and in this situation they "just
print 1" instead of not printing anything. It will be fixed in an upcoming
release...
-
I am looking for a way, using a normal Java program with a minimum of libraries etc. (not Spark or Arrow, for instance), to create large Parquet objects in S3 without keeping all the data in memory or storing it to a file first, i.e. I would like to write the data to an output stream that uploads to S3. I have a stable implementation of such a stream, and before starting to play around with Carpet I would like to know if it supports the use of "any output stream" and, if so, whether it flushes data after each row group or only after writing the whole dataset including the file metadata (footer).
I started looking at the standard Apache Parquet library, but it only seems to work with "files" and, as is well known, also has the problem with all the dependencies (I have not found any info on exactly what can be excluded in my Maven file, and trial and error seems time consuming)... An option to get around the file limitation is to use memory mapped files, but this also seems quite time consuming given the rather sparse documentation of Hadoop files etc...
Any thoughts or advice is warmly appreciated!
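Per the first reply above, this use case reduces to handing Carpet the custom OutputStream. A minimal sketch, where the Measurement record and the S3-upload stream parameter are stand-ins for the questioner's own types:

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import com.jerolba.carpet.CarpetWriter;

class StreamToS3Sketch {
    // Hypothetical record type, for illustration only.
    record Measurement(String sensor, long timestamp, double value) {}

    // The custom S3-upload OutputStream mentioned in the question stands in here;
    // Carpet treats any OutputStream implementation the same way.
    static void upload(OutputStream s3UploadStream, List<Measurement> measurements) throws IOException {
        try (CarpetWriter<Measurement> writer =
                new CarpetWriter.Builder<>(s3UploadStream, Measurement.class).build()) {
            // Records are buffered in memory per row group and flushed to the stream
            // as row groups complete; the file-level metadata is written on close().
            writer.write(measurements);
        }
    }
}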