Unable to use Zstd compression with parquet #764

Open
kartik18 opened this issue Jul 3, 2024 · 0 comments

Hi Everyone,

I'm using the cp-kafka-connect-base:7.50 Docker image with the kafka-connect-s3:10.5.13 plugin installed inside it. Writing data from a Kafka topic to S3 in Parquet format works fine.
When I try to compress the output with the configuration parquet.codec: zstd, the task fails with the stack trace below.
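For context, here is a minimal sketch of the sink config in question (connector name, topic, bucket, region, and flush size are illustrative placeholders, not my exact values):

    {
      "name": "s3-parquet-sink",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "parquet.codec": "zstd",
        "flush.size": "1000"
      }
    }

The same config without parquet.codec writes Parquet files without any problem; adding zstd produces the following stack trace.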
"org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:618)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:336)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:237)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:206)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)\n\tat org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.\n\tat org.apache.hadoop.io.compress.ZStandardCodec.checkNativeCodeLoaded(ZStandardCodec.java:65)\n\tat org.apache.hadoop.io.compress.ZStandardCodec.getCompressorType(ZStandardCodec.java:153)\n\tat org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)\n\tat org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)\n\tat org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:144)\n\tat org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)\n\tat org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)\n\tat org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:287)\n\tat org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:564)\n\tat io.confluent.connect.s3.format.parquet.ParquetRecordWriterProvider$1.write(ParquetRecordWriterProvider.java:102)\n\tat io.confluent.connect.s3.format.S3RetriableRecordWriter.write(S3RetriableRecordWriter.java:51)\n\tat io.confluent.connect.s3.format.KeyValueHeaderRecordWriterProvider$1.write(KeyValueHeaderRecordWriterProvider.java:114)\n\tat io.confluent.connect.s3.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:592)\n\tat io.confluent.connect.s3.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:327)\n\tat io.confluent.connect.s3.TopicPartitionWriter.executeState(TopicPartitionWriter.java:267)\n\tat io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:218)\n\tat io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:244)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:587)\n\t...

I see a related issue was raised earlier: #570

Questions:

  1. The pom.xml says Hadoop version 3.3.6 is used, and that version's libhadoop.so.1.0.0 is already built with zstd support. Why, then, are we facing this issue with the latest version?
  2. Even after replicating the solution mentioned by one of the users (github-louis-fruleux), I'm still facing the same issue. What I did:
  • Downloaded https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
  • Extracted it and copied libhadoop.so.1.0.0 into the Docker image
  • Built the image from the following Dockerfile (a variant I'm considering is sketched after this list):

    FROM confluentinc/cp-kafka-connect-base:7.50
    ENV CONNECT_PLUGIN_PATH=/usr/share/java,/usr/share/confluent-hub-components
    RUN confluent-hub install --no-prompt confluentinc/kafka-connect-s3:10.5.13
    COPY libhadoop.so.1.0.0 /usr/lib64/libhadoop.so.1.0.0
    COPY scripts/libhadoop.so.1.0.0 /usr/lib64/libhadoop.so
    RUN ls -ltr /usr/lib64/
    ENV KAFKA_OPTS="${KAFKA_OPTS} -Djava.library.path=/usr/lib64/"
    RUN echo "KAFKA_OPTS value: $KAFKA_OPTS"
    CMD ["/etc/confluent/docker/run"]
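The variant I'm considering next (untested): if the Apache tarball's libhadoop.so.1.0.0 needs the libzstd runtime, maybe the base image simply doesn't ship it, and the dynamic loader path may also need to include /usr/lib64 in addition to java.library.path. A sketch, assuming the base image is UBI-based with yum available, that "libzstd" is the right package name, and that the image's default user is appuser (all assumptions on my part):

    FROM confluentinc/cp-kafka-connect-base:7.50

    ENV CONNECT_PLUGIN_PATH=/usr/share/java,/usr/share/confluent-hub-components
    RUN confluent-hub install --no-prompt confluentinc/kafka-connect-s3:10.5.13

    # assumption: the native zstd codec needs the libzstd runtime present in the image
    USER root
    RUN yum install -y libzstd && yum clean all

    COPY libhadoop.so.1.0.0 /usr/lib64/libhadoop.so.1.0.0
    # symlink instead of a second COPY, so both names point at one file
    RUN ln -sf libhadoop.so.1.0.0 /usr/lib64/libhadoop.so

    # expose /usr/lib64 to both the JVM and the dynamic loader
    ENV KAFKA_OPTS="${KAFKA_OPTS} -Djava.library.path=/usr/lib64"
    ENV LD_LIBRARY_PATH="/usr/lib64:${LD_LIBRARY_PATH}"

    USER appuser
    CMD ["/etc/confluent/docker/run"]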

How should I fix this issue?
