Unable to use Zstd compression with parquet #764

Open
kartik18 opened this issue Jul 3, 2024 · 0 comments

Hi Everyone,

I'm using the cp-kafka-connect-base:7.50 Docker image with the kafka-connect-s3:10.5.13 plugin installed inside it. Writing data from a Kafka topic to S3 in Parquet format works fine.
When I try to compress the output with the configuration parquet.codec: zstd, the task fails with the stack trace below.
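For context, here is a minimal sketch of the sink config in question (connector name, topic, bucket, region, and flush size are illustrative placeholders, not my exact values):

    {
      "name": "s3-parquet-sink",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "parquet.codec": "zstd",
        "flush.size": "1000"
      }
    }

The same config without parquet.codec writes Parquet files without any problem; adding zstd produces the following stack trace.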
"org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:618)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:336)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:237)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:206)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)\n\tat org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.\n\tat org.apache.hadoop.io.compress.ZStandardCodec.checkNativeCodeLoaded(ZStandardCodec.java:65)\n\tat org.apache.hadoop.io.compress.ZStandardCodec.getCompressorType(ZStandardCodec.java:153)\n\tat org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)\n\tat org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)\n\tat org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:144)\n\tat org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)\n\tat org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)\n\tat org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:287)\n\tat org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:564)\n\tat io.confluent.connect.s3.format.parquet.ParquetRecordWriterProvider$1.write(ParquetRecordWriterProvider.java:102)\n\tat io.confluent.connect.s3.format.S3RetriableRecordWriter.write(S3RetriableRecordWriter.java:51)\n\tat io.confluent.connect.s3.format.KeyValueHeaderRecordWriterProvider$1.write(KeyValueHeaderRecordWriterProvider.java:114)\n\tat io.confluent.connect.s3.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:592)\n\tat io.confluent.connect.s3.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:327)\n\tat io.confluent.connect.s3.TopicPartitionWriter.executeState(TopicPartitionWriter.java:267)\n\tat io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:218)\n\tat io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:244)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:587)\n\t...

I see a related issue was raised earlier: #570

Questions:

  1. The pom.xml says Hadoop version 3.3.6 is used, and that version's libhadoop.so.1.0.0 is already built with zstd support. Why, then, are we facing this issue with the latest version?
  2. Even after replicating the solution mentioned by one of the users (github-louis-fruleux), I'm still facing the same issue. What I did:
  • Downloaded https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
  • Extracted it and copied libhadoop.so.1.0.0 into the Docker image
  • Built the image from the following Dockerfile (a variant I'm considering is sketched after this list):

    FROM confluentinc/cp-kafka-connect-base:7.50
    ENV CONNECT_PLUGIN_PATH=/usr/share/java,/usr/share/confluent-hub-components
    RUN confluent-hub install --no-prompt confluentinc/kafka-connect-s3:10.5.13
    COPY libhadoop.so.1.0.0 /usr/lib64/libhadoop.so.1.0.0
    COPY scripts/libhadoop.so.1.0.0 /usr/lib64/libhadoop.so
    RUN ls -ltr /usr/lib64/
    ENV KAFKA_OPTS="${KAFKA_OPTS} -Djava.library.path=/usr/lib64/"
    RUN echo "KAFKA_OPTS value: $KAFKA_OPTS"
    CMD ["/etc/confluent/docker/run"]
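The variant I'm considering next (untested): if the Apache tarball's libhadoop.so.1.0.0 needs the libzstd runtime, maybe the base image simply doesn't ship it, and the dynamic loader path may also need to include /usr/lib64 in addition to java.library.path. A sketch, assuming the base image is UBI-based with yum available, that "libzstd" is the right package name, and that the image's default user is appuser (all assumptions on my part):

    FROM confluentinc/cp-kafka-connect-base:7.50

    ENV CONNECT_PLUGIN_PATH=/usr/share/java,/usr/share/confluent-hub-components
    RUN confluent-hub install --no-prompt confluentinc/kafka-connect-s3:10.5.13

    # assumption: the native zstd codec needs the libzstd runtime present in the image
    USER root
    RUN yum install -y libzstd && yum clean all

    COPY libhadoop.so.1.0.0 /usr/lib64/libhadoop.so.1.0.0
    # symlink instead of a second COPY, so both names point at one file
    RUN ln -sf libhadoop.so.1.0.0 /usr/lib64/libhadoop.so

    # expose /usr/lib64 to both the JVM and the dynamic loader
    ENV KAFKA_OPTS="${KAFKA_OPTS} -Djava.library.path=/usr/lib64"
    ENV LD_LIBRARY_PATH="/usr/lib64:${LD_LIBRARY_PATH}"

    USER appuser
    CMD ["/etc/confluent/docker/run"]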

How should I fix this issue?
