
GH-3080: HadoopStreams to support ByteBufferPositionedReadable #3096

Open. Wants to merge 3 commits into base: master.
Conversation

steveloughran (Contributor)

Rationale for this change

If a stream declares in its StreamCapabilities that it supports
ByteBufferPositionedReadable, then use that API for
readFully(ByteBuffer)

ByteBufferPositionedReadable.readFully(long position, ByteBuffer buf)

Adding support for Hadoop ByteBufferPositionedReadable streams may improve performance
by pushing retry/recovery logic into the filesystem client library.

This interface is implemented by the HDFS input stream; we are considering adding
it elsewhere.
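A minimal sketch of the positioned-read contract this PR relies on: fill the buffer completely from an absolute file position, without moving the stream's own cursor. The interface and class below are local stand-ins for illustration, not Hadoop's actual classes.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for Hadoop's ByteBufferPositionedReadable contract.
interface PositionedBufferReadable {
    void readFully(long position, ByteBuffer buf) throws IOException;
}

public class PreadDemo implements PositionedBufferReadable {
    private final byte[] data;

    public PreadDemo(byte[] data) {
        this.data = data;
    }

    @Override
    public void readFully(long position, ByteBuffer buf) throws IOException {
        int needed = buf.remaining();
        // The contract is all-or-nothing: fail if the range runs past EOF.
        if (position < 0 || position + needed > data.length) {
            throw new EOFException("cannot read " + needed + " bytes at " + position);
        }
        buf.put(data, (int) position, needed); // fills the buffer completely
    }

    public static void main(String[] args) throws IOException {
        PreadDemo stream = new PreadDemo("hello, parquet".getBytes(StandardCharsets.UTF_8));
        ByteBuffer buf = ByteBuffer.allocate(7);
        stream.readFully(7, buf); // 7 bytes starting at offset 7
        buf.flip();
        System.out.println(StandardCharsets.UTF_8.decode(buf)); // parquet
    }
}
```

The real HDFS implementation handles retries and block failover inside this call, which is where the performance and robustness benefit comes from.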

What changes are included in this PR?

  • New SeekableInputStream implementation: H3ByteBufferInputStream
  • Instantiated in HadoopStreams if the FSDataInputStream is considered suitable.
  • Tests for the new behavior and that no regressions are caused.

Class H3ByteBufferInputStream

The reading is done in a new class, H3ByteBufferInputStream, which subclasses H2SeekableInputStream. This reduces the amount of duplicate code, at the cost of a slightly unclean class hierarchy.

The purist way to do it would be to create an abstract superclass HadoopInputStream to hold all commonality between the three input streams.

I'm happy to do this; I just didn't want to do a larger refactoring without (a) showing that the core design worked and (b) getting permission to do it. Should I do this?
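The refactor proposed above could look roughly like the sketch below. Only the three stream class names come from the PR; the abstract base's contents are illustrative, not the actual Parquet code.

```java
// Sketch of the suggested hierarchy: a shared abstract base holding the
// wrapped stream and common plumbing, with each subclass adding only the
// read path it improves. Everything except the class names mentioned in
// the PR is illustrative.
abstract class HadoopInputStream {
    // common: wrapped FSDataInputStream, seek(), getPos(), close(), ...
}

class H1SeekableInputStream extends HadoopInputStream {
    // byte[]-based reads only
}

class H2SeekableInputStream extends HadoopInputStream {
    // adds ByteBufferReadable-based read(ByteBuffer)
}

class H3ByteBufferInputStream extends H2SeekableInputStream {
    // adds ByteBufferPositionedReadable-based readFully(long, ByteBuffer)
}
```

Under this shape, H3 still inherits the ByteBuffer read path from H2, so the current PR's subclassing survives the refactor unchanged.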

HadoopStreams changes

Selection of the new input stream happens if and only if the stream declares the capability "in:preadbytebuffer".
There is no equivalent of isWrappedStreamByteBufferReadable(), which recurses through
a chain of wrapped streams looking for the API.
If a stream doesn't declare its support for the API, it won't be picked up.
This is done knowing that the sole production implementation which currently exists,
the HDFS input stream, does declare this capability.
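The selection logic described above can be sketched as a simple capability probe. StreamLike and the selector below are stand-ins, not the real HadoopStreams code; the capability strings are the Hadoop StreamCapabilities constants the text refers to.

```java
// Illustrative sketch of capability-based selection: ask the stream what it
// declares, instead of reflectively unwrapping inner streams.
interface StreamLike {
    boolean hasCapability(String capability);
}

public class StreamSelector {
    static final String READBYTEBUFFER = "in:readbytebuffer";
    static final String PREADBYTEBUFFER = "in:preadbytebuffer";

    static String selectWrapper(StreamLike stream) {
        if (stream.hasCapability(PREADBYTEBUFFER)) {
            return "H3ByteBufferInputStream"; // positioned ByteBuffer reads
        } else if (stream.hasCapability(READBYTEBUFFER)) {
            return "H2SeekableInputStream";   // ByteBuffer reads
        } else {
            return "H1SeekableInputStream";   // byte[] reads only
        }
    }

    public static void main(String[] args) {
        StreamLike hdfsLike = cap -> true;  // declares every capability
        StreamLike plain = cap -> false;    // declares none
        System.out.println(selectWrapper(hdfsLike)); // H3ByteBufferInputStream
        System.out.println(selectWrapper(plain));    // H1SeekableInputStream
    }
}
```

A wrapper stream that forwards reads but not hasCapability() would therefore fall back to the H2 or H1 path, which is the trade-off the PR accepts.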

Are these changes tested?

There is a new test suite, covering the new behavior and ensuring that the integration with
HadoopStreams retains the correct behavior for existing streams.
The suite is parameterized on heap and direct buffers.
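Parameterizing over heap and direct buffers matters because the two kinds take different code paths (direct buffers have no backing array). A minimal illustration of the pattern, with an illustrative helper rather than the actual suite:

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.function.IntFunction;

// Sketch of running the same read assertions against both buffer kinds,
// as the new suite does. checkFill is illustrative, not the real test code.
public class BufferParamDemo {
    static void checkFill(IntFunction<ByteBuffer> allocator) {
        ByteBuffer buf = allocator.apply(4);
        byte[] src = {1, 2, 3, 4};
        buf.put(src);
        buf.flip();
        for (byte expected : src) {
            if (buf.get() != expected) {
                throw new AssertionError("mismatch for " + (buf.isDirect() ? "direct" : "heap"));
            }
        }
    }

    public static void main(String[] args) {
        // the two parameterizations: on-heap and off-heap (direct)
        List<IntFunction<ByteBuffer>> allocators =
                List.of(ByteBuffer::allocate, ByteBuffer::allocateDirect);
        for (IntFunction<ByteBuffer> alloc : allocators) {
            checkFill(alloc);
        }
        System.out.println("both buffer kinds pass");
    }
}
```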

Are there any user-facing changes?

No

Closes GH-3080

Based off the H2 stream test suite, but
* parameterized for on/off heap
* expects no changes in buffer contents on out-of-range reads.

Still one test failure.
* changing how stream capabilities are set up and queried,
  makes it easy to generate streams with different declared
  behaviours.
* pull out common assertions
* lots of javadoc of what each test case is trying to do.

+ all the tests are happy.
@wgtmac (Member) left a comment:

Thanks for adding this! I've left some comments.


- if (isByteBufferReadable) {
+ if (stream.hasCapability(READBYTEBUFFER)) {

Member comment:

We can avoid any reflection here because of the Hadoop version bump?

* @param stream stream to probe
* @return A H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
*/
private static Function<FSDataInputStream, SeekableInputStream> unwrapByteBufferReadableLegacy(

Member comment:

Is there any behavior change of a wrapped stream after removing unwrapByteBufferReadableLegacy?

@@ -100,6 +100,10 @@ public void readVectored(List<ParquetFileRange> ranges, ByteBufferAllocator allo
VectorIoBridge.instance().readVectoredRanges(stream, ranges, allocator);
}

protected Reader getReader() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding this but not used elsewhere?

* @param buf a byte buffer to fill with data from the stream
* @return number of bytes read.
*
* @throws EOFException the buffer length is greater than the file length

Member comment:

Add EOFException to the method signature?

* {@code ByteBufferPositionedReadable.readFully()}.
* <p>This is implemented by HDFS and possibly other clients,
*/
class H3ByteBufferInputStream extends H2SeekableInputStream {

Member comment:

I'm fine with the inheritance to reduce code duplication.

@steveloughran (Contributor, Author)

I'm away until 2025; will reply to comments then. Thanks for the review.

Merging this pull request may close: HadoopStreams to support ByteBufferPositionedReadable input streams (GH-3080).