[C++] Order of row index streams does not match the order of streams in the file footer #1475

vuule · 2023-04-21T21:17:17Z

When writing a file with a string column and multiple row groups, the resulting file has incorrect row index streams.
The string column is encoded using direct encoding. The file footer contains the LENGTH (kind 2) stream before DATA (kind 1) stream. However, the row index seems to contain the index data for the DATA stream before the LENGTH stream. Switching out the order in which we read the row index streams fixes the issue and everything can be used correctly.

Isolation:
Only observing this behavior with string columns. Other types with multiple streams look correct in this regard.
Behavior looks unrelated to string content in the column.
No info on dictionary-encoded string columns; the writer seemingly defaults to direct encoding.

See attached repro file. The file contains a single string column, with ["*"] * 10001
10001_strings.zip

vuule · 2023-04-21T21:18:46Z

Logs that point to incorrect order: rapidsai/cudf#11890 (comment)

wgtmac · 2023-04-23T15:24:54Z

Thanks for reporting the issue! @vuule

The order of data streams is NOT FIXED, meaning that:

In a direct-encoded string column, the DATA stream can be placed BEFORE or AFTER the LENGTH stream. The PRESENT stream has the same flexibility.
Even data streams of different columns can be interleaved.

However, the order of positions in an index stream is FIXED. So for a direct-encoded string column, its INDEX stream always puts positions in this order: PRESENT stream (if it exists), DATA stream, and LENGTH stream.
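
To illustrate the rule above, here is a minimal sketch of how a reader might decide which stream each group of row index positions belongs to. It is not taken from the ORC code base: the class `DirectStringIndexOrder`, the enum `Kind`, and the method `indexOrder` are hypothetical names used only for illustration, and the sketch covers only a direct-encoded string column. How many position values each stream consumes is outside its scope.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the ORC API. */
final class DirectStringIndexOrder {
  /** Stream kinds that a direct-encoded string column can have. */
  enum Kind { PRESENT, DATA, LENGTH }

  /** Fixed order in which row index positions are recorded for such a column. */
  private static final Kind[] INDEX_ORDER = { Kind.PRESENT, Kind.DATA, Kind.LENGTH };

  /**
   * Returns the order in which the row index positions should be consumed.
   * The footer order only tells us which streams exist; it does not
   * influence the order of the result.
   */
  static List<Kind> indexOrder(List<Kind> footerOrder) {
    List<Kind> result = new ArrayList<>();
    for (Kind kind : INDEX_ORDER) {
      if (footerOrder.contains(kind)) {
        result.add(kind);
      }
    }
    return result;
  }
}
```

For the repro file in this issue, the footer lists LENGTH before DATA, so `indexOrder(List.of(Kind.LENGTH, Kind.DATA))` yields DATA followed by LENGTH, which is the order a reader should use when walking the row index positions.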

I checked the specs and it does not state this clearly. It would be a good time to document this as well. @deshanxiao

deshanxiao · 2023-04-24T02:38:41Z

Thank you @wgtmac , I will add these to #1465.

wgtmac · 2023-04-24T02:39:45Z

> Thank you @wgtmac , I will add these to #1465.

Thanks @deshanxiao ! Could you also verify that the java implementation matches this behavior?

vuule · 2023-04-24T17:26:43Z

Thank you for the clarification @wgtmac !
What about other cases when a column has multiple streams? Is the order of index streams always the same as in the tables at https://orc.apache.org/specification/ORCv1/?

wgtmac · 2023-04-25T01:41:06Z

Yes, the order is fixed. This is implemented in the recordPosition call as below.

In TreeWriterBase.java, the positions of the PRESENT stream are recorded first.

orc/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java

Lines 369 to 377 in 792c3f8

    /**
     * Record the current position in each of this column's streams.
     * @param recorder where should the locations be recorded
     */
    void recordPosition(PositionRecorder recorder) throws IOException {
      if (isPresent != null) {
        isPresent.getPosition(recorder);
      }
    }

Then, in StringBaseTreeWriter.java, the positions of the DATA stream and the LENGTH stream are recorded, in that order.

orc/java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java

Lines 265 to 270 in 9dbf833

    private void recordDirectStreamPosition() throws IOException {
      if (rowIndexPosition != null) {
        directStreamOutput.getPosition(rowIndexPosition);
        lengthOutput.getPosition(rowIndexPosition);
      }
    }

I followed the same order when I was implementing the C++ writer so they should be consistent.

wgtmac · 2023-04-25T01:42:21Z

> Thank you for the clarification @wgtmac ! What about other cases when a column has multiple streams? Is the order of index streams always the same as in the tables at https://orc.apache.org/specification/ORCv1/?

IIRC, the order is the same as in the tables of the spec doc.

deshanxiao · 2023-04-25T13:50:04Z

Thank you for sharing the Java code. I double-checked it and you are right, @wgtmac.

> In a direct-encoded string columns, DATA stream can be placed BEFORE or AFTER LENGTH stream. Same flexibility for PRESENT stream.

In fact, the implementations in different languages currently write the streams in different orders. In Java, the order in which streams are flushed to disk depends on the compareTo method.

> Even data streams of different columns can be interleaved.

Do you mean that the streams of different columns can cross, like this:
col1 streamtype1
col2 streamtype1
col1 streamtype2

I notice that the streams of the same column appear together, but the order of the streams of different columns is uncertain, even when the columns have the same data type.

deshanxiao · 2023-04-25T13:51:54Z

BTW, is it necessary for us to add a type list to IndexEntry to describe the type of each position? @wgtmac @dongjoon-hyun @guiyanakuang

wgtmac · 2023-04-25T14:16:44Z

Yes, that would help a lot.
