Documentation fix: example in parquet/src/column/mod.rs is incorrect #5560

Closed
zgershkoff opened this issue Mar 27, 2024 · 1 comment · Fixed by #5561
Labels
bug parquet Changes to the parquet crate

Comments

@zgershkoff

Describe the bug

If I try to compile and run the example given in the documentation for parquet::column, the assertions at the end fail.

To Reproduce

Build and run (cargo run) with the following in main.rs:

use std::fs;
use parquet::column::reader::ColumnReader;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use parquet::data_type::Int32Type;
use parquet::file::writer::SerializedFileWriter;
use std::sync::Arc;
use std::path::Path;
use parquet::schema::parser::parse_message_type;


fn main() {
    let path = Path::new("column_sample.parquet");

    // Writing data using column writer API.

    let message_type = "
      message schema {
        optional group values (LIST) {
          repeated group list {
            optional INT32 element;
          }
        }
      }
    ";
    let schema = Arc::new(parse_message_type(message_type).unwrap());
    let file = fs::File::create(path).unwrap();
    let mut writer = SerializedFileWriter::new(file, schema, Default::default()).unwrap();

    let mut row_group_writer = writer.next_row_group().unwrap();
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        col_writer
            .typed::<Int32Type>()
            .write_batch(&[1, 2, 3], Some(&[3, 3, 3, 2, 2]), Some(&[0, 1, 0, 1, 1]))
            .unwrap();
        col_writer.close().unwrap();
    }
    row_group_writer.close().unwrap();

    writer.close().unwrap();

    // Reading data using column reader API.

    let file = fs::File::open(path).unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();

    let mut values = vec![0; 8];
    let mut def_levels = vec![0; 8];
    let mut rep_levels = vec![0; 8];

    for i in 0..metadata.num_row_groups() {
        let row_group_reader = reader.get_row_group(i).unwrap();
        let row_group_metadata = metadata.row_group(i);

        for j in 0..row_group_metadata.num_columns() {
            let mut column_reader = row_group_reader.get_column_reader(j).unwrap();
            match column_reader {
                // You can also use `get_typed_column_reader` method to extract typed reader.
                ColumnReader::Int32ColumnReader(ref mut typed_reader) => {
                    let (records, values, levels) = typed_reader.read_records(
                        8, // maximum records to read
                        Some(&mut def_levels),
                        Some(&mut rep_levels),
                        &mut values,
                    ).unwrap();
                    assert_eq!(records, 2);
                    assert_eq!(levels, 5);
                    assert_eq!(values, 3);
                }
                _ => {}
            }
        }
    }

    assert_eq!(values, vec![1, 2, 3, 0, 0, 0, 0, 0]);
    assert_eq!(def_levels, vec![3, 3, 3, 2, 2, 0, 0, 0]);
    assert_eq!(rep_levels, vec![0, 1, 0, 1, 1, 0, 0, 0]);
}

Expected behavior

The assertions should be correct. I'm surprised that the examples in the documentation don't compile and run as part of the test suite.

Additional context

The line at https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader/decoder.rs#L236 appears to be the culprit: it resizes the values vector and then passes only the newly added part of the vector to self.decoder.as_mut().unwrap().read(), so decoded values end up after the existing elements instead of overwriting them. This was changed 3 months ago as part of #5177.
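
To illustrate the suspected behavior without depending on the parquet crate, here is a minimal sketch. The function append_decoded is a hypothetical stand-in, not the actual decoder code: it assumes the decoder grows the output vector and writes only into the newly added tail, as described above.

```rust
/// Hypothetical stand-in for the decoder step described above (not the real
/// parquet API): grow `out`, then write only into the newly added tail.
fn append_decoded(out: &mut Vec<i32>, decoded: &[i32]) -> usize {
    let start = out.len();
    // Resize first, so existing elements are left untouched...
    out.resize(start + decoded.len(), 0);
    // ...then hand only the new part of the vector to the "decoder".
    out[start..].copy_from_slice(decoded);
    decoded.len()
}

fn main() {
    // A pre-sized buffer, as in the documentation example.
    let mut prefilled = vec![0; 8];
    let n = append_decoded(&mut prefilled, &[1, 2, 3]);
    assert_eq!(n, 3);
    // The decoded values land after the eight zeros, not at the front.
    assert_eq!(prefilled, vec![0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3]);

    // Starting from an empty vector yields just the decoded values.
    let mut empty = Vec::new();
    append_decoded(&mut empty, &[1, 2, 3]);
    assert_eq!(empty, vec![1, 2, 3]);
}
```

If the real reader behaves like this sketch, a buffer pre-filled with vec![0; 8] would end up with the zeros first and the decoded values after them, which would explain why the final assertions in the example no longer hold.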

@zgershkoff zgershkoff added the bug label Mar 27, 2024
zgershkoff added a commit to zgershkoff/arrow-rs that referenced this issue Mar 27, 2024
tustvold pushed a commit that referenced this issue Mar 28, 2024
@tustvold tustvold added the parquet Changes to the parquet crate label Apr 17, 2024
@tustvold

label_issue.py automatically added labels {'parquet'} from #5561
