Feat/deterministic metadata encoding #7437

Open
wants to merge 3 commits into main

timsaucer
Contributor

Which issue does this PR close?

None, but I can open one if necessary.

Rationale for this change

Metadata is stored in a HashMap, so the order in which its entries are encoded is not consistent from run to run. In unit tests it can be useful to verify an output against a known hash of its serialized values. When the output carries metadata, that hash is not stable.

What changes are included in this PR?

Orders the HashMap keys when encoding the metadata, so the serialized bytes no longer depend on the map's iteration order.
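
To illustrate the idea of ordering keys before encoding, here is a minimal std-only sketch. It is not the code in this PR: `encode_metadata_sorted` is a hypothetical helper and the length-prefixed byte layout is an assumption made up for the example, not the Arrow IPC format.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of arrow-rs): writes each key/value pair
/// as length-prefixed bytes, visiting the keys in sorted order.
fn encode_metadata_sorted(metadata: &HashMap<String, String>) -> Vec<u8> {
    // Collect the entries and sort them by key so that the output does not
    // depend on the HashMap's iteration order.
    let mut entries: Vec<(&String, &String)> = metadata.iter().collect();
    entries.sort_unstable_by(|a, b| a.0.cmp(b.0));

    let mut out = Vec::new();
    for (key, value) in entries {
        // Illustrative layout: u32 little-endian length, then the bytes.
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key.as_bytes());
        out.extend_from_slice(&(value.len() as u32).to_le_bytes());
        out.extend_from_slice(value.as_bytes());
    }
    out
}

fn main() {
    // Two maps with the same contents, built in different insertion orders.
    let forward: HashMap<String, String> = [("a", "1"), ("b", "2"), ("c", "3")]
        .into_iter()
        .map(|(k, v)| (k.to_owned(), v.to_owned()))
        .collect();
    let reverse: HashMap<String, String> = [("c", "3"), ("b", "2"), ("a", "1")]
        .into_iter()
        .map(|(k, v)| (k.to_owned(), v.to_owned()))
        .collect();

    // Sorted encoding yields identical bytes for both maps.
    assert_eq!(
        encode_metadata_sorted(&forward),
        encode_metadata_sorted(&reverse)
    );
}
```

Sorting a temporary vector of references leaves the public HashMap type unchanged, which fits the statement that there are no user-facing changes.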

Are there any user-facing changes?

No.

Example

If you run this example multiple times, you will see that the encoded bytes, and therefore the hash, change from run to run because the order of the HashMap iterator is non-deterministic.

use std::{hash::Hasher, sync::Arc};

use arrow::{array::RecordBatch, datatypes::Schema};

fn main() {
    let schema = Arc::new(
        Schema::empty().with_metadata(
            [
                ("a".to_owned(), "1".to_owned()), //
                ("b".to_owned(), "2".to_owned()), //
                ("c".to_owned(), "3".to_owned()), //
                ("d".to_owned(), "4".to_owned()), //
                ("e".to_owned(), "5".to_owned()), //
            ]
            .into_iter()
            .collect(),
        ),
    );
    let batch = RecordBatch::new_empty(schema.clone());

    dbg!(&batch.schema().metadata().keys());

    let mut bytes = Vec::new();
    let mut w = arrow::ipc::writer::StreamWriter::try_new(&mut bytes, &schema).unwrap();
    w.write(&batch).unwrap();
    w.finish().unwrap();

    let mut h = std::hash::DefaultHasher::new();
    h.write(&bytes);
    let h = h.finish();

    eprintln!("{} bytes -- h = {h:x}", bytes.len());
}
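
The hashing step of the example can be looked at in isolation. The sketch below uses only the standard library; `fingerprint` is a hypothetical helper name, and the byte strings stand in for serialized output rather than real IPC data. It shows that a hasher created with `DefaultHasher::new()` gives the same value for the same bytes, so any change in the printed hash comes from the bytes being written in a different order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical helper: hashes a byte slice the same way the example does,
/// with a hasher created through `DefaultHasher::new()`.
fn fingerprint(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(bytes);
    h.finish()
}

fn main() {
    // Stand-ins for two serializations of the same entries in different orders.
    let first = b"a=1;b=2";
    let second = b"b=2;a=1";

    // Identical bytes always produce the same fingerprint within one build.
    assert_eq!(fingerprint(first), fingerprint(first));

    // Reordered bytes are a different input, so the fingerprints will
    // almost certainly differ.
    eprintln!(
        "first = {:x}, second = {:x}",
        fingerprint(first),
        fingerprint(second)
    );
}
```

Note that the standard library does not promise that the algorithm behind `DefaultHasher` stays the same across Rust releases, so a stored hash of this kind is only comparable between runs built with the same toolchain.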

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Apr 23, 2025
@timsaucer
Contributor Author

I currently have this based on 54.3.1, but I will update it to main after we have completed internal testing.

@timsaucer timsaucer force-pushed the feat/deterministic-metadata-encoding branch from ee273f6 to 5027767 Compare April 24, 2025 11:21
@timsaucer timsaucer marked this pull request as ready for review April 24, 2025 11:21
@github-actions github-actions bot removed the parquet Changes to the parquet crate label Apr 24, 2025
@timsaucer
Contributor Author

Rebased on main, ready for review.

Labels
arrow Changes to the arrow crate

1 participant