Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand parquet crate overview doc #5093

Merged
merged 6 commits into from
Nov 20, 2023
Merged

Expand parquet crate overview doc #5093

merged 6 commits into from
Nov 20, 2023

Conversation

mmaitre314
Copy link
Contributor

Which issue does this PR close?

This is a doc-only update converting the discussions in #5064 and #2394 into docs. It closes neither of these PRs -- code fixes are needed for that.

Rationale for this change

The change clarifies which APIs are available in the parquet crate, what their intents are, how they relate to one-another, and which should be used when.

Are there any user-facing changes?

Doc update

@tustvold
Copy link
Contributor

tustvold commented Nov 17, 2023

This looks fantastic, thank you. I'm not at a computer currently, I'll double check things like formatting when I am, but from a quick read through the content looks 👌

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this, I moved some things around but what you wrote was already very good, I was just inspired to make some further changes

parquet/README.md Show resolved Hide resolved
parquet/Cargo.toml Outdated Show resolved Hide resolved
parquet/src/arrow/async_reader/store.rs Outdated Show resolved Hide resolved
/// # use parquet::schema::printer::print_parquet_metadata;
/// # async fn main() {
/// // Populate configuration from environment
/// let storage_container = Arc::new(MicrosoftAzureBuilder::from_env().build()?);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tweaked this to use from_env to cut down on boilerplate

parquet/src/file/mod.rs Show resolved Hide resolved
parquet/src/lib.rs Show resolved Hide resolved
parquet/src/lib.rs Outdated Show resolved Hide resolved
parquet/src/lib.rs Outdated Show resolved Hide resolved
parquet/src/lib.rs Outdated Show resolved Hide resolved
@tustvold
Copy link
Contributor

As I've made some non-trivial modifications I intend to leave this open for until Monday, to give other reviewers time to comment

@mmaitre314
Copy link
Contributor Author

As I've made some non-trivial modifications I intend to leave this open for until Monday, to give other reviewers time to comment

Thanks! The update looks good to me. The one piece I would keep out would be the use of from_env(), because it changes the scenario. It looks like from_env() assumes there is a single Storage Account with a single Container and only the Blob Path changes each time, which is not very typical in data pipelines where datasets may come from different teams and may be partitioned to avoid throttling, to implement different access control lists, or abide by different regional regulatory requirements. The one boiler-plate line is .with_use_azure_cli(true). Were an official Azure SDK available, this would not be there and instead DefaultAzureCredential would be used to handle auth transparently and have the same code run during local development and remote deployment.

@tustvold
Copy link
Contributor

My intention was to focus on showing how to configure integration with object_store, not provide an exhaustive list of all the various ways one might configure it. from_env provides a minimal viable baseline that people can then override as they see fit within their applications

@tustvold tustvold merged commit 6815bf1 into apache:master Nov 20, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants