Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39456: [Go][Parquet] Arrow DATE64 Type Coerced to Parquet DATE Logical Type #39460

Merged
merged 2 commits into from
Jan 9, 2024

Conversation

joellubi
Copy link
Member

@joellubi joellubi commented Jan 5, 2024

Rationale for this change

Closes: #39456

What changes are included in this PR?

Update physical and logical type mapping from Arrow to Parquet for DATE64 type

Are these changes tested?

Yes,

  • Update expected schema mapping in existing test
  • Tests asserting new behavior
    • Arrow DATE64 will roundtrip -> Parquet -> Arrow as DATE32
    • Arrow DATE64 not aligned to exact date boundary will truncate to milliseconds at boundary of greatest full day on Parquet roundtrip

Are there any user-facing changes?

Yes, users of pqarrow.FileWriter will produce Parquet files containing DATE logical type instead of TIMESTAMP[ms] when writing Arrow data containing DATE64 field(s). The proposed implementation truncates int64 values to be divisible by 86400000 rather than validating that this is already the case, as some implementations do. I am happy to add this validation if it would be preferred, but the truncating behavior will likely break fewer existing users.

I'm not sure whether this is technically considered a breaking change to a public API and if/how it should be communicated. Any direction regarding this would be appreciated.

@joellubi joellubi requested a review from zeroshade as a code owner January 5, 2024 00:54
Copy link

github-actions bot commented Jan 5, 2024

⚠️ GitHub issue #39456 has been automatically assigned in GitHub to PR creator.

@zeroshade zeroshade added Breaking Change Includes a breaking change to the API Component: Parquet labels Jan 9, 2024
@zeroshade zeroshade merged commit eade938 into apache:main Jan 9, 2024
25 checks passed
@zeroshade zeroshade removed the awaiting review Awaiting review label Jan 9, 2024
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit eade938.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…TE Logical Type (apache#39460)

### Rationale for this change

Closes: apache#39456 

### What changes are included in this PR?

Update physical and logical type mapping from Arrow to Parquet for DATE64 type

### Are these changes tested?

Yes,
- Update expected schema mapping in existing test
- Tests asserting new behavior
  - Arrow DATE64 will roundtrip -> Parquet -> Arrow as DATE32
  - Arrow DATE64 _not aligned_ to exact date boundary will truncate to milliseconds at boundary of greatest full day on Parquet roundtrip

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files containing `DATE` logical type instead of `TIMESTAMP[ms]` when writing Arrow data containing DATE64 field(s). The proposed implementation truncates `int64` values to be divisible by 86400000 rather than validating that this is already the case, as some implementations do. I am happy to add this validation if it would be preferred, but the truncating behavior will likely break fewer existing users.

I'm not sure whether this is technically considered a breaking change to a public API and if/how it should be communicated. Any direction regarding this would be appreciated.

* Closes: apache#39456

Authored-by: Joel Lubinitsky <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…TE Logical Type (apache#39460)

### Rationale for this change

Closes: apache#39456 

### What changes are included in this PR?

Update physical and logical type mapping from Arrow to Parquet for DATE64 type

### Are these changes tested?

Yes,
- Update expected schema mapping in existing test
- Tests asserting new behavior
  - Arrow DATE64 will roundtrip -> Parquet -> Arrow as DATE32
  - Arrow DATE64 _not aligned_ to exact date boundary will truncate to milliseconds at boundary of greatest full day on Parquet roundtrip

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files containing `DATE` logical type instead of `TIMESTAMP[ms]` when writing Arrow data containing DATE64 field(s). The proposed implementation truncates `int64` values to be divisible by 86400000 rather than validating that this is already the case, as some implementations do. I am happy to add this validation if it would be preferred, but the truncating behavior will likely break fewer existing users.

I'm not sure whether this is technically considered a breaking change to a public API and if/how it should be communicated. Any direction regarding this would be appreciated.

* Closes: apache#39456

Authored-by: Joel Lubinitsky <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…TE Logical Type (apache#39460)

### Rationale for this change

Closes: apache#39456 

### What changes are included in this PR?

Update physical and logical type mapping from Arrow to Parquet for DATE64 type

### Are these changes tested?

Yes,
- Update expected schema mapping in existing test
- Tests asserting new behavior
  - Arrow DATE64 will roundtrip -> Parquet -> Arrow as DATE32
  - Arrow DATE64 _not aligned_ to exact date boundary will truncate to milliseconds at boundary of greatest full day on Parquet roundtrip

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files containing `DATE` logical type instead of `TIMESTAMP[ms]` when writing Arrow data containing DATE64 field(s). The proposed implementation truncates `int64` values to be divisible by 86400000 rather than validating that this is already the case, as some implementations do. I am happy to add this validation if it would be preferred, but the truncating behavior will likely break fewer existing users.

I'm not sure whether this is technically considered a breaking change to a public API and if/how it should be communicated. Any direction regarding this would be appreciated.

* Closes: apache#39456

Authored-by: Joel Lubinitsky <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking Change Includes a breaking change to the API Component: Go Component: Parquet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Go][Parquet] Arrow DATE64 type is coerced into Parquet TIMESTAMP[ms] logical type instead of DATE (32-bit)
2 participants