Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: allow floats as timestamp column #777

Closed
wants to merge 2 commits into from

Conversation

jordanrfrazier
Copy link
Collaborator

@jordanrfrazier jordanrfrazier commented Sep 26, 2023

Allows timestamp column to be float for parquet sources in the new prepare path.

Also fixes the time_unit, which was previously not being taken into account.

Closes #776

@cla-bot cla-bot bot added the cla-signed Set when all authors of a PR have signed our CLA label Sep 26, 2023
@github-actions github-actions bot added bug Something isn't working sparrow labels Sep 26, 2023
} else {
Ok(time.clone())
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this is halfway duplicated from the prepare_batch code. It's not a complete duplicate, but there's definitely room to simplify, but it would require more refactoring of the column_behavior code. Not sure how much we want to do now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best options seem to be either:

  1. Drop the column behavior stuff and have both paths call prepare_batch (the newer code).
  2. Add a column behavior corresponding to numeric_to_timestamp which prepare_batch uses for the timestamp case.

Either would let us avoid adding as much code to column_behavior.

@@ -24,8 +25,15 @@ pub enum ColumnBehavior {
/// Cast the given column to the given data type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use / change the column behavior? I think we could just use the prepare_batch method and avoid that. Alternatively, it seems like we could have a case here that called into the numeric_to_timestamp and at least shared that case (rather than adding more complexity to column behaviors).

} else {
Ok(time.clone())
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best options seem to be either:

  1. Drop the column behavior stuff and have both paths call prepare_batch (the newer code).
  2. Add a column behavior corresponding to numeric_to_timestamp which prepare_batch uses for the timestamp case.

Either would let us avoid adding as much code to column_behavior.

python/pytests/parquet_source_test.py Show resolved Hide resolved
)
golden.jsonl(source)

async def test_time_column_as_float_can_cast_ns(golden) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cla-signed Set when all authors of a PR have signed our CLA sparrow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading parquet sources with Float timestamp is broken
2 participants