-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Proof-of-concept Parquet GEOMETRY logical type implementation #43196
Conversation
d6c8c70
to
6d3a12d
Compare
aa5424d
to
b34b1c2
Compare
cpp/src/parquet/statistics.h
Outdated
@@ -24,6 +24,7 @@ | |||
#include <string> | |||
#include <utility> | |||
|
|||
#include "parquet/geometry_util.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use forward declaration? This is a public header.
} | ||
}; | ||
|
||
class WKBBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move them into arrow/util instead of parquet only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking twice, I think we can keep this in the parquet module. But it should be renamed to geometry_util_internal.h
to avoid making it public.
@paleolimbot Hey Dewey, I know you have been busy at other projects. We really want to speed up the process of the Parquet Geometry PR since the proposal and Java POC is already done. Do you have an estimated timeline for the C++ POC implementation? If you have too much on your plate, we will be happy to take over or help with the C++ POC. |
Sorry for the slow reply, I've been out of office! You are of course welcome to continue on top of this PR...I'm back in office Thursday and can definitely review. I think the main remaining portion is the Arrow IO and serde of the statistics. |
@paleolimbot Thanks for the message. We will soon create a PR against your branch! |
crs = type.GEOMETRY.crs; | ||
} | ||
|
||
LogicalType::GeometryEdges::edges edges = LogicalType::GeometryEdges::UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we throw when edges is UNKNOWN? It is a required field by design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was emulating how enum parameters are handled in the date/time types here (but I may not have been!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in fb134d3
} | ||
|
||
LogicalType::GeometryEncoding::geometry_encoding encoding = | ||
LogicalType::GeometryEncoding::UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, this is a required field by design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in fb134d3
@@ -135,6 +186,7 @@ class PARQUET_EXPORT EncodedStatistics { | |||
bool has_max = false; | |||
bool has_null_count = false; | |||
bool has_distinct_count = false; | |||
bool has_geometry_statistics = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know other fields already have has_xxx
flags before C++11. Should we use std::optional<EncodedGeometryStatistics>
to make it simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My sense would be to handle changing that pattern with a dedicated PR but I have no opinions either way 🙂
return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make( | ||
descr, encoded_min, encoded_max, num_values, null_count, distinct_count, | ||
has_min_max, has_null_count, has_distinct_count, pool)); | ||
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this will break downstream users as it is a public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would adding the two new arguments to the end of the argument list (with default values) make this a non-breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to make it ABI-compatible or source-compatible? If ABI compatibility is required we must add a new override version of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think API-compatible is sufficient. Please correct me if I'm wrong @pitrou
bounder_.ReadGeometryTypes(other.bounder_.WkbTypes()); | ||
} | ||
|
||
void Update(const ByteArray* values, int64_t num_values, int64_t null_count) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need a UpdateSpaced
variant which deals with inputs with validity bits, like what TypeStatistics::UpdateSpaced
does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added UpdateSpaced in fb134d3, will add tests for it later.
} | ||
}; | ||
|
||
class WKBBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking twice, I think we can keep this in the parquet module. But it should be renamed to geometry_util_internal.h
to avoid making it public.
class WKBGenericSequenceBounder { | ||
public: | ||
WKBGenericSequenceBounder() | ||
: xy_(chunk_), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to keep 8 variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is me being fancy and trying to avoid branching 🙂 . A runtime switch is probably fine for a POC and it's possibly worth benchmarking before introducing this complexity.
Thanks @paleolimbot for initiating the PoC! I've left some inline comments on this PR. Thanks @jiayuasu and @Kontinuation to continue the effort! Let's work together to make it to the finish line. |
Closing in favour of #43977 ! |
Rationale for this change
In apache/parquet-format#240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation ( apache/parquet-java#1379 ). This is a PR to explore what an implementation would look like in C++.
What changes are included in this PR?
Work in progress!
Are these changes tested?
They will be!
Are there any user-facing changes?
Yes! (And will eventually be documented)