Multi-label filtering #607
Hi, @Elssky, it's better to remove the |
Hi @Elssky, thanks for the contribution. I have roughly reviewed the PR and left some comments. To make the PR clearer and more elegant, I highly recommend that you:
- Add documentation, or update the existing documentation, if you add a new API or make a breaking change to the current API.
- Avoid making this feature one HUGE PR if you can split it into sub-features. That would make reviewing easier and help with backtracking.
auto filter_vertices = maybe_filter_vertices_collection.value();
std::cout << "valid vertices num: " << filter_vertices->size() << std::endl;

// for (auto it = filter_vertices->begin(); it != filter_vertices->end();
Please either uncomment this code or remove it before merging the PR.
auto filter_vertices_2 = maybe_filter_vertices_collection_2.value();
std::cout << "valid vertices num: " << filter_vertices_2->size() << std::endl;

// for (auto it = filter_vertices_2->begin(); it != filter_vertices_2->end();
Ditto
auto filter_vertices_3 = maybe_filter_vertices_collection_3.value();
std::cout << "valid vertices num: " << filter_vertices_3->size() << std::endl;

// for (auto it = filter_vertices_3->begin(); it != filter_vertices_3->end();
Ditto
@@ -51,6 +51,18 @@ Result<std::shared_ptr<arrow::Schema>> PropertyGroupToSchema(
  return arrow::schema(fields);
}

Result<std::shared_ptr<arrow::Schema>> LabelToSchema(
What is this method for? Could you add some documentation for it?
This method is intended to be used when the VertexPropertyArrowChunkReader initializes schema_; see line 164.
parquet::WriterProperties::Builder builder;
builder.compression(arrow::Compression::type::ZSTD); // enable compression
// builder.compression(arrow::Compression::type::ZSTD); // enable
Why is the compression code commented out?
Thank you! This was set up for an experimental comparison; I'll revert it.
Reason for this PR
Multi-label support is very important for users. This PR adds compatibility with multi-label data and provides a label filtering function, so that users can quickly find the vertices that carry specific labels.
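To illustrate what label filtering means here, the following is a minimal, self-contained sketch and not the code from this PR. It assumes labels are held as one boolean column per label, which the description above does not confirm. The names `LabelColumns` and `FilterByLabels`, and the labels "Person" and "Student" in the usage note, are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory stand-in for a label chunk: one boolean column per
// label, where column[i] is true if vertex i carries that label.
using LabelColumns = std::unordered_map<std::string, std::vector<bool>>;

// Returns the ids of the vertices that carry every label in `required`.
// A label with no column matches no vertex; an empty `required` list
// matches every vertex.
std::vector<std::size_t> FilterByLabels(
    const LabelColumns& columns, const std::vector<std::string>& required,
    std::size_t num_vertices) {
  std::vector<bool> keep(num_vertices, true);
  for (const auto& label : required) {
    auto it = columns.find(label);
    if (it == columns.end()) {
      return {};  // unknown label: nothing can match
    }
    const std::vector<bool>& col = it->second;
    for (std::size_t i = 0; i < num_vertices; ++i) {
      // Rows beyond the end of a short column count as "label absent".
      keep[i] = keep[i] && i < col.size() && col[i];
    }
  }
  std::vector<std::size_t> ids;
  for (std::size_t i = 0; i < num_vertices; ++i) {
    if (keep[i]) {
      ids.push_back(i);
    }
  }
  return ids;
}
```

For example, with four vertices where "Person" is {true, true, false, true} and "Student" is {false, true, false, true}, filtering on both labels keeps vertices 1 and 3. The sketch scans every column in full; a real reader would more likely work chunk by chunk.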
What changes are included in this PR?
It mainly implements three major functions:
Are these changes tested?
Yes! We use test_multi_label.cc to generate multi-label chunks in Parquet format and high_level_label_reader_example.cc to verify the correctness of the function. We also ran a performance comparison against Acero in label_filter_benchmark.cc. It seems our implementation is about 5 times faster.
Are there any user-facing changes?
Yes, and this has been modified before; see #605.