Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-label filtering #607

Closed
wants to merge 5 commits into from
Closed

Multi-label filtering #607

wants to merge 5 commits into from

Conversation

Elssky
Copy link
Contributor

@Elssky Elssky commented Aug 29, 2024

Reason for this PR

The support of multi-label function is very important for users. This PR implements compatibility with multi-label data and provides label filtering function (to facilitate users to quickly find specific label vertices)

What changes are included in this PR?

It mainly implements three major functions:

  1. Read multi-label data and build graphs for storage.
  2. Allow users to specify one or more labels to filter vertices.
  3. Access label information of a single vertex.

Are these changes tested?

Yes! We use test_multi_label.cc to generate multi labels chunks in parquet filetype and high_level_label_reader_example.cc to verify the correctness of the function.
And we conducted a performance comparison test with acero in label_filter_benchmark.cc . It seem we are about 5 times faster.
image

Are there any user-facing changes?

yes, and it has been modified before, see #605

@Elssky Elssky changed the title Multi label [WIP] Multi label Aug 29, 2024
@Elssky Elssky force-pushed the multi-label branch 7 times, most recently from ffa6dc2 to a7eadb7 Compare August 29, 2024 10:07
@Elssky Elssky force-pushed the multi-label branch 3 times, most recently from 882d58c to e8904bb Compare September 18, 2024 08:45
@Elssky Elssky marked this pull request as ready for review September 18, 2024 08:47
@Elssky Elssky force-pushed the multi-label branch 3 times, most recently from 0b83bcf to df6dc99 Compare September 18, 2024 09:59
@acezen
Copy link
Contributor

acezen commented Sep 24, 2024

Hi, @Elssky, it's better to remove the WIP to trigger the CI.

@Elssky Elssky changed the title [WIP] Multi label Multi-label filtering Sep 24, 2024
Copy link
Contributor

@acezen acezen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @Elssky, thanks for the contribution, I have roughly review the PR and leaved some comments. To make the PR more clear and elegant, I highly recommend that

  • Add document or update the document if you add new API or break down the current API
  • Don't make this feature as a HUGE PR if you can split the feature into some sub-feature. That would make reviewing easier and help to back track.

auto filter_vertices = maybe_filter_vertices_collection.value();
std::cout << "valid vertices num: " << filter_vertices->size() << std::endl;

// for (auto it = filter_vertices->begin(); it != filter_vertices->end();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may uncomment the code or remove it before merge the PR.

auto filter_vertices_2 = maybe_filter_vertices_collection_2.value();
std::cout << "valid vertices num: " << filter_vertices_2->size() << std::endl;

// for (auto it = filter_vertices_2->begin(); it != filter_vertices_2->end();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

auto filter_vertices_3 = maybe_filter_vertices_collection_3.value();
std::cout << "valid vertices num: " << filter_vertices_3->size() << std::endl;

// for (auto it = filter_vertices_3->begin(); it != filter_vertices_3->end();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@@ -51,6 +51,18 @@ Result<std::shared_ptr<arrow::Schema>> PropertyGroupToSchema(
return arrow::schema(fields);
}

Result<std::shared_ptr<arrow::Schema>> LabelToSchema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the method for? can you add some document for the method?

Copy link
Contributor Author

@Elssky Elssky Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is intended to be used when the VertexPropertyArrowChunkReader is initializing theschema_, see Line 164

parquet::WriterProperties::Builder builder;
builder.compression(arrow::Compression::type::ZSTD); // enable compression
// builder.compression(arrow::Compression::type::ZSTD); // enable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why comment the compression code ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This was set up in an experimental comparison, I'll withdraw this.

@Elssky
Copy link
Contributor Author

Elssky commented Sep 24, 2024

Hi, @Elssky, thanks for the contribution, I have roughly review the PR and leaved some comments. To make the PR more clear and elegant, I highly recommend that

  • Add document or update the document if you add new API or break down the current API
  • Don't make this feature as a HUGE PR if you can split the feature into some sub-feature. That would make reviewing easier and help to back track.

yes, I seperate this PR to #634 and 8813288

@Elssky Elssky marked this pull request as draft October 23, 2024 06:17
@Elssky Elssky closed this Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants