Library changes for Apache Arrow integration #16691

Merged: 9 commits merged into opensearch-project:main from arrow-integration-lib on Nov 29, 2024


@rishabhmaurya rishabhmaurya commented Nov 19, 2024

Description

Library changes for Apache Arrow integration. The library only exposes POJOs for creating a StreamProducer and registering it with a StreamManager. A StreamProducer in turn exposes a BatchedJob, which creates and fills Arrow vectors in batches while handling client backpressure. The Arrow APIs exposed are therefore kept minimal, limited to:

  api "org.apache.arrow:arrow-vector:${versions.arrow}"
  api "org.apache.arrow:arrow-format:${versions.arrow}"
  api "org.apache.arrow:arrow-memory-core:${versions.arrow}"

The server module will depend on libs:arrow to create and register a StreamProducer for search results, populating vectors that follow a well-defined schema for search results.
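To make the producer/job/manager relationship concrete, here is a minimal, dependency-free sketch of the flow described above. It is not the API added in this PR: every type below is a hypothetical stand-in, a plain `int[]`-backed `Batch` replaces the Arrow vectors, the string ticket is invented for illustration, and backpressure is modeled simply as a call that returns once the consumer has copied the batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for the POJOs named in the description. The real
// interfaces operate on Arrow vectors and may have different signatures.

/** Reusable batch buffer, standing in for a set of Arrow vectors. */
final class Batch {
    final int[] values;
    int rowCount; // number of valid rows in the current batch

    Batch(int capacity) {
        this.values = new int[capacity];
    }
}

/** Called by a job after filling a batch; returns once the consumer has drained it. */
interface FlushSignal {
    void awaitConsumption();
}

/** Fills the same batch buffer repeatedly, signalling a flush after each fill. */
interface BatchedJob {
    void run(Batch root, FlushSignal flushSignal);

    void onCancel();
}

/** Creates the batch buffer and the job that fills it. */
interface StreamProducer {
    Batch createRoot();

    BatchedJob createJob();
}

/** Registers producers and hands back an opaque ticket identifying the stream. */
final class StreamManager {
    private final Map<String, StreamProducer> streams = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    String registerStream(StreamProducer producer) {
        String ticket = "ticket-" + nextId.incrementAndGet();
        streams.put(ticket, producer);
        return ticket;
    }

    /** Runs the producer's job synchronously and copies out each flushed batch. */
    List<int[]> readAll(String ticket) {
        StreamProducer producer = streams.get(ticket);
        if (producer == null) {
            throw new IllegalArgumentException("unknown ticket: " + ticket);
        }
        Batch root = producer.createRoot();
        List<int[]> batches = new ArrayList<>();
        // The job blocks in awaitConsumption() until this copy has been taken,
        // so it never overwrites a batch the consumer has not read yet.
        producer.createJob().run(root, () -> batches.add(Arrays.copyOf(root.values, root.rowCount)));
        return batches;
    }
}

/** Example producer that streams the integers 0..total-1 in fixed-size batches. */
final class RangeProducer implements StreamProducer {
    private final int total;
    private final int batchSize;

    RangeProducer(int total, int batchSize) {
        this.total = total;
        this.batchSize = batchSize;
    }

    @Override
    public Batch createRoot() {
        return new Batch(batchSize);
    }

    @Override
    public BatchedJob createJob() {
        return new BatchedJob() {
            private volatile boolean cancelled;

            @Override
            public void run(Batch root, FlushSignal flushSignal) {
                int next = 0;
                while (next < total && !cancelled) {
                    int rows = Math.min(batchSize, total - next);
                    for (int i = 0; i < rows; i++) {
                        root.values[i] = next++;
                    }
                    root.rowCount = rows;
                    flushSignal.awaitConsumption(); // wait for the consumer
                }
            }

            @Override
            public void onCancel() {
                cancelled = true; // the fill loop stops before the next batch
            }
        };
    }
}
```

In this sketch, `new StreamManager().registerStream(new RangeProducer(5, 2))` returns a ticket, and `readAll(ticket)` yields the batches one at a time as the job flushes them. Reusing a single buffer across batches is what makes the flush signal necessary: the job must not refill the buffer until the consumer is done with it.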

A future PR will add a module, modules:arrow-flight-rpc, which will expose the Arrow Flight server and client and the actual logic to create a FlightStream. It will be a bulky module in terms of its direct and transitive dependencies.
[Figure: seq_diag]

Related Issues

Resolves #16679

Check List

  • Functionality includes testing.
    - [ ] API changes companion pull request created, if applicable.
    - [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rishabhmaurya rishabhmaurya changed the title Library changes for arrow integration Library changes for Apache Arrow integration Nov 19, 2024
@rishabhmaurya rishabhmaurya self-assigned this Nov 19, 2024
Contributor

Hello!
We have added a performance benchmark workflow that runs by adding a comment on the PR.
Please refer to https://github.com/opensearch-project/OpenSearch/blob/main/PERFORMANCE_BENCHMARKS.md for how to run benchmarks on pull requests.

Contributor

❌ Gradle check result for c4d0735: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions github-actions bot added the Meta Meta issue, not directly linked to a PR label Nov 19, 2024
@rishabhmaurya rishabhmaurya added skip-changelog and removed Meta Meta issue, not directly linked to a PR labels Nov 19, 2024
@github-actions github-actions bot added the Meta Meta issue, not directly linked to a PR label Nov 19, 2024
@rishabhmaurya rishabhmaurya force-pushed the arrow-integration-lib branch 3 times, most recently from 59c0acd to fe262f2 Compare November 19, 2024 22:05
Contributor

❌ Gradle check result for fe262f2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 3295caf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishabh Maurya <[email protected]>
Signed-off-by: Rishabh Maurya <[email protected]>
Contributor

❕ Gradle check result for dca6d1f: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreStatsIT.testNonZeroPrimaryStatsOnNewlyCreatedIndexWithZeroDocs

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Two outdated, resolved review threads on libs/arrow-spi/build.gradle
Signed-off-by: Rishabh Maurya <[email protected]>
Contributor

✅ Gradle check result for e472e7e: SUCCESS

Contributor

✅ Gradle check result for 49b701e: SUCCESS

@reta
Collaborator

reta commented Nov 28, 2024

A few nits, but LGTM otherwise @rishabhmaurya! I think the SPI is looking good to start with (it will certainly evolve, but this is fine).

Signed-off-by: Rishabh Maurya <[email protected]>
Contributor

✅ Gradle check result for 7ef6dc3: SUCCESS

@rishabhmaurya
Contributor Author

@reta I agree! StreamReader is the one that will evolve the most; it's very minimal right now. I appreciate your feedback on improving the APIs; they look a lot better now. I'm looking forward to your reviews of the next PR, which covers the server bootstrap logic.

Collaborator

@reta reta left a comment

@andrross do you want to take a look? thanks!

@andrross
Member

@rishabhmaurya Just curious, how are you going to handle the Guava dependency that Arrow has? The code currently forbids the server module from taking a Guava dependency (presumably to prevent dependency conflicts with plugins that may require a different version?). Will you be able to isolate that dependency to this future arrow-flight-rpc module?

@andrross andrross merged commit 6d3fd37 into opensearch-project:main Nov 29, 2024
39 of 42 checks passed
@andrross
Member

@rishabhmaurya Should we backport to 2.x?

@rishabhmaurya rishabhmaurya added the backport 2.x Backport to 2.x branch label Nov 29, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 29, 2024
* Library changes for arrow integration

Signed-off-by: Rishabh Maurya <[email protected]>

* Bump guava 32->33

Signed-off-by: Rishabh Maurya <[email protected]>

* add support for onCancel and Cancellable for BatchedJob in lib:arrow module

Signed-off-by: Rishabh Maurya <[email protected]>

* address PR comments

Signed-off-by: Rishabh Maurya <[email protected]>

* Move StreamTicket to an interface

Signed-off-by: Rishabh Maurya <[email protected]>

* remove jackson dependencies

Signed-off-by: Rishabh Maurya <[email protected]>

* make sl4j runtime only

Signed-off-by: Rishabh Maurya <[email protected]>

* introduce factory for stream ticket

Signed-off-by: Rishabh Maurya <[email protected]>

* Address PR comments

Signed-off-by: Rishabh Maurya <[email protected]>

---------

Signed-off-by: Rishabh Maurya <[email protected]>
(cherry picked from commit 6d3fd37)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@rishabhmaurya
Contributor Author

rishabhmaurya commented Nov 29, 2024

> how are you going to handle the Guava dependency that Arrow has? The code currently forbids the server module from taking a Guava dependency (presumably to prevent dependency conflicts with plugins that may require a different version?). Will you be able to isolate that dependency to this future arrow-flight-rpc module?

@andrross yes, I might be wrong, but it's mostly the Arrow Flight modules, not arrow-vector and arrow-memory, that need the Guava dependency, so it will be isolated to the arrow-flight-rpc module that is part of my feature branch. I ended up bumping the Guava version early as part of this PR, though.

> Should we backport to 2.x?

yes, just added the label.

@andrross
Member

> yes, i might be wrong but its mostly arrow flight modules and not arrow vector and memory which needs guava dependencies

@rishabhmaurya I'm guessing Guava comes transitively via the grpc dependencies, which are probably isolated to the flight modules.

reta pushed a commit that referenced this pull request Nov 29, 2024
* Library changes for arrow integration
* Bump guava 32->33
* add support for onCancel and Cancellable for BatchedJob in lib:arrow module
* address PR comments
* Move StreamTicket to an interface
* remove jackson dependencies
* make sl4j runtime only
* introduce factory for stream ticket
* Address PR comments

---------

(cherry picked from commit 6d3fd37)

Signed-off-by: Rishabh Maurya <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x (Backport to 2.x branch), Meta (Meta issue, not directly linked to a PR), Search:Performance, skip-changelog
Projects
Status: Todo
Archived in project
Development

Successfully merging this pull request may close these issues.

[META] Search streams using Apache Arrow and Flight
3 participants