[META] Search streams using Apache Arrow and Flight #16679

Open
1 of 15 tasks
rishabhmaurya opened this issue Nov 19, 2024 · 3 comments · Fixed by #16691

rishabhmaurya commented Nov 19, 2024

Please describe the end goal of this project

  • In-memory columnar representation of any intermediate results from search
    • Data adjacency for sequential access (scans)
    • O(1) (constant-time) random access
    • SIMD and vectorization-friendly
    • Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory.
  • Interoperable representation of columnar data that can be shared across engines, for example between OpenSearch and DataFusion, a Rust-based engine.
  • RPC using bidirectional streams: gRPC bidirectional streams handle backpressure from the client in real time and produce batches of records on demand. These streams are used both for internode communication (between data nodes and the coordinator) and for communication with the end client.
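
To make the columnar properties listed above concrete, here is a minimal sketch using only the Python standard library. It is not Apache Arrow and not OpenSearch code: the field names `doc_id` and `score`, the sample hits, and the helper `to_columns` are all hypothetical, and `array` merely stands in for a contiguous fixed-width buffer.

```python
from array import array

# Hypothetical search hits in row form (one dict per document).
rows = [
    {"doc_id": 7, "score": 1.5},
    {"doc_id": 3, "score": 0.25},
    {"doc_id": 9, "score": 2.0},
]


def to_columns(rows):
    """Pivot row-oriented hits into one contiguous typed buffer per field."""
    return {
        "doc_id": array("q", (r["doc_id"] for r in rows)),  # signed 64-bit ints
        "score": array("d", (r["score"] for r in rows)),    # 64-bit floats
    }


columns = to_columns(rows)

# A sequential scan touches only the one buffer it needs (data adjacency).
total_score = sum(columns["score"])

# Random access is an index into a fixed-width buffer, so it is O(1).
third_doc = columns["doc_id"][2]

# The buffer holds raw values with no internal pointers, so a view over it
# can be handed to another consumer without copying the data.
score_view = memoryview(columns["score"])
```

Because each column is a flat run of fixed-width values, the same bytes could in principle be placed in shared memory and read by another process as-is, which is the property the "relocatable" bullet refers to.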

Use cases

  • Optimize memory overhead, CPU utilization, and performance for:
    • Search pagination API
    • Aggregation (more details to follow)

Apache Arrow will serve as the library for in-memory columnar representation of any transient results used for retrieval in these use cases. Arrow Flight will be used for streaming RPC.
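
The idea of producing record batches on demand, with the client setting the pace, can be illustrated in a single process with a Python generator. This is only an analogy for the pull-driven behaviour described in this issue: it uses neither gRPC nor Arrow Flight, and `batch_stream`, `produced_log`, and the batch size are hypothetical names and values chosen for the example.

```python
from itertools import islice


def batch_stream(doc_ids, batch_size, produced_log):
    """Yield column batches lazily; nothing is built until the consumer asks."""
    it = iter(doc_ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        produced_log.append(len(batch))  # record that a batch was materialized
        yield {"doc_id": batch}


produced = []
stream = batch_stream(range(10), batch_size=4, produced_log=produced)

# The consumer controls the pace: each next() is one unit of demand.
first = next(stream)
produced_after_first_pull = list(produced)

# Draining the stream pulls the remaining batches one at a time.
rest = list(stream)
```

After the first pull only one batch has been materialized; the producer does no further work until the consumer asks again, which is the effect backpressure is meant to achieve across the network.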

Supporting References

JOINs RFC making use of this integration - #15185

Issues

Related component

Search

@rishabhmaurya rishabhmaurya added Meta Meta issue, not directly linked to a PR untriaged labels Nov 19, 2024
@rishabhmaurya rishabhmaurya self-assigned this Nov 19, 2024
@rishabhmaurya rishabhmaurya moved this from New to In Progress in OpenSearch Roadmap Nov 19, 2024
@rishabhmaurya rishabhmaurya changed the title [META] Apache Arrow and Flight integration [META] Search streams using Apache Arrow and Flight Nov 25, 2024
@rishabhmaurya rishabhmaurya moved this from Todo to In Progress in Performance Roadmap Nov 25, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Nov 29, 2024
@harshavamsi harshavamsi reopened this Dec 3, 2024
@github-project-automation github-project-automation bot moved this from Done to In Progress in Performance Roadmap Dec 3, 2024
@dblock dblock removed the untriaged label Dec 9, 2024

dblock commented Dec 9, 2024

[Catch All Triage - 1, 2, 3, 4]


jngz-es commented Jan 22, 2025

This feature is awesome! We have one more use case from OpenSearch ML, opensearch-project/ml-commons#3328: streaming model/agent predictions, especially for LLM calls.

@rishabhmaurya
Contributor Author

@jngz-es https://huggingface.co/docs/datasets/en/about_arrow covers this in the context of LLM datasets and how the Arrow format can help.
