[Feature Request] Ingestion pipelines using S3 compatible storage instead of base64 encoded data #16170
Comments
I would like to give it a try!
Since we're already using S3 and OpenSearch, adding a new technology for file ingestion could make things more complicated than necessary. OpenSearch has a file ingestion API, and since S3 is widely used as a modern filesystem, it makes sense to take advantage of this. By ingesting files directly from S3 URLs, we simplify the process, reduce the need for extra services, and make better use of what we already have in place. This approach is both efficient and scalable, without adding unnecessary complexity to our stack.
Hi @ksanderer, thank you for the insights! I agree that minimizing complexity is crucial. I'm particularly interested in how we can implement streaming directly to OpenSearch, as it could optimize our memory usage and overall efficiency.
Hi @dblock! What does this comment signify? Can you please explain?
Sorry for the cryptic comment :) Check out opensearch-project/.github#233, does this help?
Is your feature request related to a problem? Please describe
It's frustrating that we can't use S3 directly in ingestion pipelines. We must first load a file from S3-compatible storage, base64-encode it, and then push it to the OpenSearch API.
It should be possible to use direct S3 links (e.g., s3://{bucket}/path_to_file.pdf) or provide an S3 key to ingest the file directly.
Describe the solution you'd like
Instead of fetching files from S3 and pushing them to OpenSearch using separate tools (e.g., a Python service):
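The current multi-step flow can be sketched as follows (a minimal sketch, not code from the issue: the helper names, the client setup, and the pipeline name `attachment-pipeline` are illustrative assumptions):

```python
import base64


def build_attachment_doc(file_bytes: bytes) -> dict:
    # The ingest attachment processor expects the raw file as a
    # base64 string in the document's "data" field.
    return {"data": base64.b64encode(file_bytes).decode("ascii")}


def ingest_from_s3(bucket: str, key: str, index: str) -> None:
    # Hypothetical glue code: fetch from S3, encode, push to OpenSearch.
    # Third-party clients are imported lazily so the helpers above
    # stay usable without them installed.
    import boto3
    from opensearchpy import OpenSearch

    s3 = boto3.client("s3")
    file_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    client = OpenSearch(hosts=["https://localhost:9200"])  # placeholder host
    client.index(
        index=index,
        body=build_attachment_doc(file_bytes),
        pipeline="attachment-pipeline",  # assumed ingest pipeline name
    )
```

Note the whole file must be held in memory and inflated by base64 encoding before it ever reaches the cluster, which is the overhead this request aims to eliminate.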
We can push files directly to OpenSearch:
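A hypothetical request shape for such an API might look like this (illustrative only; no such endpoint or field exists today, and the field name `s3_source` is invented for this sketch):

```
PUT /my-index/_doc/1?pipeline=attachment-pipeline
{
  "s3_source": "s3://my-bucket/path_to_file.pdf"
}
```

The cluster would resolve the S3 reference itself, so the client never has to download or base64-encode the file.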
The idea is to use a predefined S3 bucket, similar to snapshot repositories that can be configured to use S3 storage.
Related component
Other
Describe alternatives you've considered
Amazon offers an SQS-powered solution, but it's not available on other platforms like DigitalOcean OpenSearch.
We currently use a small Python service for this purpose. It receives an S3 key, fetches the file, and pushes the content to the OpenSearch cluster.
Additional context
No response