
[Feature Request] Ingestion pipelines using S3 compatible storage instead of base64 encoded data #16170

Open
ksanderer opened this issue Oct 2, 2024 · 6 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), ingest-pipeline, Other

Comments


ksanderer commented Oct 2, 2024

Is your feature request related to a problem? Please describe

It's frustrating that we can't use S3 directly in ingestion pipelines. We must first load a file from S3-compatible storage, base64-encode it, and then push it to the OpenSearch API.

It should be possible to use direct S3 links (e.g., s3://{bucket}/path_to_file.pdf) or provide an S3 key to ingest the file directly.

Describe the solution you'd like

Instead of fetching files from S3 and pushing them to OpenSearch using separate tools (e.g., a Python service):

import boto3
import base64
import requests

def fetch_file_from_s3(bucket_name, file_key):
    # Download the object from S3-compatible storage into memory.
    s3 = boto3.client('s3')
    response = s3.get_object(Bucket=bucket_name, Key=file_key)
    return response['Body'].read()

def push_to_opensearch(index, doc_id, filename, title, data, pipeline):
    # Index the document through the given ingest pipeline; 'data' carries the
    # base64-encoded file content that the pipeline extracts text from.
    url = f"https://localhost:9200/{index}/_doc/{doc_id}?pipeline={pipeline}"
    headers = {'Content-Type': 'application/json'}
    payload = {
        "filename": filename,
        "title": title,
        "data": data
    }
    response = requests.put(url, json=payload, headers=headers)
    return response.status_code, response.text

def main(bucket_name, file_key, index, doc_id, pipeline):
    file_data = fetch_file_from_s3(bucket_name, file_key)
    encoded_data = base64.b64encode(file_data).decode('utf-8')
    
    status_code, response_text = push_to_opensearch(index, doc_id, file_key, 'Dummy PDF', encoded_data, pipeline)
    print(f"Status Code: {status_code}")
    print(f"Response: {response_text}")

if __name__ == "__main__":
    bucket_name = 'my_bucket'
    file_key = 'dummy.pdf'
    index = 'my_index'
    doc_id = '1'
    pipeline = 'file_attachment'
    
    main(bucket_name, file_key, index, doc_id, pipeline)
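
For reference, the file_attachment pipeline assumed by the script above is normally built around the attachment processor from the ingest-attachment plugin. A minimal sketch of creating it, assuming that plugin is installed and the same local endpoint as in the script:

import requests

# Create the 'file_attachment' ingest pipeline used by the script above.
# Assumes the ingest-attachment plugin is installed on the cluster.
pipeline_body = {
    "description": "Extract text and metadata from base64-encoded files",
    "processors": [
        {"attachment": {"field": "data"}}
    ]
}
response = requests.put(
    "https://localhost:9200/_ingest/pipeline/file_attachment",
    json=pipeline_body
)
print(response.status_code, response.text)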

With the requested feature, we could push files to OpenSearch directly by reference:

// PUT https://localhost:9200/my_index/_doc/1?pipeline=s3_ingestion_pipeline
{
  "filename": "dummy.pdf",
  "title": "Dummy PDF",
  "s3_key": "s3://my_bucket/dummy.pdf"
}

The idea is to use a predefined S3 bucket, similar to how snapshot repositories can be configured to use S3 storage.
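
Purely as an illustration of that idea, a hypothetical s3_ingestion_pipeline might be registered roughly as follows. The s3_attachment processor, its repository parameter, and the pre-registered bucket name are all made up for this sketch; no such API exists in OpenSearch today.

import requests

# Hypothetical sketch only: the 's3_attachment' processor and the pre-registered
# ingest bucket do not exist in OpenSearch; this mirrors how snapshot
# repositories reference an S3 bucket.
pipeline_body = {
    "description": "Fetch the file referenced by 's3_key' from a pre-configured bucket",
    "processors": [
        {
            "s3_attachment": {                     # hypothetical processor name
                "repository": "my_ingest_bucket",  # hypothetical pre-registered S3 location
                "field": "s3_key",
                "target_field": "attachment"
            }
        }
    ]
}
requests.put(
    "https://localhost:9200/_ingest/pipeline/s3_ingestion_pipeline",
    json=pipeline_body
)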

Related component

Other

Describe alternatives you've considered

Amazon offers an SQS-powered solution, but it's not available on other platforms like DigitalOcean OpenSearch.

We currently use a small Python service for this purpose. It receives an S3 key, fetches the file, and pushes the content to the OpenSearch cluster.

Additional context

No response

ksanderer added the enhancement and untriaged labels Oct 2, 2024
@github-actions github-actions bot added the Other label Oct 2, 2024
@varunpareek690

  • If we are on AWS, one potential interim solution is using S3 Event triggers (S3 + Lambda) to automatically send data to the Python service whenever a new file is uploaded. This would reduce the manual step of specifying file keys and streamline the ingestion process. However, this might not help in non-AWS environments.

  • For environments like DigitalOcean, exploring existing third-party services for file ingestion that can be hooked into S3-compatible storage solutions might save you from reinventing the wheel.

  • Instead of downloading files from S3-compatible storage to a local service and then uploading them to OpenSearch, stream the file directly to OpenSearch. This would eliminate the need to hold the entire file in memory (see the rough sketch below).
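
A rough sketch of that streaming approach, using the bucket/index/pipeline names from the example above. It assumes the cluster (and any proxy in front of it) accepts chunked request bodies; the base64 encoding is done over complete 3-byte groups so the per-chunk output concatenates into a valid base64 string.

import base64
import json
import boto3
import requests

def stream_file_to_opensearch(bucket, key, index, doc_id, pipeline):
    # Stream the S3 object and base64-encode it chunk by chunk, so the whole
    # file never has to be held in memory; the JSON document is produced by a
    # generator and sent with chunked transfer encoding.
    s3 = boto3.client('s3')
    body = s3.get_object(Bucket=bucket, Key=key)['Body']

    def json_chunks(chunk_size=3 * 64 * 1024):
        # Open the JSON object and start the "data" field.
        yield json.dumps({"filename": key, "title": "Dummy PDF"})[:-1].encode()
        yield b', "data": "'
        buf = b""
        for chunk in body.iter_chunks(chunk_size):
            buf += chunk
            cut = len(buf) - (len(buf) % 3)  # encode only complete 3-byte groups
            yield base64.b64encode(buf[:cut])
            buf = buf[cut:]
        yield base64.b64encode(buf)
        yield b'"}'

    url = f"https://localhost:9200/{index}/_doc/{doc_id}?pipeline={pipeline}"
    response = requests.put(url, data=json_chunks(),
                            headers={'Content-Type': 'application/json'})
    return response.status_code, response.text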

I would like to give it a try!

@ksanderer (Author)

Since we’re already using S3 and OpenSearch, adding a new technology for file ingestion could make things more complicated than necessary. OpenSearch has a file ingestion API, and since S3 is widely used as a modern filesystem, it makes sense to take advantage of this.

By ingesting files directly from S3 URLs, we simplify the process, reduce the need for extra services, and make better use of what we already have in place. This approach is both efficient and scalable, without adding unnecessary complexity to our stack.

@varunpareek690

Hi @ksanderer,

Thank you for the insights! I agree that minimizing complexity is crucial.
While I understand the advantages of using the OpenSearch file ingestion API and ingesting files directly from S3 URLs, I still believe that exploring automation options, such as S3 Event triggers in AWS or integrating third-party services in non-AWS environments, could enhance our workflow. These approaches might streamline the process even further and help reduce manual intervention.

I’m particularly interested in how we can implement streaming directly to OpenSearch, as it could optimize our memory usage and overall efficiency.

@dblock dblock removed the untriaged label Oct 21, 2024

dblock commented Oct 21, 2024

[Catch All Triage - 1, 2]

@varunpareek690

Hi @dblock! What does this comment signify? Can you please explain?


dblock commented Oct 23, 2024

Sorry for the cryptic comment :) Check out opensearch-project/.github#233, does this help?
