[source] Add bucket storage sources from GCS & S3 #494

Closed
willi-mueller opened this issue Jul 14, 2023 · 2 comments

willi-mueller (Collaborator) commented on Jul 14, 2023:

Hi, I've been brainstorming with @adrianbr about a source connector pulling from S3 and GCS.

I'd like to start a discussion on its specification.

Why this is relevant:

Many data providers deliver raw data or reports to bucket storage. For example, I've seen B2B affiliate performance reports delivered this way a couple of times.

This can be the basis for additional readily configured source connectors, e.g. for Adjust, which offers a free hourly CSV export that the Adjust customer can configure. I plan to open a follow-up issue for an Adjust connector.

Supported file formats

Shall we start with only CSV and then support Parquet?

File identifiers

Often, data deliveries come in periodically added CSV or Parquet files. I've seen files such as
report_from_acme_2023-07-14.csv, report_from_acme_2023-07-15.csv, etc.

The file identifier usually has a fixed part, such as report_from_acme, as well as a variable part, such as a timestamp or a date. Thus, I think the config for this source should include:

  1. a regexp to filter files and identify the fixed part of the file identifier, so that users can build one pipeline for report_from_acme and another for report_from_foobar.
  2. a regexp to filter the files that should be loaded per pipeline invocation. E.g. if the pipeline is scheduled daily, it should load the file whose name matches current_date(). Alternatively, we could configure the incremental load strategy to load only what hasn't been loaded yet, without parsing the dates or timestamps in the filename. (A sketch of both filters follows this list.)
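
To make the two filters concrete, here is a minimal sketch; the pattern names and file names are made up for illustration, not a proposed interface:

```python
import re
from datetime import date

# Hypothetical config values: one pattern fixes the report family,
# the other selects the files for the current pipeline invocation.
FIXED_PART = re.compile(r"^report_from_acme_")
CURRENT_RUN = re.compile(rf"{date.today().isoformat()}\.csv$")


def select_files(all_keys):
    """Return only the keys that belong to this report family and this run."""
    return [
        key for key in all_keys
        if FIXED_PART.search(key) and CURRENT_RUN.search(key)
    ]


# Example listing as it might come back from the bucket:
keys = [
    "report_from_acme_2023-07-14.csv",
    "report_from_acme_2023-07-15.csv",
    "report_from_foobar_2023-07-15.csv",
]
print(select_files(keys))  # only today's acme file, if any
```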

Loading method

@adrianbr suggested figuring out incremental processing of the CSV files so that we can load a 2 GB file on a machine with 256 MB of RAM, such as a GitHub Actions runner. I love the idea.

Loading procedure:

  1. download the source file completely to disk, not into memory
  2. read the file back from disk in chunks and yield it line by line to keep the RAM footprint low (see the sketch after this list)
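
A minimal sketch of that two-step procedure, assuming boto3 and an S3-only source (GCS would need its own client); bucket and key names are placeholders:

```python
import csv
import tempfile

import boto3  # assumption: S3 only in this sketch; GCS would use google-cloud-storage


def stream_rows(bucket: str, key: str):
    """Yield CSV rows one by one without ever holding the whole file in memory."""
    s3 = boto3.client("s3")
    with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
        # Step 1: stream the object to a temporary file on disk, not into memory.
        s3.download_fileobj(bucket, key, tmp)
        tmp.flush()
        # Step 2: read it back line by line; csv.DictReader never loads the full file.
        with open(tmp.name, newline="") as f:
            for row in csv.DictReader(f):
                yield row


# Usage (placeholder names):
# for record in stream_rows("my-bucket", "report_from_acme_2023-07-14.csv"):
#     ...
```

A generator like this could be wrapped in a dlt resource so that rows flow straight into the normalizer instead of being accumulated in memory.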

Alternative:
Maybe there is a way to estimate the bytes required per row so that we can download files in chunks from S3 to disk and then load them incrementally into memory to parse the CSV into records.
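
For the chunked-download part of that alternative, S3 supports ranged GET requests, so a file can be pulled to disk a few MB at a time; a sketch assuming boto3, with an arbitrary chunk size (it does not attempt the bytes-per-row estimation):

```python
import boto3  # assumption: S3 only; GCS offers ranged reads through its own client

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per request, tune to the available RAM


def download_in_chunks(bucket: str, key: str, dest_path: str) -> None:
    """Fetch the object in byte ranges and append each range to a file on disk."""
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    with open(dest_path, "wb") as f:
        for start in range(0, size, CHUNK_SIZE):
            end = min(start + CHUNK_SIZE, size) - 1
            part = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
            f.write(part["Body"].read())
```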

Pandas has a read_csv() method that supports reading from S3/GCS and also supports reading in chunks (sketched below for reference). However, I would not use it because:

  1. it would add Pandas, a heavy dependency
  2. it might interfere with the typing done by dlt?
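
For reference, this is roughly what the rejected pandas approach would look like (requires s3fs for the s3:// URL; bucket and file names are placeholders):

```python
import pandas as pd  # the heavy dependency the points above argue against

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
chunks = pd.read_csv(
    "s3://my-bucket/report_from_acme_2023-07-14.csv",
    chunksize=50_000,
)
for chunk in chunks:
    for record in chunk.to_dict(orient="records"):
        ...  # each record is a plain dict; pandas has already applied its own dtypes
```

The last comment is exactly the concern in point 2: pandas infers dtypes per chunk, which can clash with or pre-empt dlt's own type inference.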
willi-mueller changed the title from "[source] Add Bucket storage sources fro GCS & S3" to "[source] Add bucket storage sources from GCS & S3" on Jul 14, 2023
willi-mueller (Collaborator, Author) commented:

I'm sorry, wrong repo. I reopened it in dlt-hub/verified-sources#216

willi-mueller closed this as not planned on Jul 17, 2023
rudolfix (Collaborator) commented:

@willi-mueller thanks I was just writing you to move it :)
