[source] Add bucket storage sources from GCS & S3 #494

Closed
willi-mueller opened this issue Jul 14, 2023 · 2 comments

willi-mueller (Collaborator) commented on Jul 14, 2023:

Hi, I've been brainstorming with @adrianbr about a source connector pulling from S3 and GCS.

I'd like to start a discussion on its specification.

Why this is relevant:

Many data providers deliver raw data or reports to bucket storage. For example, I've seen B2B affiliate performance reports delivered this way a couple of times.

This can be the basis for additional readily configured source connectors, e.g. for Adjust, which offers a free hourly CSV export that the Adjust customer can configure. I plan to open a follow-up issue for an Adjust connector.

Supported file formats

Shall we start with only CSV and then support Parquet?

File identifiers

Often, data deliveries come in periodically added CSV or Parquet files. I've seen files such as
report_from_acme_2023-07-14.csv, report_from_acme_2023-07-15.csv, etc.

The file identifier usually has a fixed part, such as report_from_acme, as well as a variable part, such as a timestamp or a date. Thus, I think the config for this source should include:

  1. a regexp to filter files and identify the fixed part of the file identifier, so that users can build one pipeline for report_from_acme and another for report_from_foobar.
  2. a regexp to filter the files that should be loaded per pipeline invocation. E.g. if the pipeline is scheduled daily, it should load the file whose name matches current_date(). Alternatively, we could configure the incremental load strategy to load only what hasn't been loaded yet, without parsing the dates or timestamps in the filename. (A sketch of both filters follows this list.)
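
To make the two filters concrete, here is a minimal sketch; the pattern names and file names are made up for illustration, not a proposed interface:

```python
import re
from datetime import date

# Hypothetical config values: one pattern fixes the report family,
# the other selects the files for the current pipeline invocation.
FIXED_PART = re.compile(r"^report_from_acme_")
CURRENT_RUN = re.compile(rf"{date.today().isoformat()}\.csv$")


def select_files(all_keys):
    """Return only the keys that belong to this report family and this run."""
    return [
        key for key in all_keys
        if FIXED_PART.search(key) and CURRENT_RUN.search(key)
    ]


# Example listing as it might come back from the bucket:
keys = [
    "report_from_acme_2023-07-14.csv",
    "report_from_acme_2023-07-15.csv",
    "report_from_foobar_2023-07-15.csv",
]
print(select_files(keys))  # only today's acme file, if any
```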

Loading method

@adrianbr suggested figuring out incremental processing of the CSV files so that we can load a 2 GB file on a machine with 256 MB of RAM, such as a GitHub Actions runner. I love the idea.

Loading procedure:

  1. download the source file completely to disk, not into memory
  2. read the file back from disk in chunks and yield it line by line to keep the RAM footprint low (see the sketch after this list)
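
A minimal sketch of that two-step procedure, assuming boto3 and an S3-only source (GCS would need its own client); bucket and key names are placeholders:

```python
import csv
import tempfile

import boto3  # assumption: S3 only in this sketch; GCS would use google-cloud-storage


def stream_rows(bucket: str, key: str):
    """Yield CSV rows one by one without ever holding the whole file in memory."""
    s3 = boto3.client("s3")
    with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
        # Step 1: stream the object to a temporary file on disk, not into memory.
        s3.download_fileobj(bucket, key, tmp)
        tmp.flush()
        # Step 2: read it back line by line; csv.DictReader never loads the full file.
        with open(tmp.name, newline="") as f:
            for row in csv.DictReader(f):
                yield row


# Usage (placeholder names):
# for record in stream_rows("my-bucket", "report_from_acme_2023-07-14.csv"):
#     ...
```

A generator like this could be wrapped in a dlt resource so that rows flow straight into the normalizer instead of being accumulated in memory.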

Alternative:
Maybe there is a way to estimate the bytes required per row so that we can download files in chunks from S3 to disk and then load them incrementally into memory to parse the CSV into records.
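
For the chunked-download part of that alternative, S3 supports ranged GET requests, so a file can be pulled to disk a few MB at a time; a sketch assuming boto3, with an arbitrary chunk size (it does not attempt the bytes-per-row estimation):

```python
import boto3  # assumption: S3 only; GCS offers ranged reads through its own client

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per request, tune to the available RAM


def download_in_chunks(bucket: str, key: str, dest_path: str) -> None:
    """Fetch the object in byte ranges and append each range to a file on disk."""
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    with open(dest_path, "wb") as f:
        for start in range(0, size, CHUNK_SIZE):
            end = min(start + CHUNK_SIZE, size) - 1
            part = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
            f.write(part["Body"].read())
```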

Pandas has a read_csv() method that supports reading from S3/GCS and also supports reading in chunks (sketched below for reference). However, I would not use it because:

  1. it would add Pandas, a heavy dependency
  2. it might interfere with the typing done by dlt?
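
For reference, this is roughly what the rejected pandas approach would look like (requires s3fs for the s3:// URL; bucket and file names are placeholders):

```python
import pandas as pd  # the heavy dependency the points above argue against

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
chunks = pd.read_csv(
    "s3://my-bucket/report_from_acme_2023-07-14.csv",
    chunksize=50_000,
)
for chunk in chunks:
    for record in chunk.to_dict(orient="records"):
        ...  # each record is a plain dict; pandas has already applied its own dtypes
```

The last comment is exactly the concern in point 2: pandas infers dtypes per chunk, which can clash with or pre-empt dlt's own type inference.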
willi-mueller changed the title from "[source] Add Bucket storage sources fro GCS & S3" to "[source] Add bucket storage sources from GCS & S3" on Jul 14, 2023
willi-mueller (Collaborator, Author) commented:

I'm sorry, wrong repo. I reopened it in dlt-hub/verified-sources#216

willi-mueller closed this as not planned on Jul 17, 2023
rudolfix (Collaborator) commented:

@willi-mueller thanks I was just writing you to move it :)
