Hi, I've been brainstorming with @adrianbr about a source connector pulling from S3 and GCS.
I'd like to start a discussion on its specification.
Why this is relevant:
Many data providers deliver raw data or reports to bucket storage. For example, I've seen B2B affiliate performance reports delivered this way a couple of times.
This can be the basis for additional readily configurable source connectors, e.g. for Adjust, which offers a free hourly CSV export that the Adjust customer can configure. I plan to open a follow-up issue for an Adjust connector.
Supported file formats
Shall we start with only CSV and then support Parquet?
File identifiers
Often, data deliveries come as periodically added CSV or Parquet files. I've seen files such as report_from_acme_2023-07-14.csv, report_from_acme_2023-07-15.csv, etc.
The file identifier usually has a fixed part, such as report_from_acme, as well as a variable part, such as a timestamp or a date. Thus, I think the config for this source should include:
- a regexp to filter files and identify the fixed part of the file identifier, so that users can build one pipeline for report_from_acme and another for report_from_foobar.
- a regexp to filter the files that should be loaded per pipeline invocation. E.g., if the pipeline is scheduled daily, it should load the file name that matches current_date(). Alternatively, we could configure the incremental load strategy to load only what hasn't been loaded yet, without parsing the dates or timestamps in the filename. (A config sketch follows this list.)
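A minimal sketch of how those two regexps could be applied, assuming we already have a listing of object keys; the pattern values and the `select_files` helper are illustrative only, not an existing dlt API:

```python
import re
from datetime import date

# Hypothetical config values -- names and patterns are placeholders.
fixed_part_pattern = re.compile(r"^report_from_acme_.*\.csv$")  # one pipeline per report family
invocation_pattern = re.compile(rf"report_from_acme_{date.today().isoformat()}\.csv$")  # today's file only


def select_files(all_keys: list[str]) -> list[str]:
    """Keep only the keys that belong to this pipeline and to the current invocation."""
    return [
        key
        for key in all_keys
        if fixed_part_pattern.match(key) and invocation_pattern.search(key)
    ]


# Example (run on 2023-07-14):
# select_files(["report_from_acme_2023-07-14.csv", "report_from_foobar_2023-07-14.csv"])
# -> ["report_from_acme_2023-07-14.csv"]
```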
Loading method
@adrianbr suggested figuring out incremental processing of the CSV files so that we can load a 2 GB file on a machine with 256 MB of RAM, such as a GitHub Actions runner. I love the idea.
Loading procedure (see the sketch after this list):
- download the source files completely to disk, not into memory
- read the source file in chunks into memory and yield it line by line for a low RAM footprint
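A minimal sketch of that procedure, assuming the file has already been downloaded to a local path; `read_csv_rows` is a hypothetical helper, not an existing dlt function:

```python
import csv
from typing import Iterator


def read_csv_rows(local_path: str) -> Iterator[dict]:
    """Stream a CSV file row by row so that only a small buffer lives in memory.

    csv.DictReader pulls data through the file object's internal buffer,
    so memory usage stays at roughly one row plus that buffer.
    """
    with open(local_path, newline="") as csv_file:
        for row in csv.DictReader(csv_file):
            yield row  # a dlt resource could consume this generator


# Usage (illustrative):
# for record in read_csv_rows("/tmp/report_from_acme_2023-07-14.csv"):
#     ...
```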
Alternative:
Maybe there is a way to estimate the bytes required per row so that we can download files in chunks from S3 to disk and then load them incrementally into memory to parse the CSV into records.
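A rough sketch of the chunked-download part using boto3's Range parameter on get_object; the bucket, key, and chunk size are placeholders:

```python
import boto3


def download_in_chunks(
    bucket: str, key: str, local_path: str, chunk_size: int = 8 * 1024 * 1024
) -> None:
    """Download an S3 object in fixed-size byte ranges instead of one large GET."""
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    with open(local_path, "wb") as out:
        for start in range(0, size, chunk_size):
            end = min(start + chunk_size, size) - 1
            part = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
            out.write(part["Body"].read())
```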
Pandas has a read_csv() method which supports reading from S3/GCS and also supports reading in chunks. However, I would not use it because:
- it would add the heavy Pandas package as a dependency
- it might interfere with the typing done by dlt
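For reference, the Pandas approach I'd rather avoid would look roughly like this; the bucket path is a placeholder, and reading from s3:// URLs requires s3fs to be installed:

```python
import pandas as pd

# Reads the CSV in chunks of 50,000 rows directly from S3 (needs the s3fs package).
for chunk in pd.read_csv("s3://my-bucket/report_from_acme_2023-07-14.csv", chunksize=50_000):
    records = chunk.to_dict(orient="records")  # Pandas has already inferred dtypes here
    ...
```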