[filesystem] verified source #216
Comments
Thanks for bringing this! We have been researching this topic for some time; the aim is something both easy to use and very customizable. You can also take a look at dlt-hub/dlt#338. I'd break this into two dlt resources.

In dlt you can pipe data from one source to another, so we can join (1) and (2). They can also be used separately: we plan to use (1) to feed data from files into our langchain source, and you can also use (2) standalone and just pass a list of files into it. Btw, I totally agree that the data from files should be read and yielded in batches. IMO the standard csv reader would do (https://docs.python.org/3/library/csv.html); it should accept a file pointer from fsspec. Same for parquet (pyarrow reads in chunks and uses fsspec internally); a minimal sketch of both is below. There are also really cool next steps.
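A minimal sketch of that idea, assuming plain fsspec URLs; the standard library csv module and pyarrow do the actual parsing, and the function names here are illustrative rather than part of dlt:

```python
import csv
from itertools import islice

import fsspec
import pyarrow.parquet as pq


def csv_batches(url: str, batch_size: int = 10_000):
    """Yield lists of dict rows from a CSV opened through fsspec."""
    with fsspec.open(url, mode="rt", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while batch := list(islice(reader, batch_size)):
            yield batch


def parquet_batches(url: str, batch_size: int = 10_000):
    """Yield lists of dict rows from a parquet file; pyarrow streams record batches."""
    with fsspec.open(url, mode="rb") as f:
        for record_batch in pq.ParquetFile(f).iter_batches(batch_size=batch_size):
            yield record_batch.to_pylist()
```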
Btw, I think this kind of source is so fundamental that at some point we'll move it to the core library.
Quick source info
fsspec. Supported file systems: cloud object storage (S3, GCS, ADLS) and local.

Current Status
What source does/will do
This connector loads a series of files from an S3 or GCS bucket. Which files are loaded depends on a configuration that gives a file name pattern as well as the write_disposition.

Why this is relevant
Many data providers deliver raw data or reports to bucket storage in S3 or GCS.
This connector can be the basis for additional, readily configured source connectors that import from buckets. For example, Adjust offers a free hourly CSV export that can be configured by the Adjust customer. I plan to open a follow-up issue for an Adjust connector.
Test account / test data
- dltHub to create a test account before we start
- dltHub after we merge the source

Additional context
This source, like other standard readers, will be used as the first step in a data loading pipeline. In essence, it yields an fsspec instance that behaves like a file object, so the next step in the pipeline can read the file, often piece by piece.

All other common properties of a standard source apply:
- mtime of a file (optional)

dlt core code reuse
dlt-hub/dlt#626 introduces fsspec into the common module of the dlt library. Take things from there; also look at the tests for usage examples.
Requirements
- a dlt.resource which takes the following arguments:
  - bucket_url and credentials arguments as per ... spec
  - filename_filter, which can be a compiled regex/glob expression or a callback function
- lists the files under bucket_url and filters out files with the regex/glob/callback
- obtains an AbstractFileSystem via client_from_config

A hedged sketch of such a resource is shown below.
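A sketch under assumptions, not the final implementation: it uses plain fsspec.filesystem() to build the AbstractFileSystem because I am not assuming the exact signature of client_from_config in dlt core, and the function name filesystem_files is illustrative.

```python
import re
from typing import Callable, Iterator, Optional, Union

import dlt
import fsspec


@dlt.resource(name="filesystem", write_disposition="append")
def filesystem_files(
    bucket_url: str = dlt.config.value,
    credentials: Optional[dict] = dlt.secrets.value,
    filename_filter: Union[re.Pattern, Callable[[str], bool], None] = None,
) -> Iterator[dict]:
    """List files under bucket_url, apply the filter, and yield file items with lazy handles."""
    protocol = bucket_url.split("://", 1)[0] if "://" in bucket_url else "file"
    # the real source would obtain the AbstractFileSystem via client_from_config (dlt-hub/dlt#626)
    fs: fsspec.AbstractFileSystem = fsspec.filesystem(protocol, **(credentials or {}))

    for path in fs.ls(bucket_url, detail=False):
        if isinstance(filename_filter, re.Pattern) and not filename_filter.search(path):
            continue
        if callable(filename_filter) and not filename_filter(path):
            continue
        yield {
            "file_path": path,
            "mtime": fs.modified(path),       # optional state key, see above
            "file_obj": fs.open(path, "rb"),  # downstream readers stream from this handle
        }
```

A downstream step, for example the batch readers sketched in the comments above, would consume file_obj and yield rows in chunks.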
Background / Use Cases
File Identifiers
Often, data deliveries come in periodically added CSV or parquet files. I've seen files such as report_from_acme_2023-07-14.csv, report_from_acme_2023-07-15.csv, etc. The file identifier usually has a fixed part, such as report_from_acme, as well as a variable identifier, such as a timestamp or a date. Thus, I think the config for this source should include both the fixed part and a pattern for the variable part; an example filter is shown below.
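For instance, a hypothetical filename_filter value for the files above could be a compiled regex that pins the fixed part and matches the date suffix:

```python
import re

# matches report_from_acme_2023-07-14.csv, report_from_acme_2023-07-15.csv, ...
acme_reports = re.compile(r"report_from_acme_\d{4}-\d{2}-\d{2}\.csv$")

# hypothetical usage with the resource sketched under Requirements:
# filesystem_files(bucket_url="s3://my-bucket/reports", filename_filter=acme_reports)
```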
Loading method
@adrianbr suggested figuring out incremental processing of the CSV files so that we can load a 2 GB file on a machine with 256 MB of RAM, such as a GitHub Actions runner. I love the idea.
Loading procedure:
Alternative:
Maybe there is a way to guess the bytes required per row so that we can download files in chunks from S3 to disk and then load them incrementally into memory to parse the CSV into records.
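A rough sketch of this alternative, assuming fsspec: block_size caps how many bytes are fetched per remote read, and the csv module parses rows as they stream in, so memory use stays bounded regardless of file size (stream_csv is an illustrative name, not an existing API).

```python
import csv

import fsspec


def stream_csv(url: str, block_size: int = 4 * 1024 * 1024):
    """Parse a remote CSV row by row while fetching it in ~4 MB blocks."""
    fs, path = fsspec.core.url_to_fs(url)
    # text mode wraps the buffered binary file; block_size limits per-request reads
    with fs.open(path, mode="rt", encoding="utf-8", block_size=block_size) as f:
        yield from csv.DictReader(f)
```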
Pandas has a read_csv() method which supports reading from S3/GCS and also supports reading in chunks. However, I would not use it because: