feat(filesystem): implement a csv reader with duckdb engine #319
Conversation
@IlyaFaer good start. check my comment in read_csv
This PR requires another PR to be merged first: dlt-hub/dlt#906
it's so cool that passing an fsspec file works with duckdb! next steps:
- move it to the filesystem source (see review)
- you still do not really read in batches (see review)
- I think we need both json and arrow options when yielding items (see review :)
OK, this is not a WIP anymore - we are almost ready to merge :)
- pls fix the review
- pls add a demo (can be super simple, along the lines of stream_and_merge_csv)
- pls document the new reader in the filesystem README
@IlyaFaer could you also look at this: https://clickhouse.com/docs/en/getting-started/example-datasets/nyc-taxi - it is a taxi dataset as a zipped CSV, and it is quite large. pls make sure you can ingest it with the duckdb reader
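Engine aside, ingesting a zipped CSV without extracting it to disk can be sketched with the stdlib. The file and member names here are hypothetical; the actual reader would go through fsspec/duckdb:

```python
import csv
import io
import zipfile
from typing import Dict, Iterator, List


def rows_from_zipped_csv(
    zip_path: str, member: str, batch_size: int = 10_000
) -> Iterator[List[Dict[str, str]]]:
    """Stream row-dict batches from one CSV member of a zip archive.

    The member is decompressed on the fly, so the archive is never fully
    extracted and memory stays bounded by batch_size.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
            batch: List[Dict[str, str]] = []
            for row in reader:
                batch.append(row)
                if len(batch) >= batch_size:
                    yield batch
                    batch = []
            if batch:  # trailing partial batch
                yield batch
```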
LGTM!
files are added, but the core PR must be fixed and merged first
Towards #299
A raw version of the source to discuss some details.
Test data in the form of CSV files and folders is required in different buckets. For now, I tested it on my local filesystem:
Data from two CSV files is read and saved into the same table. Since it works through the filesystem source, all the kinds of storage supported by filesystem should work here as well. But as I don't have direct access to the buckets to create test CSV files, I had to use the local file system.
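The two-files-into-one-table behavior can be illustrated with a plain generator over local paths. This is only a sketch; in the actual source, the paths would come from the filesystem/fsspec listing rather than a local glob:

```python
import csv
import glob
from typing import Dict, Iterator


def read_csv_folder(pattern: str) -> Iterator[Dict[str, str]]:
    """Yield rows from every CSV matching the glob pattern as one stream,
    so a downstream pipeline loads them all into a single table."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            yield from csv.DictReader(f)
```

Because everything is yielded as one stream, the consuming pipeline sees a single resource and writes all rows to one destination table.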