Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(filesystem): implement a csv reader with duckdb engine #319

Merged
merged 23 commits into from
Jan 31, 2024

Conversation

IlyaFaer
Copy link
Contributor

@IlyaFaer IlyaFaer commented Jan 12, 2024

Towards #299

A raw version of the source to discuss some details.

Test data in a form of CSV files and folders is required in different buckets. For now, I tested it on my local filesystem:

Untitled

Data from two CSV files is read and saved in the same table. As it works through the filesystem source, all the kinds of storages supported by filesystem are supposed to be working here as well. But as I don't have straight access to the buckets to create test CSV files, I had to use local file system.

@IlyaFaer IlyaFaer requested a review from rudolfix January 12, 2024 08:15
sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IlyaFaer good start. check my comment in read_csv

sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
@IlyaFaer
Copy link
Contributor Author

This PR requires another PR to be merged first: dlt-hub/dlt#906

Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's so cool that passing fsspec file works with duck db! next steps:

  • move it to filesystem source, see review
  • you still do not read in batches really (see review)
  • I think we need both json and arrow option when yielding items (see review :)

sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
sources/csv_reader/__init__.py Outdated Show resolved Hide resolved
@IlyaFaer IlyaFaer marked this pull request as ready for review January 24, 2024 12:42
Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK this is not a WIP anymore - we are almost ready to merge :)

  1. pls fix the review
  2. pls add demo (can be super simple along stream_and_merge_csv)
  3. pls document new reader in filesystem README

sources/filesystem/readers.py Outdated Show resolved Hide resolved
tests/filesystem/test_filesystem.py Outdated Show resolved Hide resolved
@IlyaFaer IlyaFaer changed the title WIP: a prototype of a csv reader with duckdb engine feat(filesystem): implement a csv reader with duckdb engine Jan 25, 2024
@rudolfix
Copy link
Contributor

@IlyaFaer could you also look at this: https://clickhouse.com/docs/en/getting-started/example-datasets/nyc-taxi it is a taxi dataset as zip csv. it is quite large. pls make sure you can ingest it with duckdb reader

  1. make sure we ingest zip streams (I think duckdb detects compression)
  2. you do not need to load it all
  3. you may add this as a duckb example I asked above

sources/filesystem/readers.py Outdated Show resolved Hide resolved
rudolfix
rudolfix previously approved these changes Jan 26, 2024
Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

files are added, but the core PR must be fixed and merged first

@rudolfix rudolfix merged commit 29b86fa into master Jan 31, 2024
14 checks passed
@rudolfix rudolfix deleted the csv_duckdb_reader branch January 31, 2024 03:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants