
Add support for ingesting CSV files #2

Merged
merged 21 commits into from
Oct 31, 2024

Conversation

@daniel-thom daniel-thom commented Oct 3, 2024

This PR adds support for the following:

  • Ingest tables from CSV files in a performant way (through Polars).
  • Write tables to Parquet.

It fixes an assortment of bugs and oversights from the initial version.

It also adds test coverage for most of the code. We are now at 94%.

The Polars bug with sqlalchemy has been fixed; you will now get Polars v1.11.0. Also, sqlalchemy is now pinned to v2.0.35 because duckdb_engine is not yet compatible with v2.0.36.

This fixes read from the database through Polars and sqlalchemy after
recent fixes to the Polars Python package.
@daniel-thom daniel-thom requested review from lixiliu and pesap October 18, 2024 00:27
lixiliu commented Oct 24, 2024

This PR adds support for the following:

  • Ingest tables from CSV files in a performant way (through Polars).
  • Write tables to Parquet.

It fixes an assortment of bugs and oversights from the initial version.

It also adds test coverage for most of the code. We are now at 94%. However, CI will fail because we are depending on the main branch of Polars. It will pass whenever they release v1.9.1 or greater.

Can you describe the bugs you've fixed so we can pay attention as we review the code?

@daniel-thom (Collaborator Author) replied:

This PR adds support for the following:

  • Ingest tables from CSV files in a performant way (through Polars).
  • Write tables to Parquet.

It fixes an assortment of bugs and oversights from the initial version.
It also adds test coverage for most of the code. We are now at 94%. However, CI will fail because we are depending on the main branch of Polars. It will pass whenever they release v1.9.1 or greater.

Can you describe the bugs you've fixed so we can pay attention as we review the code?

At this point the code has widely diverged from the original (which was never reviewed), so it might be best to review all of the code as if it were new.

@daniel-thom daniel-thom force-pushed the feature/ingest-csv branch 3 times, most recently from 76f02f5 to 32c7173 on October 28, 2024 17:14
scripts/perf_tests.py (resolved)
src/chronify/csv_io.py (outdated, resolved)
) -> DuckDBPyRelation:
"""Add a datetime column to the relation."""
# TODO
raise NotImplementedError
Collaborator:

Seems like we'll need two handlers: one without a tz-offset and one with.

src/chronify/models.py (resolved)
src/chronify/models.py (outdated, resolved)
src/chronify/models.py (outdated, resolved)
src/chronify/time.py (resolved)
"If None, timestamps are timezone-naive.",
),
] = None
start: datetime # TODO: what if the time zone is specified here?
Collaborator:

While this is supported in other packages like pandas, IMO it can be ambiguous here: if only an offset is specified, we can only assume the same offset throughout, making it a standard-time assumption.

That said, we can support that as a future feature?

Collaborator Author:

This behavior is now different. There is no longer a time_zone field. The user has to specify the time zone, if any, in this field. Please raise any concerns.
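
To illustrate the new behavior described above, with no separate time_zone field the zone (if any) rides along on `start` itself. A minimal sketch using the standard library:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Timezone-naive start: timestamps carry no zone information.
naive_start = datetime(2024, 1, 1)

# tz-aware start: the time zone is attached to the datetime itself.
aware_start = datetime(2024, 1, 1, tzinfo=ZoneInfo("America/Denver"))

assert naive_start.tzinfo is None
assert aware_start.utcoffset() is not None
```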

src/chronify/time_series_checker.py (outdated, resolved)
@daniel-thom daniel-thom left a comment (Collaborator Author)

I added support for what we discussed last Thursday. Essentially, we will store whatever the user passes: time zones or not. Datetimes are converted to strings when using SQLite, so we have to implement custom code to read them back. I couldn't get anything to work with the sqlalchemy DateTime type with timezone=True.

Now, all of the applicable tests run on DuckDB and SQLite. Spark will come later.
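
The SQLite round-trip described above can be sketched with the standard library alone; this is an illustration of the general pattern, not chronify's actual implementation:

```python
import sqlite3
from datetime import datetime, timezone

# SQLite has no native datetime type, so timestamps are stored as TEXT
# and must be parsed on the way back out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ts (t TEXT)")
stamp = datetime(2024, 10, 31, 12, 0, tzinfo=timezone.utc)
conn.execute("INSERT INTO ts VALUES (?)", (stamp.isoformat(),))

row = conn.execute("SELECT t FROM ts").fetchone()
restored = datetime.fromisoformat(row[0])  # the custom read-back step
assert restored == stamp
```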

pyproject.toml (resolved)
@@ -8,7 +8,3 @@ repos:
args: [ --fix ]
# Run the formatter.
- id: ruff-format
- repo: https://github.com/pre-commit/mirrors-mypy
Collaborator Author:

@pesap My take is to remove it. Running it here doesn't use pyproject.toml. We would have to make other changes. The action for mypy does follow pyproject.toml.

Collaborator:

If you pass `language: system`, it uses the mypy that you have installed in the environment.

Collaborator Author:

But that still doesn't respect the settings in pyproject.toml. With the changes here, if the developer runs mypy in the terminal, the right things will happen, and the CI job does that as a check. The downside is that pre-commit doesn't run mypy; we would have to duplicate the settings in the pre-commit config file.

src/chronify/time_configs.py (resolved)
scripts/perf_tests.py (resolved)
src/chronify/models.py (resolved)
"""Read a database query into a Pandas DataFrame."""
df = pl.read_database(query, connection=conn).to_pandas()
config = schema.time_config
if config.needs_utc_conversion(conn.engine.name):
Collaborator Author:

Note SQLite special cases here and in the next function.

Collaborator:

Do we want to add pandas just to do the datetime conversion? Maybe we can just write a simple function that uses bare Python.

Collaborator Author:

I think we agreed that pandas is part of the public API; it is what the user gets when they call read_table. A NumPy array is not a good option because the table will have mixed types. An Arrow table is an option, but that's not what users want. I'm skeptical that people want a DuckDB relation. Thoughts?

Collaborator:

Pandas is fine with me.
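
The kind of post-read conversion discussed here (guarded by `needs_utc_conversion` in the snippet above) might look roughly like this; the column name and the localize-to-UTC choice are assumptions for the sketch:

```python
import pandas as pd

# SQLite hands back naive timestamps, so localize them to UTC
# before returning the DataFrame to the user.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2024-10-31 00:00", "2024-10-31 01:00"])}
)
df["timestamp"] = df["timestamp"].dt.tz_localize("UTC")
```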

src/chronify/time.py (resolved)
@pesap pesap left a comment (Collaborator)

LGTM. Nothing major to address, only some comments.

scripts/perf_tests.py (resolved)
Comment on lines +10 to +19
def read_csv(path: Path | str, schema: CsvTableSchema, **kwargs: Any) -> DuckDBPyRelation:
"""Read a CSV file into a DuckDB relation."""
if schema.column_dtypes:
dtypes = {x.name: get_duckdb_type_from_sqlalchemy(x.dtype) for x in schema.column_dtypes}
rel = duckdb.read_csv(str(path), dtype=dtypes, **kwargs)
else:
rel = duckdb.read_csv(str(path), **kwargs)

expr = ",".join(rel.columns)
return duckdb.sql(f"SELECT {expr} FROM rel")
Collaborator:

Shall we catch any errors or just let DuckDB throw an error for a badly formatted CSV?

Collaborator Author:

DuckDB should have it covered.

@@ -169,15 +174,15 @@ def get_standard_time(tz: TimeZone) -> TimeZone:


def get_prevailing_time(tz: TimeZone) -> TimeZone:
Collaborator:

Can we add a definition here of what prevailing time is?
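
For context: "prevailing time" generally means local wall-clock time that follows DST transitions, whereas standard time keeps one fixed offset year-round. A quick standard-library illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/Denver")
winter = datetime(2024, 1, 15, 12, tzinfo=tz)  # MST, UTC-7 (standard time)
summer = datetime(2024, 7, 15, 12, tzinfo=tz)  # MDT, UTC-6 (daylight time)

# Prevailing (wall-clock) time follows DST, so the UTC offset
# changes through the year; standard time would stay at UTC-7.
assert winter.utcoffset() != summer.utcoffset()
```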

conn.commit()
self.update_table_schema()

def read_query(self, query: Selectable | str, schema: TableSchema) -> pd.DataFrame:
Collaborator Author:

There is currently an unfortunate limitation here. You have to pass the schema because SQLite is forcing us to perform custom logic on the time columns. We don't know which columns are time columns yet. We need a second table to store that metadata. I propose that we defer that to a subsequent PR.
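
The proposed follow-up (a second table recording which columns are time columns) might look roughly like this; the table and column names here are hypothetical:

```python
import sqlite3

# A metadata table mapping each user table to its time column, so a
# future read_query would not need the schema passed in by the caller.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_metadata (table_name TEXT, time_column TEXT)")
conn.execute("INSERT INTO schema_metadata VALUES ('devices', 'timestamp')")

row = conn.execute(
    "SELECT time_column FROM schema_metadata WHERE table_name = ?", ("devices",)
).fetchone()
```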

@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@daniel-thom daniel-thom merged commit ac69a9e into main Oct 31, 2024
6 checks passed
@daniel-thom daniel-thom deleted the feature/ingest-csv branch October 31, 2024 21:08