
Add support for ingesting CSV files #2

Merged
merged 21 commits into from
Oct 31, 2024

Conversation

@daniel-thom daniel-thom commented Oct 3, 2024

This PR adds support for the following:

  • Ingest tables from CSV files in a performant way (through Polars).
  • Write tables to Parquet.

It fixes an assortment of bugs and oversights from the initial version.

It also adds test coverage for most of the code. We are now at 94%.

The Polars bug with sqlalchemy has been fixed; you will now get Polars v1.11.0. Also, sqlalchemy is now pinned to v2.0.35 because duckdb_engine is not yet compatible with v2.0.36.

This fixes read from the database through Polars and sqlalchemy after
recent fixes to the Polars Python package.
@daniel-thom daniel-thom requested review from lixiliu and pesap October 18, 2024 00:27
lixiliu commented Oct 24, 2024

This PR adds support for the following:

  • Ingest tables from CSV files in a performant way (through Polars).
  • Write tables to Parquet.

It fixes an assortment of bugs and oversights from the initial version.

It also adds test coverage for most of the code. We are now at 94%. However, CI will fail because we are depending on the main branch of Polars. It will pass whenever they release v1.9.1 or greater.

Can you describe the bugs you've fixed so we can pay attention as we review the code?

@daniel-thom (Collaborator Author) replied:

This PR adds support for the following:

  • Ingest tables from CSV files in a performant way (through Polars).
  • Write tables to Parquet.

It fixes an assortment of bugs and oversights from the initial version.
It also adds test coverage for most of the code. We are now at 94%. However, CI will fail because we are depending on the main branch of Polars. It will pass whenever they release v1.9.1 or greater.

Can you describe the bugs you've fixed so we can pay attention as we review the code?

At this point the code has widely diverged from the original (which was never reviewed), so it might be best to review all of the code as if it were new.

@daniel-thom daniel-thom force-pushed the feature/ingest-csv branch 3 times, most recently from 76f02f5 to 32c7173 on October 28, 2024 17:14
scripts/perf_tests.py (resolved)
src/chronify/csv_io.py (outdated, resolved)
) -> DuckDBPyRelation:
"""Add a datetime column to the relation."""
# TODO
raise NotImplementedError
Collaborator:

Seems like we'll need two handlers: one without a tz-offset and one with.

src/chronify/models.py (resolved)
src/chronify/models.py (outdated, resolved)
src/chronify/models.py (outdated, resolved)
src/chronify/time.py (resolved)
"If None, timestamps are timezone-naive.",
),
] = None
start: datetime # TODO: what if the time zone is specified here?
Collaborator:

While this is supported in other packages like pandas, IMO it can be ambiguous here: if only an offset is specified, we can only assume the same offset throughout, making it a standard-time assumption.

That said, we can support that as a future feature?

Collaborator Author:

This behavior is now different. There is no longer a time_zone field. The user has to specify the time zone, if any, in this field. Please raise any concerns.
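
To illustrate the new behavior described above, with no separate time_zone field the zone (if any) rides along on `start` itself. A minimal sketch using the standard library:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Timezone-naive start: timestamps carry no zone information.
naive_start = datetime(2024, 1, 1)

# tz-aware start: the time zone is attached to the datetime itself.
aware_start = datetime(2024, 1, 1, tzinfo=ZoneInfo("America/Denver"))

assert naive_start.tzinfo is None
assert aware_start.utcoffset() is not None
```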

src/chronify/time_series_checker.py (outdated, resolved)
@daniel-thom daniel-thom left a comment (Collaborator Author)

I added support for what we discussed last Thursday. Essentially, we will store whatever the user passes: time zones or not. Datetimes are converted to strings when using SQLite, so we have to implement custom code to read them back. I couldn't get anything to work with the sqlalchemy DateTime type with timezone=True.

Now, all of the applicable tests run on DuckDB and SQLite. Spark will come later.
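
The SQLite round-trip described above can be sketched with the standard library alone; this is an illustration of the general pattern, not chronify's actual implementation:

```python
import sqlite3
from datetime import datetime, timezone

# SQLite has no native datetime type, so timestamps are stored as TEXT
# and must be parsed on the way back out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ts (t TEXT)")
stamp = datetime(2024, 10, 31, 12, 0, tzinfo=timezone.utc)
conn.execute("INSERT INTO ts VALUES (?)", (stamp.isoformat(),))

row = conn.execute("SELECT t FROM ts").fetchone()
restored = datetime.fromisoformat(row[0])  # the custom read-back step
assert restored == stamp
```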

pyproject.toml (resolved)
@@ -8,7 +8,3 @@ repos:
args: [ --fix ]
# Run the formatter.
- id: ruff-format
- repo: https://github.com/pre-commit/mirrors-mypy
Collaborator Author:

@pesap My take is to remove it. Running it here doesn't use pyproject.toml. We would have to make other changes. The action for mypy does follow pyproject.toml.

Collaborator:

If you pass `language: system`, it uses the mypy that you have installed in the environment.

Collaborator Author:

But that still doesn't respect the settings in pyproject.toml. With the changes here, if the developer runs mypy in the terminal, the right things will happen, and the CI job does that as a check. The downside is that pre-commit doesn't run mypy; we would have to duplicate the settings in the pre-commit config file.

src/chronify/time_configs.py (resolved)
scripts/perf_tests.py (resolved)
src/chronify/models.py (resolved)
"""Read a database query into a Pandas DataFrame."""
df = pl.read_database(query, connection=conn).to_pandas()
config = schema.time_config
if config.needs_utc_conversion(conn.engine.name):
Collaborator Author:

Note SQLite special cases here and in the next function.

Collaborator:

Do we want to add pandas just to do the datetime conversion? Maybe we can just write a simple function that uses bare Python.

Collaborator Author:

I think we agreed that pandas is part of the public API; it is what the user gets when they call read_table. A NumPy array is not a good option because the table will have mixed types. An Arrow table is an option, but that's not what users want. I'm skeptical that people want a DuckDB relation. Thoughts?

Collaborator:

Pandas is fine with me.
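
The kind of post-read conversion discussed here (guarded by `needs_utc_conversion` in the snippet above) might look roughly like this; the column name and the localize-to-UTC choice are assumptions for the sketch:

```python
import pandas as pd

# SQLite hands back naive timestamps, so localize them to UTC
# before returning the DataFrame to the user.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2024-10-31 00:00", "2024-10-31 01:00"])}
)
df["timestamp"] = df["timestamp"].dt.tz_localize("UTC")
```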

src/chronify/time.py (resolved)
@pesap pesap left a comment (Collaborator)

LGTM. Nothing major to address, only some comments.

scripts/perf_tests.py (resolved)
Comment on lines +10 to +19
def read_csv(path: Path | str, schema: CsvTableSchema, **kwargs: Any) -> DuckDBPyRelation:
"""Read a CSV file into a DuckDB relation."""
if schema.column_dtypes:
dtypes = {x.name: get_duckdb_type_from_sqlalchemy(x.dtype) for x in schema.column_dtypes}
rel = duckdb.read_csv(str(path), dtype=dtypes, **kwargs)
else:
rel = duckdb.read_csv(str(path), **kwargs)

expr = ",".join(rel.columns)
return duckdb.sql(f"SELECT {expr} FROM rel")
Collaborator:

Shall we catch any errors or just let DuckDB throw an error for a badly formatted CSV?

Collaborator Author:

DuckDB should have it covered.

@@ -169,15 +174,15 @@ def get_standard_time(tz: TimeZone) -> TimeZone:


def get_prevailing_time(tz: TimeZone) -> TimeZone:
Collaborator:

Can we add a definition here of what prevailing time is?
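
For context: "prevailing time" generally means local wall-clock time that follows DST transitions, whereas standard time keeps one fixed offset year-round. A quick standard-library illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/Denver")
winter = datetime(2024, 1, 15, 12, tzinfo=tz)  # MST, UTC-7 (standard time)
summer = datetime(2024, 7, 15, 12, tzinfo=tz)  # MDT, UTC-6 (daylight time)

# Prevailing (wall-clock) time follows DST, so the UTC offset
# changes through the year; standard time would stay at UTC-7.
assert winter.utcoffset() != summer.utcoffset()
```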

conn.commit()
self.update_table_schema()

def read_query(self, query: Selectable | str, schema: TableSchema) -> pd.DataFrame:
Collaborator Author:

There is currently an unfortunate limitation here. You have to pass the schema because SQLite is forcing us to perform custom logic on the time columns. We don't know which columns are time columns yet. We need a second table to store that metadata. I propose that we defer that to a subsequent PR.
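
The proposed follow-up (a second table recording which columns are time columns) might look roughly like this; the table and column names here are hypothetical:

```python
import sqlite3

# A metadata table mapping each user table to its time column, so a
# future read_query would not need the schema passed in by the caller.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_metadata (table_name TEXT, time_column TEXT)")
conn.execute("INSERT INTO schema_metadata VALUES ('devices', 'timestamp')")

row = conn.execute(
    "SELECT time_column FROM schema_metadata WHERE table_name = ?", ("devices",)
).fetchone()
```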

@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@daniel-thom daniel-thom merged commit ac69a9e into main Oct 31, 2024
6 checks passed
@daniel-thom daniel-thom deleted the feature/ingest-csv branch October 31, 2024 21:08